Evaluating the effectiveness of text generation

I’m using GPT-3 to generate text based on a Q&A dataset (the data is domain-specific, scraped from various internal company sources). The challenge I am facing is that the quality of the output is somewhat subjective.

This makes it hard to improve the model output. I’ve easily been able to move beyond outputting gibberish to something which works reasonably well. However, I am finding it hard to evaluate the effectiveness of minor model changes (e.g. temperature, prompt design, tweaks to the dataset, etc.).

I’m considering ‘crowd-sourcing’ input from my colleagues, giving them model output (with various tweaks) and asking them to score the results (roughly sketched below). However, this has obvious limitations!
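For concreteness, here is roughly what I have in mind: shuffle the outputs and hide which settings produced them, so colleagues rate blind. This is just a minimal sketch; the variant names, rating scale, and file names are placeholders.

    import csv, random

    # Hypothetical: each variant is a label plus the outputs it produced
    # for the same set of questions.
    variants = {
        "temp_0.7": ["answer A1", "answer A2"],
        "temp_0.3": ["answer B1", "answer B2"],
    }

    # Flatten and shuffle so raters can't tell which variant they're scoring.
    rows = []
    for name, outputs in variants.items():
        for text in outputs:
            rows.append({"variant": name, "output": text})
    random.shuffle(rows)

    # Rating sheet for colleagues (blind).
    with open("rating_sheet.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "output", "score_1_to_5"])
        writer.writeheader()
        for i, row in enumerate(rows):
            writer.writerow({"id": i, "output": row["output"], "score_1_to_5": ""})

    # Keep the id -> variant mapping separately so scores can be unblinded later.
    with open("answer_key.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "variant"])
        writer.writeheader()
        for i, row in enumerate(rows):
            writer.writerow({"id": i, "variant": row["variant"]})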

So, I was wondering if there are techniques that people have developed that make it easier to fine-tune models where the output has a subjective quality?

Welcome to the problem of synthetic datasets!

The solution is to think of it like a GAN:

  1. Generate a synthetic dataset with one set of prompts/fine-tunes
  2. Check on the quality of that dataset with another set of prompts/fine-tunes
  3. Rinse and repeat as necessary (a rough sketch of this loop follows).
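Very roughly, steps 1–3 as code. The helper functions and the 0.7 threshold are placeholders for whatever generator and grader prompts/fine-tunes you end up using; nothing here is a real API call.

    # Hypothetical helpers: each wraps a different prompt or fine-tuned model.
    def generate_answer(question: str) -> str:
        raise NotImplementedError  # your "generator" prompt/fine-tune

    def grade_answer(question: str, answer: str) -> float:
        raise NotImplementedError  # your "grader" prompt/fine-tune, 0..1

    def build_dataset(questions, score_threshold=0.7):
        """Generate, grade, and keep only the answers the grader likes."""
        kept, rejected = [], []
        for q in questions:
            a = generate_answer(q)
            score = grade_answer(q, a)
            (kept if score >= score_threshold else rejected).append((q, a, score))
        # "Rinse and repeat": inspect the rejects, adjust the generator
        # prompt/fine-tune, and rerun until the kept set is clean enough.
        return kept, rejected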

I need to do that with my Core Objective Functions project, but I’m still trying to figure out step 2. To be fair, I haven’t put much thought into it because I’ve been working on other projects. I tend to bounce between projects, working on whatever I’m feeling most inspired about until something comes along to move the needle on another project. Your question here might have just solved that for me.

Here’s what the prompt in step 2 might look like:

This is a grading exercise. Check the quality of the following answers.

Example 1:
[Input sample]
Answer 1:
[Output sample]
Grade 1:
[Manually write out a grading output]

Example 2:
[Input sample]
Answer 2:
[Output sample]
Grade 2:
[Manually write out a grading output]

<<etc>>
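Here is a sketch of how that prompt might be assembled and sent, assuming the (pre-1.0) openai Python client that was current at the time. The engine name, max_tokens, and the few-shot example content are placeholders; you’d use real graded samples from your own domain.

    import os
    import openai  # pip install openai (older, pre-1.0 client assumed)

    openai.api_key = os.environ["OPENAI_API_KEY"]

    # Placeholder few-shot block; substitute manually graded examples.
    FEW_SHOT = """This is a grading exercise. Check the quality of the following answers.

    Example 1:
    How do I reset my VPN password?
    Answer 1:
    Open the IT portal and click "Forgot password".
    Grade 1:
    Good - direct, correct, and actionable.
    """

    def grade(question: str, answer: str) -> str:
        prompt = (
            FEW_SHOT
            + f"\nExample 2:\n{question}\nAnswer 2:\n{answer}\nGrade 2:\n"
        )
        response = openai.Completion.create(
            engine="davinci",       # or a fine-tuned model
            prompt=prompt,
            max_tokens=60,
            temperature=0.0,        # keep grading as deterministic as possible
            stop=["\nExample"],     # stop before it invents a new example
        )
        return response["choices"][0]["text"].strip()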

I have achieved very good results on subjective scoring by using few-shot examples. Here’s an earlier experiment where I generated an output with one prompt and then measured the quality of the output with other prompts: Raven Context Augmentation Demo - YouTube

(back then, I called this “context augmentation”).
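To tie this back to the original question: once the grader returns numeric scores, evaluating a minor tweak (temperature, prompt wording, dataset changes) reduces to comparing average scores over the same set of questions. A minimal, hypothetical sketch:

    # Hypothetical: generate_a/generate_b are two generator settings,
    # score(question, answer) is a numeric grader (e.g. built from grade() above).
    def compare_settings(questions, generate_a, generate_b, score):
        totals = {"A": 0.0, "B": 0.0}
        for q in questions:
            totals["A"] += score(q, generate_a(q))
            totals["B"] += score(q, generate_b(q))
        n = len(questions)
        return {name: total / n for name, total in totals.items()}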
