Evaluating the effectiveness of text generation

I’m using GPT-3 to generate text based on a Q&A dataset (the data is domain-specific, scraped from various internal company sources). The challenge I am facing is that the quality of the output is somewhat subjective.

This makes it hard to improve the model output. I’ve easily been able to move beyond outputting gibberish to something which works reasonably well. However, I am finding it hard to evaluate the effectiveness of minor model changes (e.g. temperature, prompt design, tweaks to the dataset, etc.).

I’m considering ‘crowd-sourcing’ input from my colleagues, giving them model output (with various tweaks) and asking them to score the results (roughly sketched below). However, this has obvious limitations!
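For concreteness, here is roughly what I have in mind: shuffle the outputs and hide which settings produced them, so colleagues rate blind. This is just a minimal sketch; the variant names, rating scale, and file names are placeholders.

    import csv, random

    # Hypothetical: each variant is a label plus the outputs it produced
    # for the same set of questions.
    variants = {
        "temp_0.7": ["answer A1", "answer A2"],
        "temp_0.3": ["answer B1", "answer B2"],
    }

    # Flatten and shuffle so raters can't tell which variant they're scoring.
    rows = []
    for name, outputs in variants.items():
        for text in outputs:
            rows.append({"variant": name, "output": text})
    random.shuffle(rows)

    # Rating sheet for colleagues (blind).
    with open("rating_sheet.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "output", "score_1_to_5"])
        writer.writeheader()
        for i, row in enumerate(rows):
            writer.writerow({"id": i, "output": row["output"], "score_1_to_5": ""})

    # Keep the id -> variant mapping separately so scores can be unblinded later.
    with open("answer_key.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "variant"])
        writer.writeheader()
        for i, row in enumerate(rows):
            writer.writerow({"id": i, "variant": row["variant"]})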

So, I was wondering if there are techniques that people have developed that make it easier to fine-tune models where the output has a subjective quality?

Welcome to the problem of synthetic datasets!

The solution is to think of it like a GAN:

  1. Generate a synthetic dataset with one set of prompts/fine-tunes
  2. Check on the quality of that dataset with another set of prompts/fine-tunes
  3. Rinse and repeat as necessary (a rough sketch of this loop follows).
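Very roughly, steps 1–3 as code. The helper functions and the 0.7 threshold are placeholders for whatever generator and grader prompts/fine-tunes you end up using; nothing here is a real API call.

    # Hypothetical helpers: each wraps a different prompt or fine-tuned model.
    def generate_answer(question: str) -> str:
        raise NotImplementedError  # your "generator" prompt/fine-tune

    def grade_answer(question: str, answer: str) -> float:
        raise NotImplementedError  # your "grader" prompt/fine-tune, 0..1

    def build_dataset(questions, score_threshold=0.7):
        """Generate, grade, and keep only the answers the grader likes."""
        kept, rejected = [], []
        for q in questions:
            a = generate_answer(q)
            score = grade_answer(q, a)
            (kept if score >= score_threshold else rejected).append((q, a, score))
        # "Rinse and repeat": inspect the rejects, adjust the generator
        # prompt/fine-tune, and rerun until the kept set is clean enough.
        return kept, rejected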

I need to do that with my Core Objective Functions project, but I’m still trying to figure out step 2. To be fair, I haven’t put much thought into it because I’ve been working on other projects. I tend to bounce between projects, working on whatever I’m feeling most inspired about until something comes along to move the needle on another project. Your question here might have just solved that for me.

Here’s what the prompt in step 2 might look like:

This is a grading exercise. Check the quality of the following answers.

Example 1:
[Input sample]
Answer 1:
[Output sample]
Grade 1:
[Manually write out a grading output]

Example 2:
[Input sample]
Answer 2:
[Output sample]
Grade 2:
[Manually write out a grading output]

<<etc>>
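Here is a sketch of how that prompt might be assembled and sent, assuming the (pre-1.0) openai Python client that was current at the time. The engine name, max_tokens, and the few-shot example content are placeholders; you’d use real graded samples from your own domain.

    import os
    import openai  # pip install openai (older, pre-1.0 client assumed)

    openai.api_key = os.environ["OPENAI_API_KEY"]

    # Placeholder few-shot block; substitute manually graded examples.
    FEW_SHOT = """This is a grading exercise. Check the quality of the following answers.

    Example 1:
    How do I reset my VPN password?
    Answer 1:
    Open the IT portal and click "Forgot password".
    Grade 1:
    Good - direct, correct, and actionable.
    """

    def grade(question: str, answer: str) -> str:
        prompt = (
            FEW_SHOT
            + f"\nExample 2:\n{question}\nAnswer 2:\n{answer}\nGrade 2:\n"
        )
        response = openai.Completion.create(
            engine="davinci",       # or a fine-tuned model
            prompt=prompt,
            max_tokens=60,
            temperature=0.0,        # keep grading as deterministic as possible
            stop=["\nExample"],     # stop before it invents a new example
        )
        return response["choices"][0]["text"].strip()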

I have achieved very good results on subjective scoring by using few-shot examples. Here’s an earlier experiment where I generated an output with one prompt and then measured the quality of the output with other prompts: Raven Context Augmentation Demo - YouTube

(back then, I called this “context augmentation”).
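To tie this back to the original question: once the grader returns numeric scores, evaluating a minor tweak (temperature, prompt wording, dataset changes) reduces to comparing average scores over the same set of questions. A minimal, hypothetical sketch:

    # Hypothetical: generate_a/generate_b are two generator settings,
    # score(question, answer) is a numeric grader (e.g. built from grade() above).
    def compare_settings(questions, generate_a, generate_b, score):
        totals = {"A": 0.0, "B": 0.0}
        for q in questions:
            totals["A"] += score(q, generate_a(q))
            totals["B"] += score(q, generate_b(q))
        n = len(questions)
        return {name: total / n for name, total in totals.items()}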
