Scoring results

How do you “score” completions, and re-run if necessary?

I use GPT-3 to create personalized sales emails. I’d like to check for low-hanging fruit, such as whether the completion includes the fictional data from the prompt. I’d also like to learn what higher-level checks I could run: Is the “reading level” too high? Is the grammar off? Is the sentiment negative?
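For the low-hanging fruit, a few of these checks can be done without another API call. Here’s a minimal sketch in plain Python: it flags completions that leak fictional phrases from the prompt, and computes a rough Flesch reading-ease score with a crude syllable heuristic. The threshold and the phrase list are placeholders you’d tune for your own emails.

```python
import re

def count_syllables(word: str) -> int:
    # Crude heuristic: count vowel groups; good enough for a rough score.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text: str) -> float:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    # Standard Flesch reading-ease formula; higher = easier to read.
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))

def passes_checks(completion: str,
                  forbidden_phrases: list[str],
                  min_reading_ease: float = 50.0) -> bool:
    # Low-hanging fruit: did the model copy fictional data from the prompt?
    lowered = completion.lower()
    if any(p.lower() in lowered for p in forbidden_phrases):
        return False
    # Re-run if the email reads at too high a grade level (low ease score).
    if flesch_reading_ease(completion) < min_reading_ease:
        return False
    return True
```

Grammar and sentiment checks would need a real library (or a second model pass), but this kind of gate is enough to decide whether to re-run the completion.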

That way, I could re-run the completion if the first one doesn’t look quite right. Are any of you doing this in your apps?


You can take the first output and feed it into a second prompt that checks for your specified criteria. That prompt could rate the email using specified labels like “Low-Quality-Output” and “High-Quality-Output”; if one of those labels is detected in the result, a script could automatically discard or send the email. Not sure if that would work, but it could be worth a try…
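The routing half of that idea is just string matching on the rating completion. A sketch, assuming your second prompt instructs the model to answer with one of the two labels from the suggestion above (the fallback label is my own addition):

```python
def route_email(rating_completion: str) -> str:
    """Decide what to do with an email based on a second-pass rating.

    Assumes the rating prompt told the model to reply with one of the
    literal labels "High-Quality-Output" or "Low-Quality-Output".
    """
    text = rating_completion.strip()
    if "High-Quality-Output" in text:
        return "send"
    if "Low-Quality-Output" in text:
        return "discard"
    # Model answered with neither label: punt to a human.
    return "flag-for-review"
```

Having an explicit third path matters in practice, since the model won’t always emit one of the exact labels you asked for.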


Guess-and-check is one method, as others have stated. Another is rigorous experimentation: zero-shot vs. few-shot, testing more robust prompts, fine-tuning parameters, etc.

If you’re getting too much variance in performance, you might have the temperature or top_p set too high. You might also just not have a good prompt. Transformers prove GIGO to be a law.
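For what it’s worth, here are the sampling settings I’d start from when I want repeatable drafts; the exact numbers are a guess to tune against your own outputs, not a recommendation from the docs:

```python
# Starting point for low-variance completions (values are illustrative).
CONSISTENT_SAMPLING = {
    "temperature": 0.3,  # lower = less variance between re-runs
    "top_p": 1.0,        # conventional advice: tune temperature OR top_p, not both
    "max_tokens": 300,   # enough for a short sales email
}
```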

You’ll also want to carefully control any inputs or variables you use to populate your prompts. Semi-structured prompts tend to work best: somewhere between numbered lists and natural-language paragraphs.
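By “semi-structured” I mean something like this: labeled fields up top, then a short natural-language instruction. The field names here are hypothetical, the layout is the point:

```python
# Hypothetical fields for a sales-email prompt; labeled fields followed
# by a plain-language instruction is the "semi-structured" middle ground.
PROMPT_TEMPLATE = """\
Company: {company}
Contact: {contact_name}
Product: {product}
Tone: friendly, professional

Write a short sales email to {contact_name} at {company}
introducing {product}. Keep it under 120 words.
"""

def build_prompt(company: str, contact_name: str, product: str) -> str:
    # Populating the template from controlled inputs keeps prompts consistent.
    return PROMPT_TEMPLATE.format(
        company=company, contact_name=contact_name, product=product
    )
```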


The salesperson has to actually send it through their email composer - they often edit before sending.

I was thinking about that too; I’m wondering about a more automatic way.
