Just released a new framework/dataset for critiquing GPT-3 output

Hi all,

I thought I would share our project here in case it might help or inspire someone working with GPT-3.

We came up with an error analysis framework and annotated about a thousand GPT-3 generations with it.

People use 10 categories to mark bad spans (sequences of text) in generations. For example, here’s one paragraph we annotated ourselves:

We wrote a paper with way more details, including how these errors change as you make models bigger—from GPT-2 to GPT-3 and even human-authored text. We go into some detail about how the kinds of errors change when you vary GPT-3’s top-p, temperature, and frequency penalty, which might interest folks as I know there’s a lot of discussion about decoding configurations.

We also released the dataset and annotation tool for free, which you can get from the project webpage.

(Apologies if this comes across as advertise-y. I have mostly been lurking on the forum so far, but it seemed like this would be right up some folks’ alley!)

I’m happy to answer any questions I can.



This is very helpful. I have spent a lot of time recently editing GPT-3 output for a forthcoming novel written using GPT-3 by @MarcStrassman, and I can verify that GPT-3’s lack of consistency generates a lot of minor errors.

I’ve also noticed a few odd tics. For example, it uses & as shorthand for “and” far more than humans do in writing (almost everyone knows to spell this out for publication). Not sure why this would be so; maybe & is overweighted in the corpus thanks to all the code and slang!


For fiction, I’ve noticed that the prompt is important. One typo or wrong word and GPT-3 will happily emulate that “style”… Also, I’ve found it helpful to prepend a bit about characters (wish I could do more)… just so GPT-3 doesn’t introduce new ones if I don’t want it to… If you have the characters in the scene that’s serving as the prompt, it’s not as bad about making up new people. The old adage is more relevant than ever - Garbage in / Garbage out…
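In case it helps anyone trying the "prepend a bit about characters" trick, here is a rough sketch of what that could look like. The function name, the "Characters in this scene:" header, and the example character are all made up for illustration; this is just one way to lay out such a prompt, not a tested recipe.

```python
from typing import Dict


def build_prompt(characters: Dict[str, str], scene: str) -> str:
    """Put a short character roster in front of the scene text.

    The header wording and layout are assumptions for illustration only.
    """
    # One line per character: "- Name: short description"
    roster = "\n".join(f"- {name}: {desc}" for name, desc in characters.items())
    # Strip stray whitespace from the scene so the prompt ends cleanly.
    return f"Characters in this scene:\n{roster}\n\n{scene.strip()}\n"


if __name__ == "__main__":
    # Hypothetical character and scene, just to show the resulting layout.
    print(build_prompt({"Mara": "a cautious pilot"}, "Mara checked the gauges."))
```

Since typos in the prompt tend to get imitated, it is probably worth proofreading the roster as carefully as the scene itself.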

We need to start a thread for fiction writers!


@m-a.schenk you understand correctly, and great question. We pondered this (prompt engineering) a lot, but ended up deciding not to go down the rabbit hole, even though I agree it’s super important in the long run. The main reason was simply budget constraints: annotating errors for the different models and decoding parameters (top-p, temp., freq. penalty) already gave us so many configurations (and we didn’t even use presence penalty!), and adding multiple engineered prompts to the mix would have multiplied the cost by however many prompts we chose. There were a few other reasons: one could argue we should then have done prompt engineering for the other models as well (though this is less studied for the GPT-2 era), and prompt engineering for GPT-3 is still such a moving target.
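To make the multiplicative cost concrete, here is a toy sketch of a full grid. Every value in it (the model names, the parameter settings, the number of prompt variants) is a made-up placeholder and not a setting from the paper; it only shows how each extra axis multiplies the number of configurations to annotate.

```python
from itertools import product

# Hypothetical placeholder values, NOT the settings used in the paper.
models = ["model-a", "model-b", "model-c"]
top_p_values = [0.4, 0.9]
temperatures = [0.7, 1.0]
frequency_penalties = [0.0, 1.0]


def count_configurations(num_prompt_variants: int) -> int:
    """Count the cells of a full grid over models, decoding settings, and prompts."""
    grid = product(
        models,
        top_p_values,
        temperatures,
        frequency_penalties,
        range(num_prompt_variants),
    )
    return sum(1 for _ in grid)


if __name__ == "__main__":
    # One prompt style versus three engineered prompt styles.
    print(count_configurations(1), count_configurations(3))
```

With these placeholder values, going from one prompt style to three triples the number of cells, and every cell needs its own batch of annotated generations.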

For the second point, I’m encountering a syntactic ambiguity with “a small number of prompts are repeated” :slight_smile: It could mean that we took just a few prompts and repeated them, or that the prompts were mostly unique and only a few were repeated. In any case, the latter is true: almost all of the prompts are unique. I think either way would be fine here: repeating the prompts controls for another variable, but at the cost of exploring less of the “space” (of language), and with the risk that an annotator sees the same prompt multiple times and slips into a “comparative” mode if they remember a previous continuation. But I think it’s a good idea, and would also be worth studying!

Yes, the annotation tool (and dataset with annotations) is released on the project webpage linked above! This morning we packaged up the tool to make it easier to download, so there should be a new link at the top.

Thanks so much for your questions, let me know if I can answer more or clarify anything!


@PaulBellow super interesting observations! I wonder whether there’s a way of priming the number of characters GPT-3 will use in a passage, without being so explicit that it goes into “playwright mode” or something. E.g., you introduce 3 characters and don’t want it to introduce any more, or you introduce 1 and want it to add 1 more, etc.

Also, I kind of think of you as a GPT-3 celebrity since you always appear in the community digest OpenAI sends out :joy:


@NimbleBooksLLC woah, a GPT-3 novel… and totally know where you’re coming from with the minor errors. They can be tough to spot, too, if they’re “commonsense”-style errors rather than grammar or typo issues.


Moon Wars was written with GPT-3 assistance…


More a hermit than a celebrity haha but thanks. The idea for me is to loop it and keep it going for longer works. Even double or triple what we have now would work wonders.
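Roughly what I mean by looping, as a sketch only: keep feeding the tail of the story back in as the next prompt. The `generate` argument is a stand-in for whatever call you use to get a continuation (it is a hypothetical callable, not a real API), and counting words instead of tokens is a simplification.

```python
from typing import Callable, List


def continue_story(
    seed: str,
    generate: Callable[[str], str],
    rounds: int,
    max_context_words: int,
) -> str:
    """Extend a story by repeatedly prompting with only its most recent words.

    `generate` is a hypothetical placeholder for a text-completion call.
    Real limits are measured in tokens; words are used here for simplicity.
    """
    words: List[str] = seed.split()
    for _ in range(rounds):
        # Only the tail of the story fits in the prompt.
        context = " ".join(words[-max_context_words:])
        words.extend(generate(context).split())
    return " ".join(words)
```

The obvious downside is that anything that falls out of the window (characters, plot threads) is invisible to the next round, which is why a bigger window would help so much.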

Also, with the new fine-tuning, I wonder if authors with long series (12+ novels??) can fine-tune Curie and get a good assistant for that particular series. Would the fine-tuned GPT-3 Curie pick up on world lore, etc.?

Interesting times to be a writer!
