Just released a new framework/dataset for critiquing GPT-3 output

mbforbes · July 6, 2021, 9:53pm

Hi all,

I thought I would share our project here in case it might help or inspire someone working with GPT-3.

We came up with an error analysis framework and annotated about a thousand GPT-3 generations with it.

People use 10 categories to mark bad spans (sequences of text) in generations. For example, here’s one paragraph we annotated ourselves:

We wrote a paper with way more details, including how these errors change as you make models bigger—from GPT-2 to GPT-3 and even human-authored text. We go into some detail about how the kinds of errors change when you vary GPT-3’s top-p, temperature, and frequency penalty, which might interest folks as I know there’s a lot of discussion about decoding configurations.

We also released the dataset and annotation tool for free, which you can get from the project webpage.

(Apologies if this comes across as advertise-y; I’ve mostly been lurking on the forum so far, but it seemed like this would be right up some folks’ alley!)

I’m happy to answer any questions I can.

Max

NimbleBooksLLC · July 7, 2021, 9:54pm

This is very helpful. I have spent a lot of time editing GPT-3 output recently for a forthcoming novel written using GPT-3 by @MarcStrassman and I can verify that GPT-3’s lack of consistency generates a lot of minor errors.

I’ve also noticed a few odd tics, for example it uses the & as shorthand for “and” far more than humans do in writing (almost everyone knows to spell this out for publication). Not sure why this would be so – maybe & is overweighted in the corpus thanks to all the code and slang!

PaulBellow · July 8, 2021, 5:11am

For fiction, I’ve noticed that the prompt is important. One typo or wrong word and GPT-3 will happily emulate that “style”… Also, I’ve found it helpful to prepend a bit about characters (wish I could do more)… just so GPT-3 doesn’t introduce new ones if I don’t want them to… If you have the characters in the scene that’s serving as the prompt, it’s not as bad at making up new people. The old adage is more relevant than ever - Garbage in / Garbage out…

We need to start a thread for fiction writers!

mbforbes · July 8, 2021, 4:52pm

@m-a.schenk you understand correctly, and great question. We pondered this (prompt engineering) a lot, but ended up deciding not to go down the rabbit hole, even though I agree it’s super important in the long run. The main reason was simply budget constraints: annotating errors for the different models and decoding parameters (top-p, temp., freq. penalty) already gave us so many configurations (and we didn’t even use presence penalty!). Adding multiple engineered prompts to the mix would have scaled up the cost by another multiplicative factor based on how many we chose. There were a couple of other reasons too: one could argue we should then have done prompt engineering for the other models as well (though this is less studied for the GPT-2 era), and prompt engineering for GPT-3 is still such a moving target.

For the second point, I’m encountering a syntactic ambiguity with “a small number of prompts are repeated”: whether you mean we took just a few prompts and repeated them, or whether the prompts were mostly unique, but only a few were repeated. In any case, the latter is true: almost all of the prompts are unique. I think either way would be fine here. Repeating the prompts controls for another variable, but at the cost of exploring less of the “space” (of language), and with the risk that an annotator sees the same prompt multiple times and enters a “comparative” mode if they remember a previous continuation. But I think it’s a good idea, and would also be worth studying!

Yes, the annotation tool (and dataset with annotations) is released on the project webpage linked above! This morning we packaged up the tool to make it easier to download, so there should be a new link at the top.

Thanks so much for your questions, let me know if I can answer more or clarify anything!
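
As a rough illustration of the multiplicative cost described above, here is a minimal Python sketch. Every value in it (the model labels, the parameter settings, and the number of prompt variants) is hypothetical and chosen only to show the arithmetic; it is not the grid used in the paper.

```python
from itertools import product

# Hypothetical settings, for illustration only (not the paper's actual grid).
MODELS = ["gpt2-small", "gpt2-xl", "gpt3"]   # placeholder model labels
TOP_P = [0.4, 0.9]
TEMPERATURE = [0.7, 1.0]
FREQUENCY_PENALTY = [0.0, 1.0]


def count_configs(n_prompt_variants: int) -> int:
    """Count annotation configurations for a given number of prompt variants."""
    grid = product(
        MODELS,
        TOP_P,
        TEMPERATURE,
        FREQUENCY_PENALTY,
        range(n_prompt_variants),
    )
    return sum(1 for _ in grid)


baseline = count_configs(1)        # one prompt style per passage
with_variants = count_configs(3)   # three engineered prompt styles

print(baseline, with_variants)
```

Under these made-up settings, going from one prompt style to three triples the number of configurations to annotate, which is the multiplicative factor in question.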

mbforbes · July 8, 2021, 4:56pm

@PaulBellow super interesting observations! I wonder whether there’s a way of priming the number of characters GPT-3 will use in a passage, without being so explicit that it goes into “playwright mode” or something. E.g., you introduce 3 characters and don’t want it to introduce any more, or you introduce 1 and want it to add 1 more, etc.

Also, I kind of think of you as a GPT-3 celebrity since you always appear in the community digest OpenAI sends out

mbforbes · July 8, 2021, 4:58pm

@NimbleBooksLLC woah, a GPT-3 novel… and totally know where you’re coming from with the minor errors. They can be tough to spot, too, if they’re “commonsense”-style errors rather than grammar or typo issues.

PaulBellow · July 8, 2021, 9:54pm

Moon Wars was written with GPT-3 assistance…

PaulBellow · July 14, 2021, 7:03am

More a hermit than a celebrity haha but thanks. The idea for me is to loop it and keep it going for longer works. Even double or triple what we have now would work wonders.
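
One way to picture the "loop it" idea above is a rolling-context loop: feed the tail of the text so far back in as the next prompt. The sketch below is only an assumption about how that might look. `extend_story`, `fake_generate`, and the `context_chars` limit are hypothetical names, and `generate` stands in for whatever completion call you actually use.

```python
from typing import Callable


def extend_story(
    seed: str,
    generate: Callable[[str], str],
    rounds: int,
    context_chars: int = 2000,
) -> str:
    """Grow a text by repeatedly prompting with its most recent characters."""
    text = seed
    for _ in range(rounds):
        # Only the tail fits in the context window, so earlier text is dropped.
        prompt = text[-context_chars:]
        text += generate(prompt)
    return text


def fake_generate(prompt: str) -> str:
    """Stand-in for a real completion call; reports the prompt length it saw."""
    return f" [{len(prompt)}]"


story = extend_story("abc", fake_generate, rounds=2, context_chars=4)
print(story)
```

The catch with a loop like this is that anything outside the tail (character names, earlier plot points) is invisible to the model, which is why a larger window would help.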

Also, with the new fine-tuning, I wonder if authors with long series (12+ novels??) can fine-tune curie and get a good assistant for that particular series. Would the fine-tuned GPT-3 Curie pick up on world lore, etc?

Interesting times to be a writer!
