Coming up against GPT-3's limits

When I first tried GPT-3 in the Playground a couple of weeks ago, I was shocked and amazed at how good the answers seemed to be.

But now I’m starting to see just how limited it really is. For example, I asked it to recite the first few lines of Macbeth, Act 2, Scene 1. It confabulated massively.

Then I gave it a few of the actual lines and asked it what came next. Again, total confabulation.

I’m surprised at this because I thought that, amongst its 175 billion parameters, it would have at least one copy of Shakespeare’s works.

Has anyone else been disappointed with completions after going even a little bit beyond the superficial?

What model and what settings were you using?


Standard Playground settings.

Do you have experience with how large language models work? All it does is predict what should come next based on the prompt that you give it. It’s moving more toward being an information resource, but at this number of parameters, it can falter and hallucinate.

Also, play with the settings. A 0.7 temperature will “confabulate”, as you say. Repetition becomes a problem at lower temperatures, but the output might be more “fact-based” for your type of prompts.
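For what it’s worth, here’s a minimal sketch of how you could compare temperatures outside the Playground, using the Python client against the (legacy) Completions endpoint. The engine name, API key, and token limit are just placeholder examples:

```python
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

prompt = "Recite the first 10 lines of Act 2, Scene 1 of Shakespeare's Macbeth\n\n"

# The Playground default (0.7): more varied output, more likely to confabulate.
creative = openai.Completion.create(
    engine="davinci",   # base GPT-3 model, as an example
    prompt=prompt,
    max_tokens=200,
    temperature=0.7,
)

# Near-zero temperature: closer to the single most probable continuation.
# More "fact-based", but prone to repeating itself.
literal = openai.Completion.create(
    engine="davinci",
    prompt=prompt,
    max_tokens=200,
    temperature=0.0,
)

print(creative.choices[0].text)
print(literal.choices[0].text)
```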

All that said, it does have limits. Compared to Markov chains and what came before, things have been moving quite quickly, in my humble opinion.


@gc I understand how it can feel like a limitation when the model doesn’t recite Macbeth on command, but let me offer some food for thought that might help frame your observation in a different light.

The first morsel to gnaw on is the fact that yes, surely GPT-3 has been trained on Shakespeare, just as it has been trained on everything else from legal text (my area of interest) to Reddit threads. But consider that its exposure to the Bard’s works has come in many forms in addition to the pure scripts… think critical essays, parodies, excerpts, and probably even “fan fiction” (if such a thing exists for plays). This introduces a few wrinkles into requests to “recite” a work word-for-word, because the true “version” of the text you’ve crystallized in your own mind may not be the version of the text that GPT-3 finds when it searches the millions of references to Macbeth in its long-term memory. In the legal world this is an even bigger issue, because laws change over time and across jurisdictions, and the models are not necessarily trained to ask follow-up questions in response to requests that are broad or imprecise (e.g. “what’s the law on capital punishment?” OK, but where? When? According to whom?)

The second tidbit to munch on is that GPT-3 is a natural language model. It works primarily via “completions”, meaning it will continue a prompt in the way that its training would suggest is appropriate. This doesn’t necessarily translate into perfect recall, partly because of the idea discussed above that its “idea” of an appropriate way to “complete” your prompt may differ from yours, and it necessarily has to make assumptions, which can be “wrong” in the sense that it interpreted your words differently than you meant them. The other reason I suspect this doesn’t translate into perfect recall is related to a third point, which is that I believe GPT-3 produces content based more on concepts and framework than on accuracy.

This last point is purely my own speculation, but it’s informed speculation based on hundreds of hours of experimenting. I like to think of GPT-3’s neural pathways as channels in the “groove” sense of the word, and those channels have been constructed to process concepts in a semantically valid way. This leads to results that can truly be delightful (‘write a criminal law statute in the style of Shakespeare’), but it also works against the idea of regurgitation, since the pathways are architected more for output that “sounds” right than for output that stands up to objective scrutiny. This can certainly be maddening, and it introduces an element of mistrust when you realize the model is going to package what is sometimes complete BS in an eloquent, reasonable-sounding manner.

If you’re interested in this last idea, I’d encourage you to read a post by Mike @ OpenAI on ways to improve the accuracy of GPT-3’s results, including by using microservices trained as highly specialized fact checkers. I’ve started to use microservices (Mike calls them “discriminators”; I like to say “baby AIs”) in a regulatory interpretation context, and the results so far are encouraging. Here’s the post.
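To give a flavor of the pattern, here’s a rough sketch (my own made-up illustration, not Mike’s actual implementation; in practice the discriminator would be a fine-tuned model rather than a plain prompt):

```python
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

def generate_answer(question: str) -> str:
    """Ask the general-purpose model for a draft answer."""
    response = openai.Completion.create(
        engine="davinci",
        prompt=f"Q: {question}\nA:",
        max_tokens=150,
        temperature=0.7,
        stop=["\nQ:"],
    )
    return response.choices[0].text.strip()

def discriminator(question: str, answer: str) -> bool:
    """A "baby AI" whose only job is to judge the draft answer.

    Sketched here as a plain yes/no prompt; a real pipeline would use
    a model fine-tuned as a highly specialized fact checker.
    """
    prompt = (
        f"Question: {question}\n"
        f"Proposed answer: {answer}\n"
        "Is the proposed answer factually accurate? Answer yes or no:"
    )
    response = openai.Completion.create(
        engine="davinci",
        prompt=prompt,
        max_tokens=1,
        temperature=0.0,  # we want a deterministic judgment
    )
    return response.choices[0].text.strip().lower().startswith("yes")

question = "Which character opens Act 2, Scene 1 of Macbeth?"
answer = generate_answer(question)
print(answer if discriminator(question, answer) else "Low confidence: flag for human review.")
```

The point is separation of concerns: one model drafts freely, and a second, narrowly trained model does nothing but vet the draft.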

Anyways, continue to experiment and don’t become too disillusioned just yet. There are lots of things that GPT-3 can’t yet “do”, but there are also some very good methods for overcoming those limitations; they aren’t necessarily intuitive, but they are possible, and they end up being a very sweet dessert.


Hi all, thanks for the responses. Yes, I understand well the basis for GPT-3’s completions: essentially a combination of vector similarity and probability.

So, in the case of something as well documented as Shakespeare’s Macbeth, I was expecting even zero-shot queries to come up with pretty accurate completions. There is, after all, only one play written by Shakespeare called Macbeth, and although various modern interpretations may slightly alter the prose, it remains fundamentally unaltered.

But this is what I get back with the Playground’s standard settings:

Recite the first 10 lines of Act 2, Scene 1 of Shakespeare’s Macbeth

Lennox.
So far, and what success?

Messenger.
The king is almost wounded to the death;
And, in the fortune of my lord your master,
Methinks, is not forgot.

Lennox.
Away, good fellow,
And I’ll remember thee for my good service.

Messenger.
I had almost forgot; the queen, my lord,
Desires your presence and request to know
The cause of your abrupt departure from her.

Lennox.
Fare thee well:
I must attend his majesty’s command;
To whom I am now in ward, evermore in subjection.

[Exit MESSENGER]

Ross.
What’s the business,
Lennox, that you and the other lords
Should look so sad? The King hath on him such a countenance

This isn’t in the least bit accurate, which I find surprising.

I still think that GPT-3 is extraordinary - I’m just a little surprised to have come up against its limitations quite so soon.