I’ve built this (a code sketch follows the list):
- Go through the entire book, building a list of snippets.
- Upload the list of snippets to GPT-3.
- Take a user question.
- Run Semantic Search to get the top 3 matches.
- Run Completion to generate an answer based on the top matches.
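For concreteness, here’s roughly what that pipeline looks like in code. It’s a sketch, not my exact implementation: the original Search endpoint has since been retired, so the search step is approximated below with embeddings and cosine similarity, and the snippet texts, model names, and example question are placeholders.

```python
import numpy as np
import openai  # pre-1.0 client; assumes openai.api_key is set

def embed(texts, model="text-embedding-ada-002"):
    """Embed a list of texts (stand-in for the retired Search endpoint)."""
    resp = openai.Embedding.create(input=texts, model=model)
    return np.array([d["embedding"] for d in resp["data"]])

# 1. Go through the book and build a list of snippets (placeholders here).
snippets = ["<snippet 1 from the book>", "<snippet 2>", "<snippet 3>"]

# 2. "Upload" the snippets, i.e. embed them once and keep the vectors.
snippet_vectors = embed(snippets)

# 3. Take a user question.
question = "What toothbrush does the author recommend?"

# 4. Semantic search: rank snippets by cosine similarity, keep the top 3.
q = embed([question])[0]
scores = snippet_vectors @ q / (
    np.linalg.norm(snippet_vectors, axis=1) * np.linalg.norm(q)
)
top = [snippets[i] for i in np.argsort(scores)[::-1][:3]]

# 5. Completion: generate an answer based on the top matches.
prompt = (
    "Answer the question using only the excerpts below. "
    "If the excerpts don't contain the answer, say so.\n\n"
    + "\n\n".join(top)
    + f"\n\nQuestion: {question}\nAnswer:"
)
completion = openai.Completion.create(
    model="text-davinci-003", prompt=prompt, max_tokens=150, temperature=0
)
print(completion["choices"][0]["text"].strip())
```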
I abandoned the project:
The problem is that you can never know when an answer is actually based on content from the book or when it is a pure confabulation of the model’s internal parameters.
As you can see in my outline above, I’m not even using the “answers” endpoint. That’s because the “answers” endpoint doesn’t let you engineer the prompt used for completion. By using the completions endpoint instead, I can tweak the prompt: I can, for instance, tell the model explicitly not to make stuff up, or prepend a few sample completions.
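To give you an idea, a prompt in that style looks roughly like the template below. The wording and the example Q&A pairs are placeholders, not my exact prompt; `top` and `question` come from the sketch above.

```python
# Hypothetical prompt template: an explicit "don't make things up" instruction,
# plus a couple of sample completions before the actual question.
PROMPT_TEMPLATE = """You answer questions about a book, using only the excerpts provided.
If the excerpts do not contain the answer, reply exactly: "The book doesn't say."

Excerpts:
{snippets}

Q: <an example question whose answer appears in the excerpts>
A: <the answer, quoted from the excerpts>

Q: <an example question the excerpts don't cover>
A: The book doesn't say.

Q: {question}
A:"""

prompt = PROMPT_TEMPLATE.format(snippets="\n\n".join(top), question=question)
```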
So I’ve done all that. And I’ve run tests with different settings (temperature, etc.). But I just couldn’t solve the problem.
Basically, the only way to eliminate confabulation is to reduce the model’s flexibility to the point where it outputs the snippet found by search verbatim. At that point you don’t need a generation step at all; you’ve basically built a semantic search engine. And since cost forces you to build it with one of the smallest models, ada or babbage, there’s really no reason to use GPT-3 at all.
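In code, that fallback is trivial: it’s the first sketch with the completion call removed, returning the retrieved snippets verbatim (embed() is the helper from that sketch).

```python
def search_only(question, snippets, snippet_vectors, k=3):
    """Degenerate 'answer': return the top-k snippets verbatim, no generation step."""
    q = embed([question])[0]
    scores = snippet_vectors @ q / (
        np.linalg.norm(snippet_vectors, axis=1) * np.linalg.norm(q)
    )
    return [snippets[i] for i in np.argsort(scores)[::-1][:k]]
```

That is, in effect, what the app shown in the P.S. below has been reduced to.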
And as soon as you make the model flexible enough that it can transform the top matching snippet into an actual answer to the question that was asked, you introduce the risk of confabulation.
The thing is:
We want to use AI to reduce bullshit on the web – not to add to it. And if GPT-3 gives you three answers from a book that are spot on, and then simply makes up the next one with no basis at all, you’ve got a very dangerous system. You just can’t trust it.
I ran it on the book “Mouth Care Comes Clean” by Dr. Ellie Phillips. I used that one because I had read it, so I was able to judge whether a response was likely to be a good answer.
Sometimes, the answers were really enlightening. And it was super-cool to be able to just ask questions to the book!
But then, for instance, I’d “ask the book” what toothbrush the author recommends. And the model spat out some brand name. I don’t remember what it was - “Oral B Sensitive Plus”, or something. I went back to the book and searched for those terms, and there was literally nothing in the book that came even close. So the model had simply made that answer up.
I was really hopeful when I saw that OpenAI had introduced an “answers” endpoint. I thought they had solved that problem. But, last time I checked, the “answers” endpoint suffers from exactly the same problem. Maybe even worse, because you don’t have control over the prompt design.
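For comparison, a call to the “answers” endpoint looked roughly like the snippet below. I’m reconstructing the parameter names from the old docs, so treat them as approximate; the point is that everything is a structured field, and there’s no free-form prompt to engineer.

```python
# Rough shape of the (since-retired) Answers endpoint. Note: no prompt parameter.
response = openai.Answer.create(
    search_model="ada",            # model used for the search step
    model="davinci",               # model used to generate the answer
    question="What toothbrush does the author recommend?",
    documents=snippets,            # the book snippets from the sketch above
    examples_context="<a short passage of context>",
    examples=[["<an example question>", "<its answer>"]],
    max_tokens=100,
    stop=["\n"],
)
print(response["answers"][0])
```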
There’s a reason that most successful GPT-3 apps to date are in the realm of marketing:
You just can’t trust that the output is based in truth.
If anybody finds a solution to this problem, please share it. That would really take GPT-3 to the next level, and make it useful for apps that rely on truth.
P.S. Here’s a screenshot of my app. As you can see, I have disabled the “Completion” step altogether and have the app simply output the top 3 matches from semantic search.
Since it’s not the kind of “Ask the Book” app I envisioned, I’ve never published it.