How to improve quality of answers in Question Answering

Given a summary of about 1,000 tokens, what is the best way to validate whether the answers generated by Question Answering are correct?
We ran a test by submitting about 5,000 prompts to davinci, requesting an answer to a question relative to the summary. We found that about 78% of the answers were acceptable, while the others were either wrong or irrelevant. We used semantic search to filter out the irrelevant answers, but still have about 8% of answers that are relevant but incorrect.
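The semantic-search step described above can be sketched roughly as follows. This is a hypothetical illustration: `embed` here is a toy bag-of-words vector so the example is self-contained, and in practice it would be replaced by a real embedding model, with the threshold tuned on held-out data.

```python
from collections import Counter
import math

def embed(text):
    # Toy stand-in for a real embedding model: lower-cased word counts.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def is_relevant(answer, summary, threshold=0.2):
    # Keep an answer only if it is semantically close to the source summary.
    return cosine(embed(answer), embed(summary)) >= threshold
```

Note that a filter like this only catches irrelevant answers; it cannot tell a relevant-but-incorrect answer from a correct one, which is exactly the remaining 8% problem.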

Any suggestions on what other processing we can do to filter out the remaining 8% of incorrect answers without human validation?


Hey Nelson! To make sure I’m understanding, is the goal to generate an answer to a question based on information in a summary?

Hi Luke, thanks for responding. That is correct.
If you will be kind enough to try this example, you will see what I am talking about…

  "prompt": "Santa Claus, also known as Father Christmas, Saint Nicholas, Saint Nick, Kris Kringle, or simply Santa, is a legendary character originating in Western Christian culture who is said to bring gifts on Christmas Eve of toys and candy to well-behaved children, and either coal or nothing to naughty children. He is said to accomplish this with the aid of Christmas elves, who make the toys in his workshop at the North Pole, and flying reindeer who pull his sleigh through the air.\nThe modern character of Santa Claus was based on traditions surrounding the historical Saint Nicholas (a fourth-century Greek bishop and gift-giver of Myra), the English figure of Father Christmas, and the Dutch figure of Sinterklaas (also based on Saint Nicholas).\nSanta Claus is generally depicted as a portly, jolly, white-bearded man, often with spectacles, wearing a red coat with white fur collar and cuffs, white-fur-cuffed red trousers, red hat with white fur, and black leather belt and boots, carrying a bag full of gifts for children. He is commonly portrayed as laughing in a way that sounds like \"ho ho ho\". This image became popular in the United States and Canada in the 19th century due to the significant influence of the 1823 poem \"A Visit from St. Nicholas\". Caricaturist and political cartoonist Thomas Nast also played a role in the creation of Santa's image. This image has been maintained and reinforced through song, radio, television, children's books, films, and advertising. \n\nQ: Was Thomas Nast an advertiser?\nA:",
  "temperature": 0,
  "max_tokens": 100,
  "top_p": 0.25,
  "frequency_penalty": 0,
  "presence_penalty": 0,
  "stop": ["\n"]

The answer I got back was “Yes, he was”, which is wrong.
I think the main problem is that Thomas Nast played a role in creating Santa’s image, and the image was later used for advertising.
It’s understandable that GPT-3 won’t get every answer correct, but if I can run another fact check and remove the incorrect answers, then that will resolve my problem.
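One way to implement the fact check mentioned above is a second model pass that asks whether the passage itself supports the answer, an entailment-style check. The sketch below is hypothetical: `complete` is a pluggable stand-in for an actual model call, and the verification wording is illustrative, not a tested prompt.

```python
def is_supported(passage, question, answer, complete):
    # Ask the model (via the supplied `complete` function) whether the
    # passage directly supports the answer; drop answers that fail.
    prompt = (
        f"{passage}\n\n"
        f"Q: {question}\nA: {answer}\n\n"
        "Does the passage above directly state this answer? Reply yes or no.\n"
        "Reply:"
    )
    verdict = complete(prompt).strip().lower()
    return verdict.startswith("yes")
```

A check like this is not foolproof (the same model can repeat its own mistake), but running verification as a separate yes/no question often catches answers that rely on a loose association, like the Nast-advertising example.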


Facts are always based on assumptions, so in a way you’re getting GPT-3’s version of the facts, which is really always going to be an opinion. In cases like this, it’s useful to probe how GPT-3 arrived at its answer. As usual, the argument it uses is quite an interesting one, and I’m willing to buy GPT-3’s version of the facts in this case. Here’s what I got at the settings you used when I went on to ask it about the answer:
Q: Was Thomas Nast an advertiser?
A: Yes, he was.

Q: Why was Thomas Nast an advertiser?
A: Because he was a cartoonist.

Q: Are all cartoonists advertisers?
A: No, not all.

Q: Then what makes a cartoonist an advertiser?
A: The desire to make money.

Q: So is every cartoonist that gets paid for their work an advertiser? Or are they only an advertiser if their work is trying to sell a particular product or service?
A: It’s the latter.

Q: What particular product or service was Thomas Nast trying to sell?
A: Santa Claus.

Q: It sounds like you consider Santa Claus to be a product or service. Why is Santa a product or service?
A: Because he’s a brand.


I liked the “Santa Claus is a brand” answer… I’d never thought of it quite that way, and I sadly realize it’s true.


Which model are you using? I find davinci instruct is best for these sorts of ‘closed domain’ Q&A tasks, where all of the relevant information required to answer the question is in the prompt. I’ve had some success with this example, although I haven’t tested rigorously: OpenAI API.
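Since the linked example isn’t visible here, a rough sketch of what such a closed-domain prompt can look like is below. The exact wording is illustrative, not the wording from the linked example; the key idea is instructing the model to answer only from the passage and to say “I don’t know” otherwise.

```python
def closed_domain_prompt(passage, question):
    # Instruct the model to answer strictly from the passage, with an
    # explicit "I don't know" escape hatch for unanswerable questions.
    return (
        "Answer the question using only the passage below. "
        "If the passage does not contain the answer, say \"I don't know\".\n\n"
        f"Passage: {passage}\n\n"
        f"Q: {question}\nA:"
    )
```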

Let me know how it goes!


Hi Luke,
Thanks for providing the sample for closed domain.
I really like how it replies “I don’t know” when the probability is not high enough.
I’ll do further testing on this, much appreciated.
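The probability check Nelson mentions can also be applied on your side, if the API response includes per-token log probabilities. The sketch below is an assumption about how one might use them (the threshold is illustrative and would need tuning): average the token log-probs, and fall back to “I don’t know” when the model’s confidence is low.

```python
import math

def answer_or_unknown(text, token_logprobs, min_avg_prob=0.5):
    # Convert the mean token log-probability back to a probability and
    # fall back to "I don't know" when the model's confidence is low.
    if not token_logprobs:
        return "I don't know"
    avg_prob = math.exp(sum(token_logprobs) / len(token_logprobs))
    return text if avg_prob >= min_avg_prob else "I don't know"
```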