Response accuracy and retrieval accuracy

Does anyone know what an acceptable response and retrieval accuracy might be for DAVINCI? For example, if I ask it 100 questions and it answers 80 of them correctly, is 80% an industry-standard benchmark for these models? Same question for the retrieval part: if it retrieves the correct context 90 times out of 100, is 90% an acceptable industry-standard number?

I think it depends on the domain of knowledge your questions are coming from.

For general “well known” things, I would expect near 100% accuracy, since it’s trained on the internet, books, etc. — any domain where “wisdom of crowds” is the correct answer.

For your own domain knowledge, specific facts, etc., it could be as bad as 0%.

For technical knowledge, like STEM (science, technology, engineering, math), it can also be really bad.

If it’s your own domain knowledge, it’s best to use embeddings to retrieve your context and let the LLM answer from your knowledge.
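A minimal sketch of that retrieval step, assuming you’ve already computed embedding vectors for your documents (the vectors below are made-up toy values; in practice you’d get them from an embeddings endpoint):

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, docs):
    """Return the document whose embedding is closest to the query.

    `docs` is a list of (text, embedding) pairs; the embeddings are
    assumed to come from the same model as `query_vec`.
    """
    return max(docs, key=lambda d: cosine_similarity(query_vec, d[1]))[0]

# Toy example with hand-made 3-d "embeddings"
docs = [
    ("Our refund policy lasts 30 days.", [0.9, 0.1, 0.0]),
    ("The office is closed on Fridays.", [0.0, 0.8, 0.2]),
]
context = retrieve([0.85, 0.15, 0.05], docs)  # a query about refunds
# `context` would then be prepended to the prompt so the LLM answers
# from *your* knowledge rather than its training data.
```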

If it’s STEM related, you could try Chain of Thought techniques to get it to reason through each step before supplying the answer.
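A rough sketch of one way to do that — just the prompt construction, with a made-up question; the exact wording of the reasoning instruction is a choice, not a fixed API:

```python
def build_cot_prompt(question):
    """Wrap a question in a simple zero-shot chain-of-thought instruction."""
    return (
        f"Question: {question}\n"
        "Let's think step by step, showing each intermediate result, "
        "and only then state the final answer on its own line."
    )

prompt = build_cot_prompt(
    "A train travels 60 km in 45 minutes. What is its speed in km/h?"
)
# `prompt` is what you'd send as the completion input
```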

For “wisdom of crowds”, just use the raw LLM.


By “industry standard” do you mean the AI industry? Because each industry where AI is applied will have a very different benchmark of what is acceptable (medical vs tech bloggers, for example). Also remember these aren’t fact machines programmed to be 100% accurate; they are language models that are often accurate. And the prompts (and embeddings) dramatically change the output, so saying “GPT (globally) is correct 80% of the time” does not translate to your specific implementation being correct 80% of the time.

Might help to take a step back and share what you are looking to benchmark or what your concerns are.

So, I was trying to see if there is a comparison against traditional conversational AI, which I think is mostly in the range of 70–80%.
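If the goal is to compare against a conversational-AI baseline, a simple harness like this (with hypothetical, hand-judged eval data) gives you both numbers on your own questions, which is more meaningful than a global figure:

```python
def accuracy(results):
    """Fraction of items marked correct; `results` is a list of booleans."""
    return sum(results) / len(results) if results else 0.0

# Each eval item records whether the right context was retrieved and
# whether the final answer was judged correct (toy data for illustration).
evals = [
    {"retrieved_ok": True,  "answer_ok": True},
    {"retrieved_ok": True,  "answer_ok": False},
    {"retrieved_ok": False, "answer_ok": False},
    {"retrieved_ok": True,  "answer_ok": True},
]

retrieval_acc = accuracy([e["retrieved_ok"] for e in evals])  # 0.75
response_acc = accuracy([e["answer_ok"] for e in evals])      # 0.5
```

Running the same question set through the traditional system lets you compare the two on equal footing.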