I have a problem where searching within a document sometimes returns the correct answer and sometimes the wrong one, even though neither the document nor the request changes. What is the solution? I tried temperature=0 and it doesn’t help. Should I keep sending my prompts to the same Thread via the Assistants API?
Great question.
Simple answer: there is no 100% confidence.
On the other hand, the following can help:
0. Make sure your questions are crystal clear and do not force the model into hallucinations.
1. Better context presented to the answering model.
2. A limited set of possible answers the model can produce (yes/no/unknown etc.).
3. Fine-tuning on context/answer pairs so the model catches the answer pattern more confidently (see the sketch after this list).
4. Use logprobs to get an idea of the model's confidence, with a minimum acceptable score.
5. Add complementary questions to confirm the answer (interrogation approach).
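For item 3, chat-model fine-tuning data is a JSONL file with one example per line; here's roughly what a single context/answer pair could look like (the excerpt, question, and answer are made up for illustration):

```python
# Rough shape of one chat-format fine-tuning example (one JSON object per line in the JSONL file).
# All content below is invented; adapt the system prompt and answer set to your own task.
import json

example = {
    "messages": [
        {"role": "system", "content": "Answer strictly from the provided excerpt with yes, no, or unknown."},
        {"role": "user", "content": "Excerpt: ...\n\nQuestion: Does the contract allow early termination?"},
        {"role": "assistant", "content": "yes"},
    ]
}

print(json.dumps(example))  # append lines like this to build the training file
```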
Hope that gives some ideas.
There’s always some randomness because LLMs need to be seeded with a random number. If all you need is to reduce variance, you can specify a seed yourself in Chat Completions. Otherwise, prompt engineering and fine-tuning are good possible ideas.
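For reference, here's roughly what that looks like with the Python SDK (model name and messages are just placeholders):

```python
# Minimal sketch, assuming the official `openai` Python SDK and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",          # any chat model; use whatever you already run
    temperature=0,
    seed=42,                      # a fixed seed reduces (but does not eliminate) run-to-run variance
    messages=[
        {"role": "system", "content": "Answer strictly from the provided document."},
        {"role": "user", "content": "...your question plus document context here..."},
    ],
)

print(response.choices[0].message.content)
print(response.system_fingerprint)  # if this changes between runs, backend changes may still cause drift
```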
I like this idea too! Be aware that unless you seed the model, you may still see variance in the logprobs themselves. I have not yet tested that.
Personally, I use it with a limited set of answers the model can produce. The answers are mostly predefined tokens that don't start with the same characters, so I take the logprobs for the first token returned by the model, compare the top option against the others, and accept the answer only if, for example, the gap between the most probable answer and the next one is big enough.
That gives me an idea of how confidently the model answers the question, which in my use case (legal document analysis) can be largely enough to identify whether the model is providing the right answer.
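A minimal sketch of that first-token gap check, assuming a closed answer set like yes/no/unknown (the threshold and model name are illustrative, not recommendations):

```python
# Sketch of a "first-token logprob gap" check: compare the best first token against the runner-up
# and only accept the answer when the gap is large enough. Tune MIN_GAP on your own data.
import math
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    temperature=0,
    logprobs=True,
    top_logprobs=5,   # return the top alternatives for each generated token
    messages=[
        {"role": "system", "content": "Answer with exactly one word: yes, no, or unknown."},
        {"role": "user", "content": "...question plus document context..."},
    ],
)

first_token = resp.choices[0].logprobs.content[0]
candidates = sorted(first_token.top_logprobs, key=lambda t: t.logprob, reverse=True)

gap = candidates[0].logprob - candidates[1].logprob   # log-prob gap between best and runner-up
MIN_GAP = 2.0                                         # illustrative threshold

if gap >= MIN_GAP:
    print("accepted:", candidates[0].token, "prob ~", math.exp(candidates[0].logprob))
else:
    print("low confidence, route to review")
```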
I like to revisit this paper when I’m dealing with the kind of issue you highlighted @061albert :
As @OnceAndTwice and @sergeliatko pointed out, since we are in probabilistic land with LLMs, there isn’t 100% confidence. But you can try to get close to it.
We don’t know the exact nature of your documents and prompts, but you could start with prompt engineering. One technique worth exploring is query rewriting: it could simply be that the queries are ambiguous enough (or asking for too much) for the LLM to pick a different answer each time. The same comment holds true for your source documents. To address those, you could look into better parsing techniques and into using markup/Markdown as well, as that seems to help LLMs distinguish between texts better.
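If you want to try query rewriting, a minimal sketch is a cheap pre-processing call that disambiguates the question before it reaches the answering model (prompt wording and model choice are assumptions):

```python
# Hedged sketch of query rewriting: one extra call that turns a vague or compound question
# into a single, self-contained one before retrieval/answering.
from openai import OpenAI

client = OpenAI()

def rewrite_query(raw_question: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[
            {"role": "system", "content": (
                "Rewrite the user's question so it is unambiguous, self-contained, "
                "and asks for exactly one piece of information. Return only the rewritten question."
            )},
            {"role": "user", "content": raw_question},
        ],
    )
    return resp.choices[0].message.content.strip()

print(rewrite_query("What does the contract say about termination and fees?"))
```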
Setting seed values and exploring logprobs in Chat Completions are also recommended techniques and part of the best practices listed by OpenAI, so I would suggest following those too. Hope this helps, and it would also help if you posted more information about your specific use case for more guidance.
Agree with most of it; it’s just that my personal preference is to do prompt engineering as the last step rather than the first.
Usually, I start by simplifying everything, making sure it is all clear, executable, and leaves no room for ambiguity. Often that starts at the drawing-board stage of the application.
Everything runs on simple prompts because the workflow is broken down into single, easy operations.
Then I fine-tune where it makes sense.
And only if all of the above is still suboptimal do I dive into prompt engineering.
Why? Because prompt engineering is not an exact science and is very subjective (if you have better techniques for tracking performance metrics, I’ll take them), and prompt tweaking consumes far too much time for little performance improvement compared to the other techniques I’ve already deployed (such as workflow breakdown and query simplification).
Thank you, @sergeliatko.
I also refer to this OpenAI optimizing guide sometimes to decide where to focus energies. What’s your take on it?
Personally, I would define the terms at the beginning and explain the difference between “prompt engineering” and “prompting” (where the first is a scientific approach with measurable results and hypothesis testing, and the second is simpler, “common sense” prompt composition based on a clear task definition and checks that sufficient context is present)…
Otherwise, the resource is perfect for getting started. I wish I had had such a manual when I started back then.
The unreliability of LLM outputs in applications like data extraction bothered me, so I built an LLM trustworthiness scoring tool to deal with it that often works better than logprobs:
Hope it’s useful to you!
This looks interesting @jonasmueller! Will look into it and write back in case I have questions. Thank you for sharing!
What about a multi-model, or at least multi-prompt, consensus approach where you take two passes at retrieving the data? If the two agree, use the result. If not, have a third pass to get the data. If the third pass agrees with one of them, use that data. If not, raise an error.
While more expensive, a multi-model or multi-pass consensus approach can help push accuracy and reliability up.
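A rough sketch of what that consensus flow could look like (model choices, prompts, and the normalise() rule are placeholders, not recommendations):

```python
# Multi-pass consensus sketch: same extraction question asked twice (here with two models),
# accept only on agreement, with a third pass as tie-breaker.
from openai import OpenAI

client = OpenAI()

def extract(model: str, question: str, context: str) -> str:
    resp = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": "Extract the requested value from the document. Return only the value."},
            {"role": "user", "content": f"{question}\n\n---\n{context}"},
        ],
    )
    return resp.choices[0].message.content.strip()

def normalise(value: str) -> str:
    # Adapt to your data: strip currency symbols, whitespace, casing, etc.
    return value.lower().strip()

def consensus(question: str, context: str) -> str | None:
    a = extract("gpt-4o-mini", question, context)
    b = extract("gpt-4o", question, context)        # second model (or the same model, second pass)
    if normalise(a) == normalise(b):
        return a
    c = extract("gpt-4o-mini", question, context)   # tie-breaker pass
    if normalise(c) in (normalise(a), normalise(b)):
        return c
    return None                                     # no consensus: treat as an error / send to review
```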
The other approach I have used with numbers is having the models also pull things that I can test to be true, e.g. debits = credits, or reading three numbers where a + b = c and then validating the numbers against the formula. But then you have to run another model or prompt to pull the numbers anyway.
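And a tiny illustration of that cross-check idea on extracted numbers (the field names are invented; the point is the assertion, not the schema):

```python
# Sanity checks on extracted values: books must balance, and subtotal + tax must equal the total.
def validate_extraction(record: dict) -> bool:
    if abs(record["total_debits"] - record["total_credits"]) > 0.01:   # allow rounding noise
        return False
    a, b, c = record["subtotal"], record["tax"], record["grand_total"]
    return abs((a + b) - c) < 0.01

print(validate_extraction({"total_debits": 100.0, "total_credits": 100.0,
                           "subtotal": 90.0, "tax": 10.0, "grand_total": 100.0}))
```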
Let us know what you decide to do!
Have the model check its own answer and ask verification questions yourself:
- Are you sure?
- Show me the text you based your answer on
Ideas for a prompt (I copied this from a bloke on Reddit):
(run silently, output only final answer)
- Draft answer.
- List 2–5 questions that would verify each key fact.
- Answer those questions from context_passages with line citations.
- Revise draft, dropping or flagging unsupported content; tag each major conclusion High/Medium/Low confidence.
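One way to wire those steps into a single call could look like this (the prompt is a light paraphrase of the list above; the model and variable names are illustrative):

```python
# Sketch of a self-verification pass: the system prompt asks for a silent draft-verify-revise loop
# and only the final answer is returned.
from openai import OpenAI

client = OpenAI()

VERIFY_SYSTEM_PROMPT = """Run the following steps silently and output only the final answer:
1. Draft an answer to the question.
2. List 2-5 questions that would verify each key fact.
3. Answer those questions strictly from the provided context passages, with line citations.
4. Revise the draft, dropping or flagging unsupported content; tag each major conclusion
   as High/Medium/Low confidence."""

def verified_answer(question: str, context_passages: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[
            {"role": "system", "content": VERIFY_SYSTEM_PROMPT},
            {"role": "user", "content": f"Question: {question}\n\nContext passages:\n{context_passages}"},
        ],
    )
    return resp.choices[0].message.content
```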
Why start the list at 0? That’s something computers do. Humans start at step 1.
When you provide a list of things to do, you usually have your own context, and most humans forget that the context they have belongs to them alone; it is not shared, or most likely not shared, with others.
In this particular case, I gave a list of things to do and forgot that the prerequisite of having your questions crystal clear might not be so obvious to others.
So instead of retyping all the numbers in the list (I usually use voice typing), I just added an item with index 0. A side effect of that is: once you have a zero, you don’t start with one.
Sometimes I’m asking myself if I’m really a human lol