How to Reduce LLM Confidence So It Always Queries the VectorStore?

I’m building a project that uses a VectorStore, and I’m running into a problem with model confidence.

The issue is that the LLM is often too confident in its own internal knowledge. I want it to be less confident so that whenever it has even a small doubt about some information, it will proactively query the VectorStore instead of relying on what it “already knows.”

For example:
Let’s say I’ve added the Elon Musk biography to my VectorStore, and I want the model to answer questions strictly based on that book. However, because the model already knows a lot about Elon Musk, it frequently answers from its internal knowledge and doesn’t bother querying the VectorStore to confirm the information.

I’d like to know if anyone has faced this issue and what approaches worked to make the model more likely to rely on the VectorStore and less on its own prior knowledge.

Are there recommended prompting techniques, system instructions, retrieval settings, or other strategies to force or strongly encourage retrieval-first behavior?

I’m tagging the API category because, thinking about it, this seems like something that should be exposed as a parameter.

Any insights would be greatly appreciated.


So you could pass in developer instructions (or include them in the prompt):


Instructions for Answering User Queries

  • Every response to a user query must include the following two aspects:
    • A normal response from internal knowledge (training data).
    • A tool_call to the vector store with a query parameter synthesized from the user's original question/discussion, followed by an analysis of the results from the vector_store.
  • Each response must then include:
    • A synthesized summary of both the training-data response and the vector-store response, with in-line citations and quotes where possible, indicating where the confidence and data come from.
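As a concrete sketch, instructions along those lines plus a vector-store tool could be wired up like this. Everything here is illustrative, not an official API: the function name `search_vector_store`, its schema, the instruction wording, and the model name are all assumptions.

```python
# Sketch of a developer message plus a tool definition that pushes the model
# to always consult the vector store. Function name, schema, and instruction
# wording are assumptions for illustration only.

DEVELOPER_INSTRUCTIONS = """\
For every user query you must:
1. Draft an answer from your internal knowledge (training data).
2. Call search_vector_store with a query synthesized from the user's question,
   then analyze the returned passages.
3. Reply with a synthesized summary of both sources, with in-line citations
   and quotes where possible, indicating where each claim comes from.
"""

TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "search_vector_store",
            "description": "Semantic search over the project's vector store.",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {
                        "type": "string",
                        "description": "Query synthesized from the user's question.",
                    }
                },
                "required": ["query"],
            },
        },
    }
]

def build_request(user_question: str) -> dict:
    """Assemble a chat-completion request body (not sent anywhere here)."""
    return {
        "model": "gpt-4o",  # placeholder model name
        "messages": [
            {"role": "developer", "content": DEVELOPER_INSTRUCTIONS},
            {"role": "user", "content": user_question},
        ],
        "tools": TOOLS,
    }
```

The payload is built as plain dicts so you can inspect it before passing it to whichever client library you use.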

However, the question is: if you are using the API, what does your tool-chain/re-prompt workflow look like?

I.e. are you:

  1. Calling the agent with the query and expecting it to make a tool_call, while also showing user-facing output as the tool call runs? Or do you want this first pass to happen entirely behind the scenes, with the initial LLM response either discarded or kept in the previous response ID/context window for the final output?
  2. Is your tool-call routing internal to your own system, or handled on OpenAI's servers/etc.?
  3. Do you want only the final synthesized output from the steps above, or are you “watching the process as it happens” and seeing each stage of the LLM building its answer?
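As a minimal sketch of the behind-the-scenes variant from point 1, here is a re-prompt loop with both the model and the vector store stubbed out, so the flow is visible without any API calls. The message shapes, function names, and the stubbed responses are all assumptions for illustration.

```python
# Behind-the-scenes tool-call loop: the first model response requests a tool
# call, the tool result is appended to the context, and only the final
# synthesis is returned to the user. Model and store are stubs.

def fake_model(messages):
    """Stand-in for a chat-completion call. Requests a tool call once,
    then produces a final answer that cites the tool result."""
    if not any(m["role"] == "tool" for m in messages):
        return {"tool_call": {"name": "search_vector_store",
                              "arguments": {"query": messages[-1]["content"]}}}
    tool_result = next(m for m in messages if m["role"] == "tool")
    return {"content": f"According to the vector store: {tool_result['content']}"}

def fake_vector_store(query):
    """Stand-in for a similarity search against the store."""
    return "Musk was born in Pretoria, South Africa. [bio, ch. 1]"

def answer(user_question):
    messages = [{"role": "user", "content": user_question}]
    response = fake_model(messages)
    while "tool_call" in response:        # re-prompt until no tool call remains
        call = response["tool_call"]
        result = fake_vector_store(call["arguments"]["query"])
        messages.append({"role": "tool", "name": call["name"], "content": result})
        response = fake_model(messages)   # final pass sees the retrieved text
    return response["content"]
```

Swapping `fake_model` for a real completion call and `fake_vector_store` for a real similarity search gives you the discard-or-keep choice: here the first response stays in `messages`, so the final pass sees the full context.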

It really all depends on how your system/API setup is configured: whether you are using out-of-the-box OpenAI tooling or have built your own orchestration locally.

You can also, of course, pass tool-related parameters that control how reliably the model makes tool calls, i.e. requiring it to always do so on the first response (the tool_choice parameter covers this); search the API documentation for the details…
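For example, in the Chat Completions API the `tool_choice` parameter can force a tool call on a given turn. A sketch, where the function name `search_vector_store` and the request shape are illustrative assumptions:

```python
# Sketch: forcing tool use via the Chat Completions `tool_choice` parameter.
# "required" forces the model to call *some* tool on this turn; naming a
# specific function forces that exact tool. The function name is illustrative.

def with_forced_retrieval(request: dict, force_specific: bool = True) -> dict:
    """Return a copy of a request body with tool_choice set."""
    forced = dict(request)
    if force_specific:
        # Force this specific function to be called.
        forced["tool_choice"] = {
            "type": "function",
            "function": {"name": "search_vector_store"},
        }
    else:
        # Force some tool call, but let the model pick which tool.
        forced["tool_choice"] = "required"
    return forced
```

You would typically force retrieval only on the first turn, then relax `tool_choice` back to `"auto"` so the model can answer once the vector-store results are in context.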