How does the model decide what query to pass to RAG functions?

We’re trying to optimize our knowledge base that’s searched by RAG to help ensure the desired results come back for specific queries, but it seems like the model searches for very different terms in different instances, even during similar conversations.

Does anyone have any insight on how the model (gpt-4-turbo-preview) decides what query to pass when it calls RAG functions?

It doesn’t pass “queries”; it performs a semantic search based on the cosine similarity of the embedded vectors.

Right, but the model comes up with a text query and passes it to the RAG function. It’s up to the RAG function to convert that query to an embedding and do the semantic search. I can see the text queries being passed to our RAG function. Just not sure if there is any insight into how the model decides what to search for?
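To be clear about the split, it’s roughly like this (a sketch only; `rag_search` and `vector_store` are hypothetical names, not our actual code):

```python
from openai import OpenAI

client = OpenAI()

def rag_search(query: str) -> list[str]:
    """`query` is free text the model wrote as a function-call argument.
    Converting it to a vector and doing the cosine-similarity search
    happens entirely on our side."""
    q_vec = client.embeddings.create(
        model="text-embedding-3-small",  # any embedding model works here
        input=query,
    ).data[0].embedding
    return vector_store.search(q_vec, top_k=5)  # hypothetical vector-store client
```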

Because the default top_p and/or temperature are too “liberal”, it comes back with a different text query, even for similar conversations.

As to what that prompt is, I don’t know. But it’s probably optimized using something similar to DSPy.

Ah, I hadn’t thought about temperature and top_p impacting RAG function queries, but that does make sense. We’ll experiment with temperature = 0 to see if the queries become more consistent. Thanks!
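For anyone following along, something like this is what we’ll try (sketch only; the user message is a placeholder and `tools` is our knowledge-base search tool definition, not shown here):

```python
from openai import OpenAI

client = OpenAI()

# Same chat call as before, just with sampling pinned down so the model's
# generated function arguments (including the search query) vary less.
response = client.chat.completions.create(
    model="gpt-4-turbo-preview",
    messages=[{"role": "user", "content": "How do I reset my router?"}],
    tools=tools,        # our knowledge-base search tool (defined elsewhere)
    temperature=0,      # near-deterministic decoding
    # top_p=0.1,        # alternative/additional knob; adjust one at a time
)
```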

I think there’s some confusion going on here, most likely all on my end.

I am assuming you’re using the assistants API endpoint and you have uploaded some files to use for retrieval.

From the docs,

How it works

The model then decides when to retrieve content based on the user Messages. The Assistants API automatically chooses between two retrieval techniques:

  1. it either passes the file content in the prompt for short documents, or
  2. performs a vector search for longer documents

Retrieval currently optimizes for quality by adding all relevant content to the context of model calls. We plan to introduce other retrieval strategies to enable developers to choose a different tradeoff between retrieval quality and model usage cost.

From this, my understanding is that if your docs are short, the model will just dump them all into context and keep them there at all times. If you have longer documents, the model will simply create an embedding of the most recent message and perform a vector search with that. They aren’t doing anything HyDE-based or using a mixed-method search with keywords.
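If it helps, the setup on that side looks roughly like this (a sketch assuming the beta Assistants surface of the Python SDK from that era; the file name and instructions are made up):

```python
from openai import OpenAI

client = OpenAI()

# Upload a document and attach it to an assistant with the built-in
# retrieval tool; the API decides whether to inline it or vector-search it.
file = client.files.create(file=open("product_manual.pdf", "rb"), purpose="assistants")

assistant = client.beta.assistants.create(
    model="gpt-4-turbo-preview",
    instructions="Answer support questions using the attached manual.",
    tools=[{"type": "retrieval"}],  # built-in retrieval, not a custom function
    file_ids=[file.id],
)
```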

I’m using the /chat/completions endpoint. Sorry, should have specified.

But, the chat/completions endpoint doesn’t do RAG. That’s an assistants thing.

Do you have an example of your code and some inputs and outputs so we know exactly what you’re trying to do?

I see what’s going on here: he’s having the model write the search query, which is passed through a function call to the database that actually does the cosine similarity stuff. :laughing:

MAYBE…

I’d like for them to be a little more explicit about that though.

Yeah, I’m starting to get confused about what’s going on here as well :sweat_smile:

Equally confused by this statement. But, as an exercise, let’s break it down:

Using the chat completions API and a vector store for embeddings, this is the typical RAG scenario:

User asks question → Vector Store Cosine Similarity Search → Question + Search results → LLM (gpt-4-turbo-preview) → Response

So, in this traditional RAG setup, the model doesn’t “decide” what query to send. It merely evaluates the question and search documents you submit to it and gives you a response based upon your system message requirements.
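In code, that pipeline is roughly the following (a sketch, not your app; the names, embedding model, and in-memory chunk list are just examples):

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def cosine(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def answer(question: str, chunks: list[dict]) -> str:
    # 1. Embed the user's question verbatim -- no model-generated query here.
    q_vec = client.embeddings.create(
        model="text-embedding-3-small", input=question
    ).data[0].embedding
    # 2. Cosine-similarity search over the pre-embedded chunks.
    top = sorted(chunks, key=lambda c: cosine(q_vec, c["embedding"]), reverse=True)[:3]
    context = "\n\n".join(c["text"] for c in top)
    # 3. Question + search results go to the LLM, which only writes the answer.
    resp = client.chat.completions.create(
        model="gpt-4-turbo-preview",
        messages=[
            {"role": "system", "content": f"Answer using this context:\n{context}"},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content
```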

So the question becomes: How does the scenario you are describing differ from the above explanation?

Not sure how I’ve managed to confuse everyone. I am new to OpenAI and wasn’t aware of the Assistants API. What I’m working with goes like this:

User question → Chat API request → LLM calls function to get external data → Function does cosine similarity search of vector store → Results sent back to LLM → LLM generates response

Functions are provided to the LLM via the tools or functions fields in the chat request. The LLM decides what to pass as parameters to the functions, so unless I am misunderstanding something, it does “decide” what query to send. From the API reference:

A list of tools the model may call. Currently, only functions are supported as a tool. Use this to provide a list of functions the model may generate JSON inputs for. A max of 128 functions are supported.
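A rough sketch of what I mean (made-up names and schema, not our actual function definition):

```python
import json
from openai import OpenAI

client = OpenAI()

# The tool schema is all the model sees; it fills in "query" however it likes.
tools = [{
    "type": "function",
    "function": {
        "name": "search_knowledge_base",
        "description": "Search guides, manuals, and FAQs for support answers.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search terms"},
            },
            "required": ["query"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4-turbo-preview",
    messages=[{"role": "user", "content": "My router keeps dropping Wi-Fi."}],
    tools=tools,
)

message = response.choices[0].message
if message.tool_calls:  # the model chose to call the search function
    query = json.loads(message.tool_calls[0].function.arguments)["query"]
    print(query)  # whatever search terms the model decided on
```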

What is the actual function it’s calling? How have you defined the function?

Do you have an actual concrete example?

From what you’ve described the model is likely to just make up something.

In my case it’s a function to search our knowledge base for additional information to answer the user question. The app is a support bot and the knowledge base has things like guides, manuals, FAQs, etc…

Then it sounds like you’ve got your answer! Good luck!

Assuming that the RAG functions are defined externally, it appears the point of the setup is to let GPT-4-turbo determine the query passed to those functions.

Even without knowing what the RAG functions are, we can still say that GPT-4-turbo is prone to hallucinations.

Why not consider using the traditional GPT-4 endpoint (although the short context length might necessitate chunking) instead of GPT-4-turbo?

I didn’t realize GPT-4 Turbo was more prone to hallucinations than the older GPT-4. Worth trying. Thanks!

The model will sometimes take too much “inspiration” from the context of its conversation with the user and form its queries accordingly. You can prevent this by sending only the user’s most recent message over to a separate LLM instructed specifically on how to search your knowledge database, and let that handle the function call :laughing:
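Something like this, as a sketch (the instruction text and names here are made up):

```python
from openai import OpenAI

client = OpenAI()

def build_search_query(latest_user_message: str) -> str:
    # A dedicated "query writer" call: no conversation history, low temperature,
    # and instructions focused purely on how to phrase knowledge-base searches.
    resp = client.chat.completions.create(
        model="gpt-4-turbo-preview",
        temperature=0,
        messages=[
            {"role": "system", "content": (
                "Rewrite the user's message as a short keyword search query "
                "for a product support knowledge base. Return only the query."
            )},
            {"role": "user", "content": latest_user_message},
        ],
    )
    return resp.choices[0].message.content.strip()
```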

Please allow me to correct myself.

As mentioned above, calling multiple tools (parallel function calls) is not supported in the traditional GPT-4.

The term “Assistant” is confusing to me because “assistant” is used for responses from GPT, and there is also an Assistants API.
But now I understand that you were talking about function calling from the statement “A max of 128 functions are supported.”

This is just my opinion, but when passing all the text to be retrieved as input (without using function calls), a longer context seems more likely to cause hallucinations due to the limitations of attention. That is why I said the traditional GPT-4 is less likely to cause hallucinations.

I apologize for that point.