How does the model decide what query to pass to RAG functions?

We’re trying to optimize our knowledge base that’s searched by RAG to help ensure the desired results come back for specific queries, but it seems like the model searches for very different terms in different instances, even during similar conversations.

Does anyone have any insight on how the model (gpt-4-turbo-preview) decides what query to pass when it calls RAG functions?

It doesn’t pass “queries”; it performs a semantic search based on the cosine similarity of the embedded vectors.

Right, but the model comes up with a text query and passes it to the RAG function. It’s up to the RAG function to convert that query to an embedding and do the semantic search. I can see the text queries being passed to our RAG function. Just not sure if there is any insight into how the model decides what to search for?
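To be clear about the split, it’s roughly like this (a sketch only; `rag_search` and `vector_store` are hypothetical names, not our actual code):

```python
from openai import OpenAI

client = OpenAI()

def rag_search(query: str) -> list[str]:
    """`query` is free text the model wrote as a function-call argument.
    Converting it to a vector and doing the cosine-similarity search
    happens entirely on our side."""
    q_vec = client.embeddings.create(
        model="text-embedding-3-small",  # any embedding model works here
        input=query,
    ).data[0].embedding
    return vector_store.search(q_vec, top_k=5)  # hypothetical vector-store client
```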

Because the default top_p and/or temperature are too “liberal”, it comes back with a different text query, even for similar conversations.

As to what that prompt is, I don’t know. But it’s probably optimized using something similar to DSPy.

Ah, I hadn’t thought about temperature and top_p impacting RAG function queries, but that does make sense. We’ll experiment with temperature = 0 to see if the queries become more consistent. Thanks!
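For anyone following along, something like this is what we’ll try (sketch only; the user message is a placeholder and `tools` is our knowledge-base search tool definition, not shown here):

```python
from openai import OpenAI

client = OpenAI()

# Same chat call as before, just with sampling pinned down so the model's
# generated function arguments (including the search query) vary less.
response = client.chat.completions.create(
    model="gpt-4-turbo-preview",
    messages=[{"role": "user", "content": "How do I reset my router?"}],
    tools=tools,        # our knowledge-base search tool (defined elsewhere)
    temperature=0,      # near-deterministic decoding
    # top_p=0.1,        # alternative/additional knob; adjust one at a time
)
```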

I think there’s some confusion going on here, most likely all on my end.

I am assuming you’re using the assistants API endpoint and you have uploaded some files to use for retrieval.

From the docs,

How it works

The model then decides when to retrieve content based on the user Messages. The Assistants API automatically chooses between two retrieval techniques:

  1. it either passes the file content in the prompt for short documents, or
  2. performs a vector search for longer documents

Retrieval currently optimizes for quality by adding all relevant content to the context of model calls. We plan to introduce other retrieval strategies to enable developers to choose a different tradeoff between retrieval quality and model usage cost.

From this, my understanding is that if your docs are short, the model will just dump them all into context and keep them there at all times. If you have longer documents, the model will simply create an embedding of the most recent message and perform a vector search with that. They aren’t doing anything HyDE-based or using a mixed-method search with keywords.
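If it helps, the setup on that side looks roughly like this (a sketch assuming the beta Assistants surface of the Python SDK from that era; the file name and instructions are made up):

```python
from openai import OpenAI

client = OpenAI()

# Upload a document and attach it to an assistant with the built-in
# retrieval tool; the API decides whether to inline it or vector-search it.
file = client.files.create(file=open("product_manual.pdf", "rb"), purpose="assistants")

assistant = client.beta.assistants.create(
    model="gpt-4-turbo-preview",
    instructions="Answer support questions using the attached manual.",
    tools=[{"type": "retrieval"}],  # built-in retrieval, not a custom function
    file_ids=[file.id],
)
```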

I’m using the /chat/completions endpoint. Sorry, should have specified.

But, the chat/completions endpoint doesn’t do RAG. That’s an assistants thing.

Do you have an example of your code and some inputs and outputs so we know exactly what you’re trying to do?

I see what’s going on here: he’s having the model write the search query, which is passed through a function call to the database that actually does the cosine similarity stuff. :laughing:

MAYBE…

I’d like for them to be a little more explicit about that though.

Yeah, I’m starting to get confused about what’s going on here as well :sweat_smile:

Equally confused by this statement. But, as an exercise, let’s break it down:

Using the chat completions API and a vector store for embeddings, this is the typical RAG scenario:

User asks question → Vector Store Cosine Similarity Search → Question + Search results → LLM (gpt-4-turbo-preview) → Response

So, in this traditional RAG setup, the model doesn’t “decide” what query to send. It merely evaluates the question and search documents you submit to it and gives you a response based upon your system message requirements.
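In code, that pipeline is roughly the following (a sketch, not your app; the names, embedding model, and in-memory chunk list are just examples):

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def cosine(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def answer(question: str, chunks: list[dict]) -> str:
    # 1. Embed the user's question verbatim -- no model-generated query here.
    q_vec = client.embeddings.create(
        model="text-embedding-3-small", input=question
    ).data[0].embedding
    # 2. Cosine-similarity search over the pre-embedded chunks.
    top = sorted(chunks, key=lambda c: cosine(q_vec, c["embedding"]), reverse=True)[:3]
    context = "\n\n".join(c["text"] for c in top)
    # 3. Question + search results go to the LLM, which only writes the answer.
    resp = client.chat.completions.create(
        model="gpt-4-turbo-preview",
        messages=[
            {"role": "system", "content": f"Answer using this context:\n{context}"},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content
```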

So the question becomes: How does the scenario you are describing differ from the above explanation?

Not sure how I’ve managed to confuse everyone. I am new to OpenAI and wasn’t aware of the Assistants API. What I’m working with goes like this:

User question → Chat API request → LLM calls function to get external data → Function does cosine similarity search of vector store → Results sent back to LLM → LLM generates response

Functions are provided to the LLM via the tools or functions fields in the chat request. The LLM decides what to pass as parameters to the functions, so unless I am misunderstanding something, it does “decide” what query to send. From the API reference:

A list of tools the model may call. Currently, only functions are supported as a tool. Use this to provide a list of functions the model may generate JSON inputs for. A max of 128 functions are supported.
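A rough sketch of what I mean (made-up names and schema, not our actual function definition):

```python
import json
from openai import OpenAI

client = OpenAI()

# The tool schema is all the model sees; it fills in "query" however it likes.
tools = [{
    "type": "function",
    "function": {
        "name": "search_knowledge_base",
        "description": "Search guides, manuals, and FAQs for support answers.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search terms"},
            },
            "required": ["query"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4-turbo-preview",
    messages=[{"role": "user", "content": "My router keeps dropping Wi-Fi."}],
    tools=tools,
)

message = response.choices[0].message
if message.tool_calls:  # the model chose to call the search function
    query = json.loads(message.tool_calls[0].function.arguments)["query"]
    print(query)  # whatever search terms the model decided on
```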

What is the actual function it’s calling? How have you defined the function?

Do you have an actual concrete example?

From what you’ve described the model is likely to just make up something.

In my case it’s a function to search our knowledge base for additional information to answer the user question. The app is a support bot and the knowledge base has things like guides, manuals, FAQs, etc…

Then it sounds like you’ve got your answer! Good luck!

Assuming that the RAG functions are defined externally, it appears the point of the setup is to let GPT-4-turbo determine the query passed to those functions.

Even without knowing what the RAG functions are, we can still say that GPT-4-turbo is prone to hallucinations.

Why not consider using the traditional GPT-4 endpoint (although the short context length might necessitate chunking) instead of GPT-4-turbo?

I didn’t realize GPT-4 Turbo was more prone to hallucinations than the older GPT-4. Worth trying. Thanks!

The model will sometimes take too much “inspiration” from the context of its conversation with the user and form its queries accordingly. You can prevent this by sending only the user’s most recent message over to a separate LLM instructed specifically on how to search your knowledge database, and let that handle the function call :laughing:
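Something like this, as a sketch (the instruction text and names here are made up):

```python
from openai import OpenAI

client = OpenAI()

def build_search_query(latest_user_message: str) -> str:
    # A dedicated "query writer" call: no conversation history, low temperature,
    # and instructions focused purely on how to phrase knowledge-base searches.
    resp = client.chat.completions.create(
        model="gpt-4-turbo-preview",
        temperature=0,
        messages=[
            {"role": "system", "content": (
                "Rewrite the user's message as a short keyword search query "
                "for a product support knowledge base. Return only the query."
            )},
            {"role": "user", "content": latest_user_message},
        ],
    )
    return resp.choices[0].message.content.strip()
```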

Please allow me to correct myself.

As mentioned above, calling multiple tools (parallel function calls) is not supported in the traditional GPT-4.

The term “Assistant” is confusing to me because “assistant” is used for responses from GPT, and there is also an Assistants API.
But now I understand that you were talking about function calling from the statement “A max of 128 functions are supported.”

This is just my opinion, but when passing all the text to be retrieved as input (without using function calls), a longer context seems more likely to cause hallucinations due to the limitations of attention. That is why I said the traditional GPT-4 is less likely to cause hallucinations.

I apologize for that point.