We’re trying to optimize our knowledge base that’s searched via RAG so that the desired results come back for specific queries, but the model seems to search for very different terms in different instances, even during similar conversations.
Does anyone have any insight on how the model (gpt-4-turbo-preview) decides what query to pass when it calls RAG functions?
Right, but the model comes up with a text query and passes it to the RAG function. It’s up to the RAG function to convert that query to an embedding and do the semantic search. I can see the text queries being passed to our RAG function; I’m just not sure whether there is any insight into how the model decides what to search for.
Ah I hadn’t thought about temperature and top_p impacting RAG function queries but that does make sense. We’ll experiment with temperature = 0 to see if queries become more consistent. Thanks!
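For anyone curious, this is roughly what we’ll try — just a minimal sketch, assuming the v1 Python SDK, with a placeholder tool standing in for our real knowledge-base search function:

```python
from openai import OpenAI

client = OpenAI()

# Placeholder tool definition; the model fills in the "query" argument itself.
tools = [{
    "type": "function",
    "function": {
        "name": "search_knowledge_base",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4-turbo-preview",
    messages=[{"role": "user", "content": "How do I reset my password?"}],
    tools=tools,
    temperature=0,  # reduce sampling randomness in the generated tool arguments
    top_p=1,
)

# The query text the model chose shows up in the tool call arguments.
print(response.choices[0].message.tool_calls)
```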
The model then decides when to retrieve content based on the user Messages. The Assistants API automatically chooses between two retrieval techniques:
it either passes the file content in the prompt for short documents, or
performs a vector search for longer documents
Retrieval currently optimizes for quality by adding all relevant content to the context of model calls. We plan to introduce other retrieval strategies to enable developers to choose a different tradeoff between retrieval quality and model usage cost.
From this, my understanding is that if your docs are short, the model will just dump them all into context and keep them there at all times. If you have longer documents, the model will simply create an embedding of the most recent message and perform a vector search with that. They aren’t doing anything HyDE-based or using a mixed-method search with keywords.
I see what’s going on here: he’s having the model write the search query, which is passed through a function call to the database that actually does the cosine similarity search.
Equally confused by this statement. But, as an exercise, let’s break it down:
Using the Chat Completions API and a vector store for embeddings, this is the typical RAG scenario:
User asks question ----> Vector store cosine-similarity search ----> Question + search results ----> LLM (gpt-4-turbo-preview) ----> Response
So, in this traditional RAG setup, the model doesn’t “decide” what query to send. It merely evaluates the question and the search documents you submit to it and gives you a response based upon your system message requirements.
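In code, that flow is roughly this — a minimal sketch only, assuming the OpenAI Python SDK, an illustrative embedding model, and a tiny in-memory list standing in for the vector store:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    # Embedding model name is illustrative; any embedding model works here.
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Pretend these were chunked and embedded ahead of time (the "vector store").
docs = ["How to reset your password ...", "Billing FAQ ...", "Getting started guide ..."]
doc_vectors = [embed(d) for d in docs]

question = "I forgot my password, what do I do?"

# 1. The *application* (not the model) embeds the raw user question and searches.
q_vec = embed(question)
top_doc = docs[int(np.argmax([cosine(q_vec, v) for v in doc_vectors]))]

# 2. Question + search results go to the model, which only writes the answer.
answer = client.chat.completions.create(
    model="gpt-4-turbo-preview",
    messages=[
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": f"Context:\n{top_doc}\n\nQuestion: {question}"},
    ],
)
print(answer.choices[0].message.content)
```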
So the question becomes: How does the scenario you are describing differ from the above explanation?
Not sure how I’ve managed to confuse everyone. I am new to OpenAI and wasn’t aware of the Assistants API. What I’m working with goes like this:
User question → Chat API request → LLM calls function to get external data → Function does cosine similarity search of vector store → Results sent back to LLM → LLM generates response
Functions are provided to the LLM via the tools or functions fields in the chat request. The LLM decides what to pass as parameters to the functions so, unless I am misunderstanding something, it does “decide” what query to send. From the API reference:
A list of tools the model may call. Currently, only functions are supported as a tool. Use this to provide a list of functions the model may generate JSON inputs for. A max of 128 functions are supported.
In my case it’s a function to search our knowledge base for additional information to answer the user question. The app is a support bot and the knowledge base has things like guides, manuals, FAQs, etc…
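To make the setup concrete, here is a stripped-down sketch of our loop. Names like search_knowledge_base and its query parameter are placeholders, and the actual cosine-similarity search is our own code, reduced to a stub here:

```python
import json
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "search_knowledge_base",
        "description": "Search guides, manuals, and FAQs for information relevant to the user's question.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Text to search the knowledge base for."},
            },
            "required": ["query"],
        },
    },
}]

def search_knowledge_base(query: str) -> str:
    # Our actual RAG function: embed `query`, run the cosine-similarity search
    # against the vector store, and return the top matching chunks as text.
    return "top matching knowledge-base chunks for: " + query  # placeholder

messages = [
    {"role": "system", "content": "You are a support bot. Search the knowledge base before answering."},
    {"role": "user", "content": "How do I update my billing address?"},
]

first = client.chat.completions.create(
    model="gpt-4-turbo-preview", messages=messages, tools=tools
)
msg = first.choices[0].message

if msg.tool_calls:
    messages.append(msg)  # keep the assistant's tool call in the history
    for call in msg.tool_calls:
        args = json.loads(call.function.arguments)
        # This is the query the model "decided" to send -- the thing I'm asking about.
        print("Model-chosen query:", args["query"])
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": search_knowledge_base(args["query"]),
        })

    final = client.chat.completions.create(model="gpt-4-turbo-preview", messages=messages)
    print(final.choices[0].message.content)
```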
Assuming that the RAG functions are defined externally, it appears their purpose is to let GPT-4-turbo determine the query that gets passed to them.
Even without knowing what the RAG functions are, we can still say that GPT-4-turbo is prone to hallucinations.
Why not consider using the traditional GPT-4 endpoint (although the short context length might necessitate chunking) instead of GPT-4-turbo?
The model will sometimes take too much “inspiration” from the context of its conversation with the user and form its queries accordingly. You can prevent this by sending only the user’s most recent message to a separate LLM instructed specifically on how to search your knowledge database, and let that handle the function call.
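Something along these lines — just a rough sketch, where the prompt wording, model choice, and function name are illustrative rather than prescriptive:

```python
from openai import OpenAI

client = OpenAI()

def make_search_query(latest_user_message: str) -> str:
    """Turn only the user's most recent message into a knowledge-base search
    query, using a separate, narrowly instructed model call."""
    resp = client.chat.completions.create(
        model="gpt-4-turbo-preview",  # or a cheaper model dedicated to query writing
        temperature=0,
        messages=[
            {"role": "system", "content": (
                "Rewrite the user's message as a short keyword-style search query "
                "for a support knowledge base. Return only the query text."
            )},
            {"role": "user", "content": latest_user_message},
        ],
    )
    return resp.choices[0].message.content.strip()

# The query writer never sees earlier conversation turns, so it can't take
# "inspiration" from them.
query = make_search_query("It keeps saying my login is wrong even after the email thing")
# ...pass `query` to your existing cosine-similarity search / RAG function
```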
As mentioned above, calling multiple tools (parallel function calls) is not supported in the traditional GPT-4.
The term “Assistant” is confusing to me because “assistant” is the role used for responses from GPT, and there is also an Assistants API.
But now I understand that you were talking about parallel function calls, based on the statement “A max of 128 functions are supported.”
This is just my opinion, but when all the text to be retrieved is passed as input (without using function calls), a longer context seems more likely to cause hallucinations due to the limitations of attention. That is why I said the traditional GPT-4 is less likely to cause hallucinations.