I have a corpus of documents that I have broken down into chunks. Each chunk is about 20 sentences long. I also chunked these documents with a sliding window to maintain context. I used the OpenAI embeddings model to create a vector for each chunk of text. Currently, when the user submits a query, the app embeds the query, performs a semantic search against the vector database, and then provides the GPT model with the top 10 chunks of text along with the user query; GPT then provides an answer to the query.
My question: will embedding a query provide a vector that will return the most relevant chunks of text based on semantic search? Or is it better to modify the query? For example, should I rewrite the query to be a statement to potentially achieve better matches in the vector database?
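For reference, a minimal sketch of what the app currently does (the model names and the `vector_db.search()` call are placeholders for whatever I'm actually running, not a specific product's API):

```python
from openai import OpenAI

client = OpenAI()

def answer_query(query: str, vector_db, top_k: int = 10) -> str:
    # Embed the raw user query with the same embedding model that built the index.
    query_vector = client.embeddings.create(
        model="text-embedding-3-small",  # placeholder model
        input=query,
    ).data[0].embedding

    # vector_db.search() stands in for whatever vector store is used;
    # assume it returns the top_k chunk texts by cosine similarity.
    chunks = vector_db.search(query_vector, top_k=top_k)

    # Hand the retrieved chunks plus the original question to the chat model.
    prompt = "Reference material:\n" + "\n---\n".join(chunks) + f"\n---\nQuestion: {query}"
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[{"role": "user", "content": prompt}],
    )
    return completion.choices[0].message.content
```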
You are on to a good idea. The text of a document might not look at all like a short question.
You can insert the text into a prompt such as “Write a two-paragraph article that expands on this subject or attempts to answer this question: <user_input>.” Then you’ve got preliminary text that looks more like the database content (as long as the AI didn’t say “I don’t know anything about that”).
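A minimal sketch of that idea, assuming the openai Python client (the model names are placeholders):

```python
from openai import OpenAI

client = OpenAI()

def expand_then_embed(user_input: str) -> list[float]:
    # Ask the chat model for preliminary text that reads more like your documents.
    expansion = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[{
            "role": "user",
            "content": (
                "Write a two-paragraph article that expands on this subject "
                f"or attempts to answer this question: {user_input}"
            ),
        }],
    ).choices[0].message.content

    # Embed the expansion instead of (or alongside) the raw question for the search.
    return client.embeddings.create(
        model="text-embedding-3-small",  # placeholder model
        input=expansion,
    ).data[0].embedding
```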
Doing any kind of AI processing on the user input will delay the answer, which is a negative experience for an interactive AI.
Another thing one might do is include more conversational context to be embedded. A user’s previous questions and AI responses would help if the user input is “how much is model x-35 per 1000.”
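For example, a sketch of that (the model name is a placeholder); just fold the last few turns into the text that gets embedded:

```python
def embed_with_history(client, history: list[str], user_input: str, max_turns: int = 4) -> list[float]:
    # Concatenate the last few user/assistant turns with the new input, so that
    # "how much is model x-35 per 1000" carries along the product being discussed.
    context = "\n".join(history[-max_turns:] + [user_input])
    return client.embeddings.create(
        model="text-embedding-3-small",  # placeholder model
        input=context,
    ).data[0].embedding
```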
In theory, yes, BUT … if the user’s query has no information overlap with your content, then no, you need to transform the query.
How do you transform a query?
Well …
Try using HyDE. This is where you take the query, have the LLM answer the question, and then correlate this answer with your data.
Try a “vector redirect”. That is, add an offset vector to the embedded query vector, so that the new vector points toward a known target in the embedding space that you think is correct.
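A rough sketch of what that might look like (the target vector and the weight are things you would have to hand-tune or learn from your own data):

```python
import numpy as np

def redirect(query_vec: list[float], target_vec: list[float], weight: float = 0.3) -> np.ndarray:
    # Nudge the query embedding toward a known-good target region,
    # then re-normalize so cosine similarity still behaves.
    q = np.asarray(query_vec)
    t = np.asarray(target_vec)
    redirected = q + weight * (t - q)
    return redirected / np.linalg.norm(redirected)
```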
Also try keywords.
In fact, try all these, and fuse all the results using Reciprocal Rank Fusion (RRF).
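The fusion step itself is only a few lines; a generic sketch, assuming each retriever returns an ordered list of chunk IDs:

```python
from collections import defaultdict

def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    # Standard RRF: score(d) = sum over lists of 1 / (k + rank(d)),
    # where rank starts at 1 and k = 60 is the commonly used constant.
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# e.g. fused = rrf_fuse([hyde_results, redirect_results, keyword_results])
```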
“Given the following reference material, which ends with “—” as a delimiter, answer the question that appears below the delimiter. Reference Material: $referenceMaterial\n—\n$question”
Is this best practice for how to submit reference material followed by a question, so that the LLM knows which is which and can’t start interpreting reference material as requests for actions etc?
You don’t want the answer from the AI to be “thanks for providing me the reference material”.
I would push it into an earlier user or assistant role appearing in chat history, with the appropriate tone for that role, saying “[retrieved documentation for answering my query]\n\n—\n\n”. The AI sees it as something the user offered, or understands it was automatic, and then simply uses the knowledge, while the user input contains only things the user actually said.
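Concretely, the message layout would look something like this (the model name is a placeholder):

```python
from openai import OpenAI

client = OpenAI()

def answer_with_injected_history(retrieved_text: str, user_query: str) -> str:
    messages = [
        {"role": "system", "content": "You are a helpful product assistant."},
        # Injected turn: the retrieval results appear as an earlier part of the
        # conversation, not as part of what the user just typed.
        {"role": "user",
         "content": "[retrieved documentation for answering my query]\n\n—\n\n" + retrieved_text},
        # Only the user's actual words go in the final user turn.
        {"role": "user", "content": user_query},
    ]
    return client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=messages,
    ).choices[0].message.content
```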
Yeah, loading up a big history of “user” messages to submit, with each being a chunk of reference material, is definitely cleaner…than trying to “engineer” it all into one single block of text. I think what I suggested will work perfectly, but it’s simpler to just use a “fake” chat history which is really always interpreted as “context”.
Edit 1:
But the “System” prompt would need to be written with something like “You are a customer service agent, and will answer questions from a customer. Treat the entire chat history as your reference knowledge, not as something the user said.”, right?
Because if not, the LLM will basically think the customer made all those statements, which could lead to a very confusing conversation for the end user.
Edit 2:
Or… is there a limit to the length of the “System Prompt”? Why not load up all the reference material in the system prompt? Then there’s no risk of the LLM thinking the user “said” all that.
That’s why I said it should be appropriate for the voice of the role. For example, the assistant prefaces it with “found with database search: (blah blah)”. But ultimately, [I can just dump documents] and make the AI say “OK”, and it understands to use the knowledge of those documents.
User: What is the capital of Texas?
ChatGPT: You just told me it was Austin.
User: No, I just got here....what are you even talking about?
Ergo, we need the instructions and reference material either in the system prompt, or handled the way I originally described in my first post in this thread, right?
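Something like this, if it goes into the system prompt (the wording and the model name are just placeholders):

```python
def answer_with_system_reference(client, reference_material: str, user_query: str) -> str:
    system_prompt = (
        "You are a customer service agent. Answer the customer's questions using "
        "the reference material below. Treat it as your own knowledge; the customer "
        "did not write it and has not seen it.\n\n"
        "Reference material:\n" + reference_material
    )
    return client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_query},
        ],
    ).choices[0].message.content
```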
And that wouldn’t happen anyway, because the injection is gone by the time the user can ask about it. You’d have to be tricky enough to ask a question about an embedding that hasn’t been searched yet.
The point is trying to get an embedding that will give better similarity matches against the vector database. Even an example answer, or a poorly answered question, often turns out to match quite a bit better than a no-information user question.
Let’s take a user question, and imagine we have documentation with all the answers (or a forum with answers). Which of these is likely to get the appropriate answer from the knowledge base if we do a semantic similarity search between two texts: the question I ask the AI to expound on,
Write a two-paragraph article that expands on this subject or attempts to answer this question: Yeah, loading up a big history of “user” messages to submit, with each being a chunk of reference material, is definitely cleaner…than trying to “engineer” it all into one single block of text. I think what I suggested will work perfectly, but it’s simpler to just use a “fake” chat history which is really always interpreted as “context”.
or the AI’s response that we use to retrieve the documentation chunk?
In the realm of digital communication and information sharing, the choice between loading a comprehensive history of “user” messages as distinct reference materials or attempting to consolidate them into a single, cohesive block of text can significantly impact clarity and efficiency. The former approach, which involves presenting each message as a discrete chunk of reference material, offers a cleaner and more organized way of conveying information. This method not only simplifies the navigation for users but also fosters a structured exchange of knowledge. By providing context through individual messages, users can easily refer back to specific points in the conversation, making it an effective way to ensure that information is readily accessible and comprehensible.
However, an alternative solution lies in the concept of using a “fake” chat history as a means of perpetually maintaining context. This approach simplifies the presentation by consolidating information into a continuous flow, mimicking a chat conversation. While this method can be effective in some scenarios, it may risk sacrificing the precision and clarity that distinct reference materials offer. The challenge with this approach lies in ensuring that users consistently perceive the chat history as context and not as part of the actual conversation. Striking the right balance between the two approaches ultimately depends on the specific context of use and the desired user experience, with each approach offering its unique advantages in conveying complex information effectively.
This is assuming that the LLM’s response somehow matches the documents/chunks in the vector database better, and doesn’t shift away from the user’s actual query. It also adds an extra layer of complexity, which to me is just a different way to roll the dice.
If, let’s say, the transformation does work, what exactly does that mean? Why does a vector database, whose whole purpose is matching a user query to its intended document, not function as-is? What can I learn from this? That the user doesn’t know what they are searching for? That my chunking method is wrong? Considering that the database returns the best-matching results relative to the rest, I want to retain the semantic richness of the user’s query and not assume anything about it.
What about the specific terminology that’s used in the documents? What if some of the knowledge the user is looking for is information that the LLM is not aware of, such as product-specific information?
This adds so many variables to the equation that you can forget about further investigation and analytics.
In your example as well, I would prefer the first version. Sure, it’s not perfect or “systematic”, but it’s concise and reflects exactly what the user is looking for. In a database with a lot of nuanced information, this is critical.
I do understand that it’s the answer that is in the database, and that’s why it matches better even if the LLM answers incorrectly. I just don’t find it to be a viable starting point, and I definitely think it adds a layer of complexity that makes it harder to understand what’s going on behind the scenes.
It doesn’t matter what would get the best Google results or what a human likes. Semantic embedding search is a multifaceted result of the internal state of an AI after reading the text, a state that contains the activation levels of deep relations. It might have layers like “is this more like hillbilly text or dandy fop text”.
Yes, I’m aware of how semantic embeddings work, and also of HyDE. My main point is that transforming a user’s query (especially as a starting point, without even trying the database first) seems like a bad idea. It adds a potentially unnecessary layer of complexity that makes it harder to understand what’s going on behind the scenes. It can also “dilute” the meaning behind the user’s query.
It also relies on the LLM answering the question in such a way that the answer aligns with the documents better than the question does. So, besides shifting and changing the prompt, there really isn’t more room for improvement. If you find that HyDE is failing, what’s next?
It definitely has its usefulness, don’t get me wrong. Which is why I said “don’t go with it initially”.
For example, I have a very nuanced database that needs to return precise information. I rely heavily on keywords because of this (but also require semantic embeddings to understand the question). This database contains misspelled product names and dimensions that can drastically alter the results. This is a clear-cut case where using an LLM to alter the user’s query would be a terrible decision.
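For what it’s worth, the kind of weighting I mean looks roughly like this (the weights are made up and would need tuning):

```python
def hybrid_score(chunk_text: str, semantic_score: float, query_terms: list[str],
                 keyword_weight: float = 0.5) -> float:
    # Exact-match keyword hits protect misspelled product names and dimensions
    # that an LLM rewrite (or a pure embedding) would likely "correct" or blur.
    text = chunk_text.lower()
    hits = sum(1 for term in query_terms if term.lower() in text)
    keyword_score = hits / max(len(query_terms), 1)
    return (1 - keyword_weight) * semantic_score + keyword_weight * keyword_score
```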
If it sounds like I’m repeating myself, I know. You didn’t address anything I said and instead went off on a tangent.
Agreed, the modifications shouldn’t be made unless they’re deemed necessary. Keep It Simple, Stupid!
There are cases, though, when your documents are super technical (legal jargon, for example) and the user’s query has a hard time lining up with the data.
So you probably need a translation layer in that case, but still, start out and see what the situation is. There is no one who can say, yes, 100%, start out by translating the query.
It seems like you’re disagreeing with a part or all of what I said, but I can’t tell what any of your reasoning was, because you provided none, and what you did say had nothing to do with what we were discussing, as far as I can tell.
I mean, it is a valid technique among others, but there are also places, like a chatbot, where it isn’t appropriate due to the latency. Input transformation would be better placed behind an AI function call, so that its use is limited and the user gets feedback that “thinking” is going on for that particular question.
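Sketched roughly (the tool name and schema here are invented for illustration, and the model names are placeholders):

```python
import json
from openai import OpenAI

client = OpenAI()
user_query = "how much is model x-35 per 1000?"  # example input

# Hypothetical tool definition; the model decides whether a lookup is needed at all.
tools = [{
    "type": "function",
    "function": {
        "name": "search_knowledge_base",
        "description": "Search the product documentation when the answer is not already known.",
        "parameters": {
            "type": "object",
            "properties": {
                "search_query": {
                    "type": "string",
                    "description": "A rewritten, self-contained search query for the documentation.",
                },
            },
            "required": ["search_query"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model
    messages=[{"role": "user", "content": user_query}],
    tools=tools,
)

# Only when the model chooses to call the tool do we pay for the transformation,
# and the UI can show a "thinking" indicator for just that question.
tool_calls = response.choices[0].message.tool_calls
if tool_calls:
    rewritten_query = json.loads(tool_calls[0].function.arguments)["search_query"]
```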
Another approach is more practical: embed synthetic question sets.
One doesn’t need to do vector math to combine multiple returns or sets either; you can have multiple embeddings for a chunk that would all trigger the return of that single chunk. And you could tune the threshold for different types of semantic search…
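A rough sketch of both ideas together (the question-generation prompt and the model names are placeholders): several question embeddings all map back to the same chunk ID, so no vector math is needed to merge them.

```python
from openai import OpenAI

client = OpenAI()

def index_chunk_with_synthetic_questions(chunk_id: str, chunk_text: str, index: list[dict]) -> None:
    # Ask the model for a handful of questions this chunk could answer.
    questions = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[{
            "role": "user",
            "content": "Write 5 short questions, one per line, that the following text answers:\n\n" + chunk_text,
        }],
    ).choices[0].message.content.splitlines()

    # Store one embedding per synthetic question (plus the chunk itself),
    # all pointing back to the same chunk_id.
    for text in [chunk_text] + [q for q in questions if q.strip()]:
        vector = client.embeddings.create(
            model="text-embedding-3-small",  # placeholder model
            input=text,
        ).data[0].embedding
        index.append({"chunk_id": chunk_id, "vector": vector})
```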