Maintaining context with RAG (semantic search) when the user query refers back to the previous conversation, for example: "How much does the second option cost?"

How can you maintain context when you have a large source text and a user query that refers back to some part of the previous conversation, but you have to perform semantic search and can't use the chat history directly? I tried some techniques, for example the query rewriting that LangChain offers, or spaCy-based approaches, but none of them worked reliably for me.

1 Like

Welcome to the community!

I'm slightly confused by the question here. The context is the chat history. When you're talking about large source texts, are you dumping the entire contents of that text per query? What exactly is your setup here? Referential language isn't typically a problem; it's the specificity of that language that's usually missing or misaligned.

3 Likes

Hello Macha, I truly appreciate your help. At the moment I am experimenting with a conversational chatbot over a large source text file with hundreds of pages: you can ask questions about its content, and the chatbot should answer them on the basis of that source. Because the file is far beyond the available GPT context window, the bot has to do a semantic search for every query. That is not a problem as long as the question names the relevant object explicitly (a concrete noun, not a pronoun), e.g. "How much does the Les Paul guitar cost?" But in real conversations you face questions that just refer back to an object mentioned earlier, for example: "How much does it cost?", "Can you recommend a similar solution?", "Is the third option the best one?" etc. If I run the semantic search with such questions, the extracted paragraphs won't be a suitable basis for the bot to generate an acceptable answer. The query first has to be reformulated so that the relevant parts of the source file can be retrieved, but I haven't found a solution to this problem yet.

There are two ways to do this. Number one is, as @Macha suggests, sending the chat history with every request. The second is one I’ve used for some time now, and that is generating a Standalone question to send as your prompt: How to construct the prompt for a standalone question?

And, of course, using both methods together.

This is the classic use case for sending the Chat History with your requests.

That would be the Standalone question. That is, a question which contains the context of the conversation but can stand on its own. Combine that with the actual Chat History, and that should solve your problem.
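To make the idea concrete, here is a minimal sketch of generating a standalone question with the OpenAI Python SDK. The prompt wording, model choice, and the `chat_history` structure (a list of role/content dicts) are my assumptions, not a prescribed implementation:

```python
# Minimal sketch: rewrite a referential follow-up into a standalone question.
# Assumes the openai Python SDK (>= 1.x) and a `chat_history` list of
# {"role": ..., "content": ...} dicts accumulated during the conversation.
from openai import OpenAI

client = OpenAI()

def make_standalone_question(chat_history: list[dict], question: str) -> str:
    history_text = "\n".join(f"{m['role']}: {m['content']}" for m in chat_history)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[
            {"role": "system", "content": (
                "Given the chat history and a follow-up question, rewrite the "
                "follow-up as a standalone question that contains all the context "
                "needed to understand it. Reply with the rewritten question only."
            )},
            {"role": "user", "content": f"Chat history:\n{history_text}\n\nFollow-up question: {question}"},
        ],
    )
    return response.choices[0].message.content.strip()

# e.g. "How much does it cost?" -> "How much does the Les Paul guitar cost?"
```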

2 Likes

Interesting technique… one tweak I'd probably make is that instead of passing your standalone question to the model as the prompt, I'd just use it to query the vector store. You don't really want to alter the flow of the conversation history; what you want to do is create a sliding window over the grounding data needed to answer the next question. I'll give you an example…

Let's say you have this sequence of questions (I call them goals): "create an org chart for the sales department", "can I get that as a markdown table", "drop the email column and add the engineering department". This is actually one of the test cases for our system…

It's obviously important that the conversation history be preserved as is. There are a lot of nuances in the sequence. What you want to happen is for the supporting data needed to answer the next question (achieve the next goal) to shift as needed. This standalone question idea seems like an interesting way to achieve that.
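Roughly what I mean, as a sketch: use the standalone question only for retrieval, and send the conversation to the model untouched. `make_standalone_question` and `vector_store.search` are assumed helpers here, not real APIs:

```python
# Sketch of the "sliding window over grounding data" idea: the standalone
# question drives retrieval, while the original chat history is preserved as is.
def answer(chat_history, question, vector_store, client):
    standalone = make_standalone_question(chat_history, question)

    # Retrieve grounding passages with the standalone question, not the raw follow-up.
    passages = vector_store.search(standalone, top_k=5)          # assumed helper
    grounding = "\n\n".join(p["text"] for p in passages)

    messages = [
        {"role": "system", "content": f"Answer using only this context:\n{grounding}"},
        *chat_history,                                  # conversation flow preserved as is
        {"role": "user", "content": question},          # the user's original wording
    ]
    reply = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    return reply.choices[0].message.content
```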

2 Likes

We're using something I refer to as "rolling context": for each question I do a VSS lookup and associate some 2,000 to 4,000 fresh tokens with the messages as new context relevant to the last question. Combined with applying a threshold during matching, this means the user can ask questions such as "what about the second option", referring back to an answer from maybe 5 to 10 messages earlier in the history.

Then, as we start using too much context for OpenAI to handle, we "prune" messages from the beginning, so earlier parts of the context drop out as the user keeps asking additional follow-up questions…

It's a drag to calculate, since you need to compute the total context size, subtract your request tokens, and if it overflows, "prune" older messages until you're below the max threshold. But it keeps the conversation highly "fluent" in nature and its flow feels very natural…
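For illustration, a minimal sketch of that pruning step using tiktoken; the token budget and the "drop the oldest turn after the system message" policy are assumptions, not our exact implementation:

```python
# Sketch: count message tokens and prune the oldest turns until the budget fits.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(messages: list[dict]) -> int:
    # Rough estimate: tokens in each message's content (ignores per-message overhead).
    return sum(len(enc.encode(m["content"])) for m in messages)

def prune(messages: list[dict], max_tokens: int = 8000) -> list[dict]:
    """Drop the oldest non-system messages until the conversation fits the budget."""
    pruned = list(messages)
    while count_tokens(pruned) > max_tokens and len(pruned) > 2:
        del pruned[1]  # keep the system message at index 0, remove the oldest turn after it
    return pruned
```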

Most of our customers go "wow, this is the best one I've ever tried" when they contact us, at least, which counts for something I guess 🙂

You can try it out here if you wish https://ainiro.io

1 Like

@banaiviktor I was going to ask if you've tried simply adding the conversation history for the last few questions to your embedding vector? Ignore the answer part of the history for a moment: if you take the last 3 user questions and generate an embedding for that, the context that's retrieved should be an aggregate of the concepts in those questions. Said another way… if the user has spent the last 3 turns talking about the same topics, then you should get roughly the same context retrieved for each question. In your scenario where the user asked "how much is the second one?", you should fetch the same basic context you fetched in the previous query, because you're including the concepts covered by that query.
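Something like this, as a sketch; the embedding model and the `vector_store.search_by_vector` helper are assumptions for illustration:

```python
# Sketch: embed the last three user questions together so the retrieved
# context stays stable across referential follow-ups.
from openai import OpenAI

client = OpenAI()

def retrieve_with_recent_questions(chat_history, current_question, vector_store, top_k=5):
    user_questions = [m["content"] for m in chat_history if m["role"] == "user"]
    query_text = "\n".join(user_questions[-2:] + [current_question])  # last 3 questions total

    embedding = client.embeddings.create(
        model="text-embedding-3-small",
        input=query_text,
    ).data[0].embedding

    return vector_store.search_by_vector(embedding, top_k=top_k)  # assumed helper
```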

The standalone query idea that @SomebodySysop proposed would work better for multi-hop scenarios where you want to pivot the dataset in some way. For example, if I ask "what's the capital of Kansas" followed by "what's their average income per capita?", I can't use the data retrieved for the first question to answer the second one. I need to pivot the data.

Both of these approaches create a sliding window of grounding data. By combining previous questions you're stabilizing your data window and slowing down the rate at which it changes. By creating a standalone query you're pivoting the data window to support multi-hop queries. As @SomebodySysop suggests, the best approach might be combining the techniques.

1 Like

Which is why I use the question and chat history to create a query “concept” which is what is sent to the vector store to retrieve the context documents.

1 Like

Which is precisely the goal of the standalone question. Here is the video I did on the subject: https://www.youtube.com/watch?v=B5B4fF95J9s

It re-works the next question to include the context of the conversation. So, it would send something like: “what’s the average income per capita for (the capital of Kansas)?”

Thanks so much for your replies. (Additional info: I am trying to create the chatbot in Hungarian, not in English.)

Actually, I tried appending earlier responses or questions to the current query that refers back to the earlier conversation, before sending the input to the ChatGPT model. With this approach the quality of the retrieved information depended on which retriever technique I used.
Now I am using flashrank with ranker = Ranker(), as it is quite fast and doesn't require powerful hardware, but the results are not great. If I use the more powerful version, ranker = Ranker(model_name="ms-marco-MiniLM-L-12-v2", cache_dir="/opt"), the retrieved paragraphs are a slightly better basis for the GPT model to create a good response, but it is very slow and hardware intensive.
Earlier I tried bi_encoder = SentenceTransformer('paraphrase-multilingual-mpnet-base-v2') and cross_encoder = CrossEncoder('nreimers/mmarco-mMiniLMv2-L12-H384-v1') with chromadb or pinecone, but retrieval speed was very slow and precision wasn't good.
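For context, this is roughly how that flashrank reranking step is wired up (a minimal sketch; the passages here are placeholders, in the real pipeline they come from the first-stage vector search):

```python
# Sketch: rerank first-stage retrieval candidates with flashrank.
from flashrank import Ranker, RerankRequest

ranker = Ranker(model_name="ms-marco-MiniLM-L-12-v2", cache_dir="/opt")

query = "How much does the Les Paul guitar cost?"
passages = [
    {"id": 1, "text": "The Les Paul guitar is listed at ...", "meta": {"page": 12}},
    {"id": 2, "text": "Shipping and warranty terms ...", "meta": {"page": 87}},
]  # placeholder candidates from the vector store

request = RerankRequest(query=query, passages=passages)
reranked = ranker.rerank(request)  # passages ordered by relevance score
```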

I also tried reformulating the query into a standalone question on the basis of the chat history, using GPT-3.5 Turbo and GPT-4o mini. Sometimes the reformulation was excellent, sometimes quite bad, despite the fact that I tried writing many versions of the instructions. As I understood it, you reformulate the referential query in multiple steps, right?

So far I haven't reached a reliable solution that works well in, say, 90% of cases.

I am very interested in this topic, so if one of you is willing to provide a consultation on it, of course not free of charge, I would be really grateful.

2 Likes

Did you try regular 4o, or even something like Claude? For reformulation like this, you would likely need a bit more reasoning power than 3.5/4o mini can provide.

The problem with referential queries like the ones you're describing is that they can be ambiguous, especially once the conversation extends beyond, say, 3 queries. "Tell me more about the second option" could mean:

  • the second suggestion in the conversation
  • the second value in a list

If there’s more than one list in a conversation, this could also lead to ambiguity.

Other posters suggested things like pruning and reformulating a specific query or its content, but I still don't see why this can't be solved by tweaking the prompt into something more detailed than "that thing over there".

Since we're talking about extracting stuff, why not parse a generated list (since it's likely already formatted in md) and save it as a separate entity linked to the conversation? You can vectorize the list object (not the entire response) and retrieve it when relevant. If there's more than one list, you can feed the user's query and the list objects to a smaller model like you were doing, tell it to "pick the most appropriate one", and feed the chosen list + user query to the actual larger model.
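A rough sketch of the parsing side; the regex, and the `embed`/`store` helpers, are illustrative assumptions rather than a worked-out design:

```python
# Sketch: pull markdown list blocks out of an assistant reply so each list
# can be stored and vectorized as its own entity linked to the conversation.
import re

LIST_ITEM = re.compile(r"^\s*(?:[-*+]|\d+[.)])\s+.+$")

def extract_lists(response_text: str) -> list[str]:
    """Group consecutive markdown list lines into separate list blocks."""
    blocks, current = [], []
    for line in response_text.splitlines():
        if LIST_ITEM.match(line):
            current.append(line)
        elif current:
            blocks.append("\n".join(current))
            current = []
    if current:
        blocks.append("\n".join(current))
    return blocks

# For each assistant reply (assumed helpers):
# for block in extract_lists(reply):
#     store(conversation_id, embed(block), block)
```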

That is the best workaround I can think of programmatically. But to be honest, the simpler solution is to just use less vague language. Models have never liked vagueness.

4 Likes

Since I heard that you’re building a chatbot in Hungarian, I’m curious - are you able to accurately extract named entities (such as proper nouns or specific terms)?

Regardless of the method used to rank related documents, if named entities are not accurately extracted, the model may struggle to correctly reference those entities.
This also means that it depends on the model’s ability to understand context.

For example, consider the term “Margitsziget”.
It could be interpreted either as the name of an island or as “Margit’s sziget”. (I don’t understand Hungarian myself, so I asked ChatGPT about Hungarian named entities).

You mentioned that you tried spaCy but did not get reliable results; have you tried training spaCy specifically on Hungarian named entities?
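A minimal sketch of what in-code training could look like in spaCy 3.x; the single annotated sentence is just an illustration, and a real project would use `spacy train` with a proper annotated corpus and config:

```python
# Sketch: train a blank Hungarian NER pipeline on a handful of annotated examples.
import random
import spacy
from spacy.training import Example

TRAIN_DATA = [
    # ("text", {"entities": [(start_char, end_char, label)]}) -- illustrative only
    ("A Margitsziget Budapest közepén fekszik.", {"entities": [(2, 14, "LOC")]}),
]

nlp = spacy.blank("hu")
ner = nlp.add_pipe("ner")
for _, annotations in TRAIN_DATA:
    for _start, _end, label in annotations["entities"]:
        ner.add_label(label)

optimizer = nlp.initialize()
for _epoch in range(30):
    random.shuffle(TRAIN_DATA)
    losses = {}
    for text, annotations in TRAIN_DATA:
        example = Example.from_dict(nlp.make_doc(text), annotations)
        nlp.update([example], sgd=optimizer, losses=losses)
```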

Also, after extracting the named entity, you need to enclose it in double quotes so that it is recognized as a single expression.

If there are no problems with the named entity extraction, another approach might be to record what “it” refers to as a topic.

Either way, achieving 90% accuracy is quite a challenge.

5 Likes

Thank you Macha so much 🙂 I haven't tried Claude or GPT-4o due to the higher cost, but I will try them out. To avoid using GPT-4o or Claude on every query, I will train a detector for queries with referential language and only use the stronger models for reformulation in those cases; otherwise I'll use 4o mini or 3.5 Turbo. I also tried extraction, but unfortunately I haven't found a reliable NER tool for Hungarian; however, Dignity_for_all mentioned a promising possibility in their post.

1 Like

Hello Dignity, thanks for your reply 🙂 Yes, I am struggling to extract named entities properly in Hungarian at the moment, but I like the idea of training spaCy on Hungarian named entities. I will try it out; I truly appreciate the documentation you sent.

And at the moment I have found a transformers model trained on Hungarian which extracts named entities quite well 🙂 (Hugging Face, NYTK, named-entity-recognition-nerkor-hubert-hungarian).
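For anyone curious, loading it is roughly this (a sketch; I'm assuming the Hugging Face model id is NYTK/named-entity-recognition-nerkor-hubert-hungarian and the example sentence is just an illustration):

```python
# Sketch: run the Hungarian NER model via the transformers pipeline API.
from transformers import pipeline

ner = pipeline(
    "ner",
    model="NYTK/named-entity-recognition-nerkor-hubert-hungarian",  # assumed model id
    aggregation_strategy="simple",
)

print(ner("Mennyibe kerül a Les Paul gitár a Margitszigeten?"))
# -> list of dicts with entity_group, word, score, start, end
```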

1 Like

Thanks, Thomas, for your valuable insights and solutions! I like them very much!

1 Like