Strategies for long RAG conversations

We are building a RAG in Discourse.

One thing we noticed really quickly is that it is critical to get the retrieval piece right. If we do not find relevant documents then there is no chance we can answer a question.

At the moment we have a very naive implementation for querying:

  1. Look at the previous 5 interactions the user had with the model
  2. Run a vector similarity search
  3. Stuff the results into the system prompt
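Sketched in code, the naive pipeline looks roughly like this (a toy sketch: `embed` is a stand-in for the real embedding model, and `naive_retrieve` for however Discourse actually assembles the prompt):

```python
import math

def embed(text):
    # toy stand-in for a real embedding model
    vec = [0.0] * 8
    for i, ch in enumerate(text.lower()):
        vec[i % 8] += ord(ch) / 1000.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def naive_retrieve(history, chunks, top_k=3):
    # 1. look at the previous 5 interactions
    query = "\n".join(history[-5:])
    # 2. rank every chunk by vector similarity to the raw history
    qv = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(qv, embed(c)), reverse=True)
    # 3. stuff the winners into the system prompt
    context = "\n\n".join(ranked[:top_k])
    return "You are a helpful bot. Use these documents:\n\n" + context
```

The failure mode is baked into step 1: once the user changes subject, the last five turns drag stale text into the query.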

Sadly this only works well for the first couple of interactions; then it falls apart, because the user may have changed subjects and now we are pulling all the wrong chunks.

I have been thinking about using 2 approaches to compensate.

  1. LangChain-style “question consolidator”: look at the previous N interactions and try to extract a very clear standalone question, which we can then use for vector similarity.

  2. LangChain-style “history consolidator”: so we don’t send as much history to the LLM; instead we send it just the “RAG document chunks”, the “consolidated history” and the “consolidated question”.

My questions here are:

  • Should we bother with a history consolidator or not?

  • Are there any other approaches we are missing?

  • Is “question consolidation” the right thing to do? Could the LLM do better if it is instructed to just come up with a bunch of related keywords rather than a coherent question?


This is just my opinion, and it may not be very helpful, but I believe focusing on generating a bunch of related keywords might be more efficient than consolidating history, which may not always be necessary.

On the other hand, it’s better to ask users for coherent questions, and each different question should be treated separately.

Constantly referencing past interactions during RAG can complicate searching if the user changes the topic midway.


Yeah, it appears retrieval can be nailed with question consolidation; it works really well. Take this exchange:


  • What is the biggest city in France
  • Paris
  • What about Germany

You need consolidation; otherwise you have no real chance with retrieval.

While implementing this, one thing I noticed that is quite important: simpler models like GPT-3.5 or Haiku can do a good enough job consolidating questions, so allowing mixing and matching of models here can make a big difference.

Another thing I noticed: token counts can easily go through the roof in RAG systems, especially if you replay too much history, and RAG answers tend to be longer…

I suspect this wouldn’t be that much of an issue if we had more powerful, instructable embedding models.

te3 large is showing some signs of that, but it’s still weak.


The tricky thing though is the enormous asymmetry:

A lot of these RAG usages are:

  • User types in 10 words
  • Model types 300 words.

So this instruct-tuned model would need both an enormous context (in 5 or 6 interactions I can easily get to 10k tokens) and the smarts to just throw away 99.9% of the tokens you give it.

My strategy here has been to:

  • Limit my “figure out what the question is” pass to 2500 tokens
  • Use a cheaper model
  • Limit tokens on model replies to 500 and truncate the rest

But yeah, it’s a shame to need an extra call here to GPT-3.5. That said, one upside is that the model can reply “NO_QUESTION” to an “oh thank you bot”, and then you don’t need to send all those RAG tokens to an expensive model.
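That strategy is easy to sketch. Here `rough_tokens` is a crude word-count stand-in for a real tokenizer, the limits mirror the ones above (2500-token question budget, 500-token reply cap), and `needs_retrieval` shows the NO_QUESTION short-circuit:

```python
def rough_tokens(text):
    # crude heuristic: ~1.3 tokens per whitespace-separated word
    return int(len(text.split()) * 1.3)

def trim_history(turns, reply_limit=500, budget=2500):
    """Truncate long model replies, then keep the most recent turns
    that fit in the overall token budget."""
    trimmed = []
    for role, text in turns:
        if role == "assistant" and rough_tokens(text) > reply_limit:
            words = text.split()
            text = " ".join(words[: int(reply_limit / 1.3)]) + " …"
        trimmed.append((role, text))
    kept, used = [], 0
    for role, text in reversed(trimmed):
        cost = rough_tokens(text)
        if used + cost > budget:
            break
        kept.append((role, text))
        used += cost
    return list(reversed(kept))

def needs_retrieval(consolidated_question):
    # the cheap model answers NO_QUESTION for "oh thank you bot" style turns
    return consolidated_question.strip() != "NO_QUESTION"
```

When `needs_retrieval` comes back false, you skip both the vector search and the expensive completion entirely.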


Large embedding models should have the same attention mechanisms that our LLMs have.

If GPT-3.5 can create a summary of it, I don’t see why GPT-3.5E (davinci-e, rip) shouldn’t be able to embed it.

I’m currently torn about taking a look at the mistral+ models, because no one offers a SaaS API for them at the moment, and VMs are super rare and expensive these days.

I don’t know how much access you have to compute, but if I manage to spin something up, do you want me to test something for you?

1 Like

See, with this example I think an approach that de-specifies the context would be helpful.

E.g., a small model to convert this into something like,

  • What is the [[extremum]] [[place type]] in [[place name]]
  • [[place name]]
  • What about [[place name]]

This works for moving from the biggest city in France to,

  • The biggest city in Germany (What about Germany?)
  • The smallest city in France (What about smallest?)
  • The biggest lake in France (What about lake?)


The basic idea here is to strip away as much of the specific semantic content as possible, since it might get in the way.
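A toy version of that de-specification, with a hand-written slot table standing in for the small model that would do the tagging:

```python
# hypothetical slot table; in practice a small model would do this tagging
SLOTS = {
    "biggest": "[[extremum]]", "smallest": "[[extremum]]",
    "city": "[[place type]]", "lake": "[[place type]]",
    "France": "[[place name]]", "Germany": "[[place name]]",
    "Paris": "[[place name]]",
}

def despecify(text):
    out = []
    for word in text.split():
        core = word.strip("?.,!")
        out.append(word.replace(core, SLOTS[core]) if core in SLOTS else word)
    return " ".join(out)
```

“What about Germany?” and “What about lake?” now collapse onto the same structural template, so the match happens on the shape of the question rather than on the specific entities.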

The other thing I might suggest is to do some behind the scenes re-writing of user messages. Basically, you’d have a list of what they actually wrote, then you would have a list of re-written messages which you actually send to the models.

For instance, in this example, you might ask a smaller model “What does the user want to know here?”

So, after you consolidate whatever you need to, the small model has the context,

  • What is the biggest city in France
  • Paris

And you ask it: given this context, rephrase this user input to be a complete question that does not rely on any external information: “what about Germany?”

Ideally, your smaller model would respond with “What is the biggest city in Germany?”, which you would then pass on to the more robust model to answer.

This keeps the number of tokens being fed into the expensive model low.
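The rewrite step itself is mostly prompt plumbing. A sketch of how the prompt for the small model might be assembled (the wording is illustrative, not anything Discourse actually uses):

```python
def build_rephrase_prompt(context_turns, user_input):
    """Build the prompt asking a small model to turn the latest user
    input into a standalone question."""
    context = "\n".join("- " + turn for turn in context_turns)
    return (
        "Given this conversation so far:\n"
        + context
        + "\n\nRephrase the following user input as a complete, standalone "
        "question that does not rely on any external information:\n"
        + user_input
    )
```

The small model's output then feeds both the vector search and the expensive model, so the raw turn list never reaches the latter.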


Contrived, but still proving me right :laughing: (Salesforce Mistral embeddings):


    statement_embedding = create_embedding(
        "You are a Q/A bot. You are confronted with a conversation and your job is to identify the next response. \n\n" +
        "Query:\n" +
        …
    )

corpus (just cities):

    cities = [
        "New York City",
        "Los Angeles",
        …
    ]
but it does break down.

maybe that’s just a tough question…

but it gets it wrong… or does it?

I thought LA was the second-largest city. But Houston has 6.8 million people? Houston metro 8.2?

turns out figuring the second largest city is not exactly trivial :thinking:

but it does seem to understand the question (I added “A: The Largest City” and “A: The Second Largest City”)


This is different from the discussion about embeddings, but I think one efficient approach to building a Q&A bot is this: use low-cost models like GPT-3.5-turbo or Haiku to read the history of past conversations, grasp the intent of the user's questions, and condense the user's last statement into a well-formed question; then use a model like GPT-4 to structure the final answer with RAG.

Yeah, regarding low-cost (or OpenAI) embeddings: te3L isn’t capable of this at all (may need more investigation).

but it might “know” more.


So, you mean that if embedding models were to increase in size, they would essentially be the same as low-cost LLMs?


Well, we had davinci embeddings, but those got memory-holed.

I’m starting to wonder if that was because of safety.

GPT-4 embed would be amazing.


My two cents:

**Should we bother with a history consolidator or not?**
I think for our purposes, we're not concerned with the coherence of the ChatBot's reasoning and responses. Instead, we're primarily focused on how accurately our retrieval system (e.g., vector database) can retrieve the right information.

Using a summary or consolidation seems better from a semantic perspective. For instance, consolidating the sequence:

  • "What is the biggest city in France?"
  • "What about Germany?"

into "What is the largest city in Germany?" clearly carries richer semantics than just "What about Germany", which would retrieve all information related to Germany.

**Pros:** Enhanced semantics.
**Cons:** Requires an additional call to the LLM.

**Are there any other approaches we are missing?**
a) You might want to employ additional filters (such as keywords) besides just vector similarity.
b) You could also leverage the knowledge embedded in the LLM to answer questions by utilizing the model's confidence level in its responses.
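Point (a) could be as simple as a keyword prefilter ahead of the vector search. A sketch, with a placeholder stopword list and naive whitespace tokenization:

```python
STOPWORDS = {"what", "is", "the", "in", "a", "an", "of", "about"}

def keywords(text):
    return {w.lower().strip("?.,!") for w in text.split()} - STOPWORDS

def keyword_prefilter(query, chunks):
    # keep only chunks sharing at least one non-trivial query keyword;
    # vector similarity would then rank the survivors
    q = keywords(query)
    return [c for c in chunks if q & keywords(c)]
```

This cheaply throws out chunks with zero lexical overlap before the (comparatively expensive) similarity ranking runs.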

@sam.saffron, along the lines of question consolidation, would it make sense to first ask the Chat model what additional information it thinks it needs to answer the question(s); then, embed the answer(s) provided by the LLM? This could be done separately.

Following the example posted earlier:

What is the biggest city in France
What about Germany

The question to the LLM would look like the following (and use a Langchain-like internal dialogue/reasoning format):

What additional information do you need to retrieve to answer the user's questions below via RAG? Please output the additional information as a JSON list, containing the field "standalone_question", so that the "addtl_info" can be separately similarity searched in my RAG process, without any extra context. 

After data is retrieved for each question, the information will be provided back to you

~~~ Questions ~~~
What is the biggest city in France
What about Germany

Testing this with GPT-3.5, I received the following answer:

    [
        {"standalone_question": "biggest city in France"},
        {"standalone_question": "biggest city in Germany"}
    ]

Answers in this format would be easy to iterate through via RAG, consolidate the info by question, and then send this all back to ChatGPT in a single shot for final processing. Alternatively, if the wrong (or not enough) info has been retrieved for answering these questions, one could set this up as a loop (with separate validation for each question) to ensure that all necessary info was retrieved before sending a final answer back to the user.
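That loop might look like this (a sketch; `retrieve` and `llm_answer` are stand-ins for the real similarity search and chat call, and the retry is just a bounded re-query rather than real per-question validation):

```python
import json

def answer_via_rag(standalone_questions, retrieve, llm_answer, max_rounds=2):
    """Retrieve chunks per standalone question, then send everything
    to the big model in a single shot."""
    gathered = {}
    for q in standalone_questions:
        chunks = []
        for _ in range(max_rounds):
            chunks = retrieve(q)
            if chunks:  # naive validation: did we get anything at all?
                break
        gathered[q] = chunks
    context = json.dumps(gathered, indent=2)
    return llm_answer("Answer using only this retrieved info:\n" + context)
```

Swapping the `if chunks` check for a proper per-question validator is where the loop idea above would plug in.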

Regarding token usage in the chat completions API, one could cut down on the size of the document chunks used for embeddings (particularly if the looped approach above were used).
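For instance, a smaller, overlapping word-window chunker (the sizes are arbitrary; real chunking would count tokens and respect document structure):

```python
def chunk_document(text, max_words=100, overlap=20):
    """Split a document into overlapping word windows; smaller chunks
    mean fewer tokens stuffed into the completion call per retrieval hit."""
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
        start += max_words - overlap
    return chunks
```

The overlap keeps sentences that straddle a boundary retrievable from at least one chunk.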


Yes, that works. In addition to that you can ask ChatGPT for a list of similar or extended solutions…

Also, old-school solutions built around “intent” and “entity” extraction might be worth reading about.