Chat history and semantic search

Hello there.
I’m developing an AI chatbot with a custom knowledge base using Pinecone and Langchain.
The problem I’m facing is related to chat history and semantic search.
When a user types a query, the system retrieves documents from the vector database by semantic search.
For example, if the user asks “What is PHP?”, the system gets relevant data from Pinecone, and then ChatGPT answers based on this data.
After that, if the user asks “How to use it?”, the system should retrieve data with the previous question in mind. In other words, it should search for “How to use PHP?”.
But I don’t know how to implement this.
I’m using Langchain, so if anyone knows how to solve this problem, please let me know.

  1. I would only use vector database retrieval to augment the oldest chat session history knowledge, by putting back user/assistant conversation pairs, in the original ordering. Keep a good portion of recent conversation as chronological turns.
  2. Embed the user questions, but store and return the user/assistant pairs as the data.
  3. You can do embeddings on several recent user inputs as one context. That should bring back several similar user inputs. Bot responses might overwhelm what the user said and are different than user questions.
  4. The value of old stuff is overrated. Better is an AI classifier that tags new threads, roles, or identities and persists them, so if the user wants to write the rules of a roleplay game, they are always there until replaced by the user.

Thanks for your kind reply.
Could you explain in more detail?

Let’s have an AI explain in more detail:

Sure, let’s break it down:

  1. Vector Database Retrieval: This refers to using a database of previous conversations to supplement the chatbot’s memory. When the chatbot’s memory is limited, older conversations may be forgotten. To mitigate this, you can store the older conversations in a database and retrieve them when necessary. This can help the chatbot maintain continuity in long conversations.
  2. Embed User Questions: This means to convert user questions into numerical form (or vectors) that can be processed by the AI. This is useful for understanding the context of the user’s questions and providing relevant responses.
  3. Embeddings on Several Recent User Inputs: This is similar to the previous point, but instead of just embedding the current user input, you embed several recent inputs. This can help the chatbot understand the context of the conversation better, especially if the user’s questions or statements are related or build upon each other.
  4. AI Classifier for New Threads, Roles, or Identities: This refers to using an AI classifier to tag new topics or roles in the conversation. This can be useful for maintaining context in complex conversations where the user switches between different topics or roles.
  5. Persisting Rules: If the user sets certain rules or preferences, these should be remembered by the chatbot until they are changed by the user. This can help in maintaining consistency in the chatbot’s responses.

Remember, the goal of these techniques is to provide a more robust memory for the chatbot, which can help it maintain context and continuity in long or complex conversations.

AI doesn’t really understand my main point though: you wouldn’t want the entire chat history to be just a “search”.

You can make a very capable chatbot with just 3-6 recent questions.

Like ChatGPT though, this can be unsatisfying when you named the AI “George” 10 turns before and it doesn’t remember. So you give a mechanism just to put a little bit of the very old stuff back that might become relevant again.

I think the simple answer to this is that ChatGPT inherently already remembers what is being discussed. So if your second message is “How to use it?” then based on the previous chat history of messages GPT will know the “it” means PHP.

The key point is that you do need to send the entire prior chat history along with each new prompt/query.

EDIT: Oh wait, I think you may mean, since you’re doing RAG, you’re asking how to keep the RAG-retrieved content from being ‘cumulative’ in the discussion, and holding in the context too much info that you no longer need as the conversation evolves.

In that case I think maybe after every N queries or so, asking GPT to “summarize the conversation so far, and be sure to include all important keywords and topics”, and then using that as the new context, might be able to keep context from snowballing. There are bound to be some “standard approaches” to this, so maybe I got out over my skis even trying to help. :slight_smile:
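
A minimal sketch of that summarize-every-N idea, not a definitive implementation. The `summarize` callable is a stand-in for whatever chat model call you use, and `compact_history` and `every_n` are names made up for this example.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]

# The instruction quoted in the post above.
SUMMARY_PROMPT = (
    "Summarize the conversation so far, and be sure to include "
    "all important keywords and topics."
)


def compact_history(
    history: List[Message],
    summarize: Callable[[List[Message]], str],
    every_n: int = 4,
) -> List[Message]:
    """Replace the history with one summary message once it holds every_n user turns.

    `summarize` is a hypothetical wrapper around your chat model: it takes a
    message list and returns the model's text reply.
    """
    user_turns = sum(1 for m in history if m["role"] == "user")
    if user_turns < every_n:
        return history  # not long enough yet, keep the turns as they are
    request = history + [{"role": "user", "content": SUMMARY_PROMPT}]
    summary = summarize(request)
    # The summary becomes the whole carried-over context for the next turns.
    return [{"role": "system", "content": "Conversation summary: " + summary}]
```

You would call this after each completed turn and use the returned list as the history for the next request. Whether the summary replaces everything or only the oldest turns is a design choice; this sketch replaces everything.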


Just “embeddings on several recent user inputs” should solve the problem of not getting relevant knowledgebase semantic search on user inputs like “what about the other one?”
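
As a rough sketch of what that could look like: build the text you embed from the last few user inputs rather than only the newest one. The function name `build_search_text` and the `n_recent` parameter are made up for this example, and the resulting string would then go to whatever embedding call you already use.

```python
from typing import Dict, List

Message = Dict[str, str]


def build_search_text(history: List[Message], new_input: str, n_recent: int = 3) -> str:
    """Join the last n_recent user inputs (including the new one) into one string.

    Assistant replies are left out on purpose, since they can overwhelm
    what the user actually said.
    """
    user_inputs = [m["content"] for m in history if m["role"] == "user"]
    user_inputs.append(new_input)
    return "\n".join(user_inputs[-n_recent:])


history = [
    {"role": "user", "content": "What is PHP?"},
    {"role": "assistant", "content": "PHP is a scripting language ..."},
]
search_text = build_search_text(history, "How to use it?")
```

Here `search_text` contains both “What is PHP?” and “How to use it?”, so the semantic search has “PHP” to work with even though the newest message never mentions it.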

I see what you’re saying but I also think a cumulative list of at least all the ‘nouns’ (i.e. main topics) could be maintained for each new embedding search, so that every time a new ‘context’ is generated based on embeddings it remembers all the way back. For example if they start with “Let’s discuss Elvis” and then always refer to “him” or “he” going forward, you cannot afford to lose “Elvis” as an embedding keyword as the conversation grows longer. So “recent history” won’t necessarily guarantee correct context.
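
One way to sketch that cumulative topic list, under the assumption that something else does the actual topic extraction. `extract_topics` is a hypothetical callable here (an LLM prompt or an NLP noun extractor, for example), and `TopicMemory` is a name invented for this example.

```python
from typing import Callable, List


class TopicMemory:
    """Keeps an ordered, de-duplicated list of main topics seen so far."""

    def __init__(self, extract_topics: Callable[[str], List[str]]):
        # extract_topics is a hypothetical helper supplied by the caller.
        self._extract = extract_topics
        self.topics: List[str] = []

    def search_text(self, user_input: str) -> str:
        """Record new topics, then return the text to embed for this turn."""
        for topic in self._extract(user_input):
            known = [t.lower() for t in self.topics]
            if topic.lower() not in known:
                self.topics.append(topic)
        if not self.topics:
            return user_input
        # Prefix every search with all topics so early ones are never lost.
        return "Topics: " + ", ".join(self.topics) + "\n" + user_input
```

With an extractor that finds “Elvis” in the first message, a later “When was he born?” is still embedded together with “Elvis”. In a long chat the list would need pruning at some point, which this sketch does not attempt.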

You might want to consider using HyDE.

For example:

Send “What is PHP?” to the raw AI and get an answer …
AI answer (not sent to user): “PHP is a popular general-purpose scripting language that is especially suited to web development. It originally stood for Personal Home Page, but now it stands for the recursive initialism PHP: Hypertext Preprocessor. It can be embedded into HTML. It is used to manage dynamic content, databases, session tracking, and even build entire e-commerce sites.”

Now correlate this with your data (RAG). Get your real data (from embeddings, keywords) then generate your real answer.

"PHP is … " ← real answer from your RAG

Then user asks “How to use it?” Send this to the hypothetical plane, but this hypothetical plane contains the user original question, and your real answer.

So …
User: “What is PHP?”
Assistant: “PHP is …” ← real answer
User: How to use this?
Hypothetical answer: "To use PHP, you do the following:

  1. Install a web server: PHP is a server-side scripting language. This means you will need a web server to execute PHP scripts. Apache and Nginx are popular options.

  2. Install PHP: Once you have a server, you should install PHP. You can download PHP from the official website."

So you then take the next hypothetical answer, correlate it, RAG it, and now you have a new answer from your data, not the AI knowledge.

You keep doing this, and essentially rinse and repeat: generate a hypothetical answer, look up, generate the real answer. In the history list, you quietly put your real answers in place of the hypothetical answers to maintain the correct context.

So HyDE and playing this game with Assistant (Hypothetical) and Assistant (Real) would help maintain correct context and expand your version via RAG.
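
The loop described above can be sketched roughly like this. Both `generate` (a chat model call) and `search` (an embed-and-query step against your vector store) are hypothetical callables you would supply, and the system prompt wording is only an assumption.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def hyde_turn(
    history: List[Message],
    user_input: str,
    generate: Callable[[List[Message]], str],
    search: Callable[[str], List[str]],
) -> str:
    """Run one turn: hypothetical answer -> lookup -> real answer."""
    question = {"role": "user", "content": user_input}

    # 1. Hypothetical answer, written with the real history as context.
    hypothetical = generate(history + [question])

    # 2. Search the knowledge base with the hypothetical answer, not the question.
    chunks = search(hypothetical)

    # 3. Real answer, grounded in the retrieved chunks.
    context = {
        "role": "system",
        "content": "Answer only from this context:\n" + "\n".join(chunks),
    }
    real_answer = generate([context] + history + [question])

    # 4. Only the real answer enters the history; the hypothetical is discarded.
    history.append(question)
    history.append({"role": "assistant", "content": real_answer})
    return real_answer
```

Because the hypothetical answer is never stored, the next call to `hyde_turn` sees only real question and answer pairs, which is the swap the post describes. Note that this makes two model calls per user turn.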


Sorry, to hijack the thread, but I have been experimenting how to implement HyDE and I do not know if I am doing it right. So from what you illustrate:

User: How to use this?
Hypothetical answer: "To use PHP, you do the following:

I will get embeddings of the hypothetical answer and use it to correlate it with my saved vector data, right? Then I should get a hit/result. Then pass it to Chat completions API for summary? I am getting unsatisfactory results so I do not know if I am doing HyDE correctly.

HyDE is an awkward name for an overly specific technique that may not apply to your type of knowledge, the embedding augmentation you already include, or the task you are running.

I give you ITTYBITBIG: input transformation targeting your basic ingested text by iterative generation.

Which is: Making an itty bit of input big. Do what works. I’m not going to give you rules.

The basic method is to have the AI write a preliminary answer based on what it knows, and perhaps what has been retrieved previously. That requires chatbot conversation context of the full quality to get the full quality hypothetical inference.

A preliminary answer even if the AI doesn’t know your company or your game world or whatever else can be very like the knowledge you seek to retrieve via semantic search.

Then you can go farther (but it is harder to do economically with today’s gpt-3.5-turbo that can’t follow instructions): instead of just the answer to free-form user questions, you can insert post-prompt “write full documentation about this subject”, “After answering, give five topic keywords about your answer”, “Also give me the chapter name if your answer was a chapter in a book”.

The goal of these examples that write more about user input is to match with similar augmentations you may have added to your embedding already: the section of the white paper, the title of the article, the paragraph summary of the whole source, the keywords extracted about the section that are specific to it beyond the whole article. The additional generation by AI that’s been embedded will be specific to your data and envisioned use, and so shall your ITTYBITBIG technique.

HyDE is a synonym generator, and also a steering engine. People really don’t understand HyDE very well in this context, so let me give an example.

Suppose you have a business (Platcorp) that offers cloud computing products. You have multiple offerings and you want to land in N different offerings.

So here is how you steer it to offering 1 (of N), your “serverless” product line. You steer it in this direction in the System message:

System: Generate the following answer from the perspective of how it can enable the user to develop on our platform.

Our platform is a serverless and database driven solution that scales to infinity with zero lag.

Now drop in the user:

User: What is PHP?

Now run search on this generated chunk, and the other N projections, and the original query (N + 1 total).

You now retrieve your top hits, feed this into the LLM as context, and drop the original user question in again to generate your final answer.
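
A small sketch of the “N + 1 searches, then top hits” step. The `search` callable is hypothetical and is assumed to return `(chunk_id, score)` pairs with higher scores meaning closer matches. Keeping the best score per chunk is just one possible merge rule, chosen here for simplicity.

```python
from typing import Callable, Dict, List, Tuple


def multi_query_search(
    queries: List[str],
    search: Callable[[str], List[Tuple[str, float]]],
    top_k: int = 3,
) -> List[str]:
    """Run every query, keep each chunk's best score, return the top_k chunk ids."""
    best: Dict[str, float] = {}
    for query in queries:
        for chunk_id, score in search(query):
            if score > best.get(chunk_id, float("-inf")):
                best[chunk_id] = score
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return [chunk_id for chunk_id, _ in ranked[:top_k]]
```

Here `queries` would hold the original user question plus the N steered generations. The returned chunk ids are what you look up and feed to the LLM as context for the final answer.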

Sending the System and User messages above to the LLM gives this hypothetical response:

Assistant: PHP is a server-side scripting language that you can use on our platform to develop web applications. You can utilize our serverless and database-driven solution to create scalable applications without worrying about server management or latency. We’ve made sure that even when the load of your application increases, our platform scales automatically, providing seamless performance. Essentially, you can focus on the development part while we take care of the maintenance, access, and scalability, thus ensuring smooth, uninterrupted development experience with unlimited growth potential.

Suppose this correlates well in your database with this chunk:

PHP is a popular open-source, server-side scripting language that is widely used in web development for creating dynamic web pages. In the context of Platcorp, PHP is used as a part of its offering to developers. The Platcorp platform provides a serverless architecture and database driven solution that allows developers to create, debug, and deploy PHP applications efficiently. Since the platform manages server infrastructure, developers can concentrate on writing applications in PHP rather than managing systems, which promotes faster development and deployment. The platform is also scalable and can handle significant traffic increases, ensuring a smooth user experience with zero lag.

Then you send another query to the LLM with this System:

Generate an answer from the following context. Mention how it relates to Platcorp if applicable.


PHP is a popular open-source, server-side scripting language that is widely used in web development for creating dynamic web pages. In the context of Platcorp, PHP is used as a part of its offering to developers. The Platcorp platform provides a serverless architecture and database driven solution that allows developers to create, debug, and deploy PHP applications efficiently. Since the platform manages server infrastructure, developers can concentrate on writing applications in PHP rather than managing systems, which promotes faster development and deployment. The platform is also scalable and can handle significant traffic increases, ensuring a smooth user experience with zero lag.

And this User:

What is PHP?

To get this final Assistant response, which is the only thing you send back to the user:

PHP is a popular open-source, server-side scripting language that is widely used in web development for creating dynamic web pages. In the context of Platcorp, it is used to enable developers to create, debug, and deploy applications efficiently without having to manage server infrastructure.

So you are steering and correlating to focused objects in your database, and limiting LLM drift and hallucinations that have nothing to do with your intended messaging or offerings.

Like I said above, the context is managed based on the real inputs and outputs in the final LLM call to maintain correct logic and history across time. The hypothetical embeddings (HyDE) are only internal for steering and not shown to the User. So in the final response above, this becomes an “Assistant” message in the official message history array, and all the HyDE stuff is completely hidden to the user.

PS. This is all part of my HyDRA-HyDE RAGamuffin stack I am developing. :scream_cat: :snake:


(Hypothetical) (document) (embeddings): create a plausible AI answer for the embeddings model instead of embedding on user language. AKA if the database has answers, they are going to be more similar to other answers than questions.

Yes, you certainly describe something different than the 2022 paper…or a demo of its hypothetical answers vs knowledge. And logic connected to imagination lets us know we can do better.

Different than the original paper or not. I hope the idea is pretty clear. Which is, you are aligning the query to your data using steering by the LLM.

PS. The original RAG paper used a fine-tuned seq2seq model :rofl: But hopefully the higher level concepts are soaking in, and can be used in the current modern context.

Thanks for your reply.
So, HyDE is the best choice.

You say in the first post you are “developing a chatbot”.

Having an AI write a whole response the user won’t see, and then running embeddings search on that, can cause an intolerable delay for the end user.

A better approach is to store pre-generated hypothetical questions alongside your chunks. Search your database with just a bit of context, match against those questions, and add that score to the score for the full chunk text. That can get you a full return with just the single embedding of the user input.
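
A rough sketch of that scoring, assuming each chunk carries one vector for its text and a list of vectors for its pre-generated questions. The field names `text_vec` and `question_vecs` and the `question_weight` parameter are made up for this example, and a real vector database would do the similarity search for you.

```python
import math
from typing import Any, Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def score_chunk(
    query_vec: Sequence[float], chunk: Dict[str, Any], question_weight: float = 0.5
) -> float:
    """Chunk text similarity plus a weighted bonus for its best-matching question."""
    text_score = cosine(query_vec, chunk["text_vec"])
    question_score = max(
        (cosine(query_vec, q) for q in chunk["question_vecs"]), default=0.0
    )
    return text_score + question_weight * question_score


def rank_chunks(
    query_vec: Sequence[float], chunks: List[Dict[str, Any]], question_weight: float = 0.5
) -> List[Dict[str, Any]]:
    """Sort chunks from best to worst combined score."""
    return sorted(
        chunks, key=lambda c: score_chunk(query_vec, c, question_weight), reverse=True
    )
```

The hypothetical questions are generated and embedded once at ingestion time, so at query time the only embedding call is for the user input and no extra LLM call sits between the user and the search.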


Based on your original question, the user’s questions are generic, which is typical, and are not aligning with your data, right?

So the solution is to use the LLM to steer the question into multiple aspects of your data, then run search, retrieve the top K chunks, and have the LLM respond with your data.

If there is a followup question, feed the previous history, and continue steering the followup. The history will maintain context, and the steering will keep your correlations high and hopefully on-message.

The steering can be on transforming any input (question or otherwise) from the user into your data.

Also, you want to quality check your answer based on previous history.

Every muffin has a top and a bottom. The top of the muffin consists of all the background projections (steerings) into your data. (This is the Hydra, the many headed monster/beast from Greek mythology).

Then there is the RAG part, which includes final answer generation, or list of candidate generations.

Finally, there is the bottom of the muffin. What is this for? Well, this is checking the quality of your answer given prior “approved” answers. It uses embeddings and closeness to give the confidence factor. The theory is that the input/output is a continuous function … so close inputs correspond to close outputs. So you use embeddings to validate this input/output closeness relationship based on prior expectations.
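
One possible reading of that closeness check, as a sketch only. It assumes you already have embedding vectors for the new question, the candidate answer, and a set of approved question/answer pairs. Multiplying the two similarities into a single confidence value is an assumption of this example, not something stated above.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def answer_confidence(
    question_vec: Vector,
    answer_vec: Vector,
    approved: List[Tuple[Vector, Vector]],
) -> float:
    """Confidence that the answer matches prior approved answers to similar inputs.

    Finds the approved pair whose question is closest to the new question,
    then checks how close the new answer is to that pair's answer.
    """
    if not approved:
        return 0.0
    nearest_q, nearest_a = max(approved, key=lambda pair: cosine(question_vec, pair[0]))
    input_sim = max(0.0, cosine(question_vec, nearest_q))
    output_sim = max(0.0, cosine(answer_vec, nearest_a))
    # Close input and close output give high confidence; either one far gives low.
    return input_sim * output_sim
```

A low value for a question that is very close to an approved one suggests the answer has drifted and may be worth regenerating or flagging. The threshold for that would have to be tuned on your own data.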

This is Hydra-RAGamuffin!!! (said like THIS IS SPARTA!)


This is good advice if latency is top priority. Hydra-RAGamuffin has fairly lax latency requirements.


One thing to consider is “standalone queries”.

Exactly this. So ideally you always want to decontextualize & distill the query: strip the noise, neutralize the tone, and keep the exact words that were used to respect the nuances (using GPT).

A HUGE benefit to this as well is that you can “fix” certain misconceptions. For example, when people use incorrect names or terms for products you can let GPT know to correct this.

Hell yeah. So might as well decontextualize the text while you’re at it.

For your question:

“What is PHP?” → Easy
“How do I use it?” → Converted to a neutral standalone is “How to use PHP?”
“Yeah, my day has been pretty good. How’s yours? Man, I have been lookin at your site and wondering HOW TF DO I USE PHP?” → Converted to a neutral standalone “How to use PHP?”
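
A minimal sketch of the standalone-query step, assuming a hypothetical `llm` callable that takes a prompt string and returns the model's text. The instruction wording and the name `standalone_query` are made up for this example.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]

# Assumed prompt wording; tune it for your own model and data.
REWRITE_INSTRUCTIONS = (
    "Rewrite the user's last message as a neutral, standalone search query. "
    "Resolve pronouns using the conversation, strip small talk, "
    "and keep the user's exact key terms."
)


def standalone_query(
    history: List[Message], user_input: str, llm: Callable[[str], str]
) -> str:
    """Ask the model to turn a context-dependent message into a standalone query."""
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in history)
    prompt = (
        f"{REWRITE_INSTRUCTIONS}\n\n"
        f"Conversation:\n{transcript}\n\n"
        f"Last message: {user_input}\n\n"
        "Standalone query:"
    )
    return llm(prompt).strip()
```

The returned string is what you would embed and send to the vector search, while the chat history itself keeps the user's original wording.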

Then you’ll need some juicy extra logic to handle ambiguous/strange/combination queries. I think the best bet is have GPT always ask for confirmation when it isn’t certain what the user is referencing. I mean, that’s what we do as well. Then you can also split the queries if they are combinations and query them separately.

For a chatbot which takes in all sorts of crazy shit I don’t think HyDE would be a good fit unless you wanted to clean the query AND then reflect it with GPT, which as mentioned is now a minimum of 3 GPT calls.

The Hydra still works off of this, and yields this hypothetical generated chunk:

I’m glad to hear that you’re interested in using PHP on our platform! PHP is a popular and versatile programming language that you can use to develop dynamic and interactive web applications. Whether you’re a beginner or an experienced developer, our platform can provide you with the tools and resources you need to get started with PHP development.

Here’s how our platform can enable you to develop with PHP:

  1. Serverless Architecture: Our platform follows a serverless architecture, meaning you don’t have to worry about managing servers or infrastructure. You can simply focus on writing PHP code and let us handle the rest. This allows you to develop applications more efficiently without the hassle of server maintenance.

  2. Scalability: With our platform, your PHP applications can scale to infinity. As your user base grows, our infrastructure automatically scales to handle the increased traffic. This ensures that your applications can handle any load, even during peak times, without any lag. This scalability eliminates the need for you to worry about infrastructure limitations and allows you to focus solely on your application’s functionality.

  3. Database Integration: Our platform provides seamless integration with various databases, including MySQL, PostgreSQL, and more. You can easily connect to and interact with your database using PHP. This allows you to store and

This, in theory, could also correlate with your standard answer, and give your standard response.

Assistant response out of the RAGamuffin from the “HOW TF DO I USE PHP?” question.

At Platcorp, using PHP is simplified through their provision of a serverless and database-driven architecture. This system allows you to create, debug, and deploy your PHP applications without the concern of server management. The scalability of the platform also ensures that it can handle increasing traffic effectively. As a result, you can focus solely on writing your PHP applications. This structure makes PHP a more accessible and straightforward language for web development.

The RAGamuffin guides the input, correlates, and checks the output, for all kinds of crazy queries.

Sorry, have to: :rofl:



It does-ish, but is it as effective as such a simple transform to “How to use PHP?”?

EDITED OUT. Did not realize that the context is a double dipping.

One of the most important things to consider, in my opinion, is that this distilled query can be worked on further: it allows simpler analytics and grouping of questions, and it does not drift. If the preferred document(s) aren’t returned, then work can be done to fix that.

If HyDE fails as well, what’s done? Match the document to whatever GPT is saying for the moment?


It’s data driven. So if your data answers the question, so does the RAGamuffin.

For example, here is the answer given that a tutorial exists in your database.

To use PHP, you first need a text editor like Sublime Text or Notepad++ and a local server environment like XAMPP or MAMP. Start by creating a PHP file, which should have the extension .php. In PHP, all scripts start with <?php and end with ?>. A simple PHP script to print “Hello, World!” would look like this:

<?php echo "Hello, World!"; ?>

You can save this file as ‘helloWorld.php’ in your local server’s root (htdocs for XAMPP). To see your output, start your local server and type localhost/helloWorld.php in the URL bar of your web browser.

PHP variables are represented by a dollar [‘$’] sign followed by the variable name, and they don’t need to be declared before adding values. Conditional statements and loops can also be utilized to perform different operations based on conditions, or repeat a block of code a set number of times.

Remember to pay attention to details, as PHP variable names are case-sensitive!

