I have a large database of documents (these “documents” are essentially web pages, all in HTML). They contain information about the business itself, and many of them contain similar information. What I want to do is create a chatbot on top of this database that can answer any question about the content of its documents.
Now, if I pass the correct information to GPT, it can answer the question easily. The main difficulty is how to get that relevant information, so that it can be provided to GPT. Right now, I’m chunking the documents, embedding those chunks, storing them in a vector database and, when a question is asked, I fetch the k-nearest embeddings from this vector database (I’m using text-embedding-ada-002 to create the embedding, by the way).
So, my questions are:
- How can I create the embeddings in the best way possible, so that the information retrieval step performs well? (I assume OpenAI, Google, etc. did something like this when crawling and scraping the web to fetch relevant information, but I can’t seem to find anything useful online.)
- This is a little off topic, but is there a rule of thumb for understanding why one embedding scores higher than another in the k-nearest embeddings search? In my experience, embeddings of very short texts tend to get higher scores. For example, if the question is “How can I make popcorn?”, the embedding of a 10-word sentence will score higher than the embedding of a 1000-word chunk of text (even if that chunk actually answers the question).
Thanks in advance
There are lots of ways to improve RAG (Retrieval Augmented Generation) performance, and the best approach varies from case to case. OpenAI recently posted a video where one of their developers explains the process of improving RAG performance: https://youtu.be/ahnGLM-RC1Y?si=BSj4bE5LUGQ6Wjjh&t=647
For my own application, I have made huge improvements with the following techniques:
- Using a reranker (I use the one from Cohere, but there are other models).
- HyDE (Hypothetical Document Embeddings). Basically, instead of searching for your question, you have a model (e.g., GPT-4) generate a hypothetical answer. Then you use that generated answer to search the database instead.
For example, my database consists of 10,000 statistical datasets, each with a title such as “Population by region, age, and gender.” When the user poses a question, the model generates a fake or hypothetical title of a dataset that is likely to contain the information it needs to answer the question. That fake/hypothetical title is then used to search the database instead.
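Here is a minimal sketch of the HyDE flow described above. It is not my production code: `hyde_search`, `toy_generate`, `toy_embed`, and the tiny `VOCAB` are hypothetical names made up for illustration. In a real setup, `generate` would call a chat model and `embed` would call an embedding model; here both are replaced by toy stand-ins so the example runs without any API.

```python
import math
import re
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hyde_search(
    question: str,
    generate: Callable[[str], str],   # question -> hypothetical document
    embed: Callable[[str], Vector],   # text -> embedding vector
    index: List[Tuple[str, Vector]],  # (stored text, stored embedding)
    k: int = 3,
) -> List[str]:
    """Search with the embedding of a generated answer, not of the question."""
    hypothetical = generate(question)
    query_vec = embed(hypothetical)
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# --- Toy stand-ins (assumptions, only here to make the sketch runnable) ---
VOCAB = ["population", "region", "age", "gender",
         "income", "household", "exports", "country"]


def toy_embed(text: str) -> Vector:
    """Bag-of-words counts over VOCAB; a real system would use an embedding model."""
    words = re.findall(r"[a-z]+", text.lower())
    return [float(words.count(w)) for w in VOCAB]


def toy_generate(question: str) -> str:
    """Canned hypothetical dataset title; a real system would ask an LLM."""
    return "Population by region and age"


titles = [
    "Population by region, age, and gender",
    "Household income by region",
    "Exports by country",
]
index = [(title, toy_embed(title)) for title in titles]
question = "How many people live in each part of the country?"

print(hyde_search(question, toy_generate, toy_embed, index, k=1))
```

In this toy setup, searching with the raw question would match “Exports by country” purely because of the shared word “country”, while the hypothetical title lands on the population dataset. Real embedding models are far less literal than a word count, so treat this only as an illustration of the idea.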
In your case, I think it would be very useful to write a script that extracts all text from the HTML before generating embeddings and adding it to your database.
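As a rough sketch of such a script, the following uses only Python’s standard-library `html.parser`. The class name `TextExtractor`, the helper `html_to_text`, and the sample page are made up for illustration, and dropping `script`, `style`, and `noscript` content is my assumption about what you would want to leave out. A dedicated HTML library may handle messy real-world pages better.

```python
from html.parser import HTMLParser
from typing import List


class TextExtractor(HTMLParser):
    """Collects visible text and ignores the content of script/style tags."""

    SKIP_TAGS = {"script", "style", "noscript"}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped tag.
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self._parts.append(text)

    def text(self) -> str:
        return " ".join(self._parts)


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML document as a single string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return parser.text()


sample = (
    "<html><head><style>p{color:red}</style>"
    "<script>var x = 1;</script></head>"
    "<body><h1>Opening hours</h1>"
    "<p>We are open <b>Monday</b> to Friday.</p></body></html>"
)
print(html_to_text(sample))
```

The text fragments are joined with single spaces, so headings and paragraphs run together on one line. If you want to chunk by section, you could instead record which tag each fragment came from.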
When it comes to the embedding scoring process, I’m not an expert, but I think it works something like this:
- Embeddings are vector representations of strings. Different embedding models will generate different vector values for the same string.
- If two strings have similar meanings or content, their vectors will be closer to each other in space compared to two strings that do not. This means that even if the words are different (e.g., “apples are grown during the summer” and “The hottest season is often when pears are cultivated”), the distance between their calculated embedding vectors will still reflect the fact that the information is similar.
As for how the length (or rather the number of tokens) of a string affects the expected vector similarity, I really can’t say. But if you’re actually chunking your text before adding it to the database, I can’t see why some of your embedded chunks would have “10 words” and others “1000 words”.
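One way to keep chunk sizes roughly uniform is fixed-size chunking with overlap. The sketch below is only an illustration: the function name `chunk_words` and the defaults `chunk_size=200` and `overlap=40` are arbitrary choices of mine, and it counts words as a rough proxy for tokens (a tokenizer would give exact token counts).

```python
from typing import List


def chunk_words(text: str, chunk_size: int = 200, overlap: int = 40) -> List[str]:
    """Split text into chunks of about chunk_size words, each sharing
    `overlap` words with the previous chunk."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    words = text.split()
    step = chunk_size - overlap
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once this chunk has reached the end of the text,
        # so we do not emit a tiny trailing chunk of pure overlap.
        if start + chunk_size >= len(words):
            break
    return chunks


demo = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_words(demo, chunk_size=4, overlap=1):
    print(chunk)
```

With this approach only the last chunk of a document can be shorter than `chunk_size`, so a 10-word chunk would only appear at the end of a page or for a very short page.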
I would suggest googling “cosine vector similarity” for more information on the underlying math.
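To give a feel for that math, here is a small pure-Python sketch. Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it measures the angle between them and ignores their magnitude. The names `cosine_similarity` and `top_k` and the 2-dimensional example vectors are made up for illustration; real embeddings have many more dimensions, and a vector database does this search for you.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """dot(a, b) / (|a| * |b|); ranges from -1 (opposite) to 1 (same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float],
    candidates: Dict[str, Sequence[float]],
    k: int,
) -> List[Tuple[float, str]]:
    """Return the k candidates most similar to the query, best first."""
    scored = [(cosine_similarity(query, vec), name) for name, vec in candidates.items()]
    scored.sort(reverse=True)
    return scored[:k]


vectors = {
    "a": [1.0, 0.1],   # almost the same direction as the query
    "b": [0.0, 1.0],   # perpendicular to the query
    "c": [-1.0, 0.0],  # opposite direction
}
for score, name in top_k([1.0, 0.0], vectors, k=3):
    print(name, round(score, 3))
```

Note that scaling a vector does not change the score: `[1.0, 2.0]` and `[2.0, 4.0]` point in the same direction, so their cosine similarity is 1.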
Hope my answer could be of some help!
I would also suggest looking at OpenAI’s cookbooks if you haven’t already! For more information, guides, and tips on RAG, I have found James Briggs’s YouTube videos and notebooks very useful.
@Reversehobo thanks a lot for your answer!! I will check out the material you sent.