Document Retrieval in Large Database

bruno.vaz · November 15, 2023, 12:15pm

Hey!

I have a large database of documents (these “documents” are essentially web pages and they are all in HTML). They have information regarding the business itself and can contain a lot of similar information. What I want to do is to create a chatbot on top of this database that can answer any question regarding the content of its documents.

Now, if I pass the correct information to GPT, it can answer the question easily. The main difficulty is how to get that relevant information, so that it can be provided to GPT. Right now, I’m chunking the documents, embedding those chunks, storing them in a vector database and, when a question is asked, I fetch the k-nearest embeddings from this vector database (I’m using text-embedding-ada-002 to create the embedding, by the way).
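For readers following along, the pipeline described above (chunk, embed, store, fetch the k nearest) can be sketched roughly as below. This is only an illustration under stated assumptions: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model such as text-embedding-ada-002, a plain Python list stands in for the vector database, and the chunk sizes are arbitrary.

```python
import math
import re
from collections import Counter


def chunk_words(text, size=50, overlap=10):
    """Split text into overlapping chunks of at most `size` words."""
    words = text.split()
    if not words:
        return []
    step = size - overlap
    stop = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, stop, step)]


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: sparse word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def k_nearest(query, chunks, k=3, embed=toy_embed):
    """Return the k (score, chunk) pairs most similar to the query."""
    q = embed(query)
    scored = [(cosine(q, embed(c)), c) for c in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


chunks = [
    "heat oil in a pan and add popcorn kernels",
    "the company was founded in 1999",
    "our refund policy lasts 30 days",
]
best_score, best_chunk = k_nearest("how to make popcorn", chunks, k=1)[0]
print(best_chunk)
```

In a real setup the `embed` argument would call the embedding API and the sorted scan would be replaced by the vector database's own nearest-neighbour query.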

So, my questions are:

How can I create the embeddings in the best way possible, so that the retrieval step performs well? (I assume OpenAI, Google, etc. did something like this when crawling and scraping the web to fetch relevant information, but I can’t seem to find anything useful online.)
This is a little off topic, but is there a rule of thumb for understanding why one embedding scores higher than another in the k-nearest search? In my experience, embeddings of very short texts tend to get higher scores. For example, if the question is “How can I make popcorn?”, the embedding of a 10-word sentence will score higher than the embedding of a 1000-word chunk (even if that chunk actually answers the question).

Thanks in advance

Reversehobo · November 15, 2023, 3:01pm

There are lots of ways to improve RAG (Retrieval Augmented Generation) performance, and the best approach varies from case to case. OpenAI recently posted a video where one of their developers explains the process of improving RAG performance: https://youtu.be/ahnGLM-RC1Y?si=BSj4bE5LUGQ6Wjjh&t=647

For my own application, I have made huge improvements with the following techniques:

Using a reranker (I use the one from Cohere, but there are other models).
HyDE (Hypothetical Document Embeddings). Basically, instead of searching for your question, you have a model (e.g., GPT-4) generate a hypothetical answer. Then you use that generated answer to search the database instead.
For example, my database consists of 10,000 statistical datasets, each with a title such as “Population by region, age, and gender.” When the user poses a question, the model generates a fake or hypothetical title of a dataset that is likely to contain the information it needs to answer the question. That fake/hypothetical title is then used to search the database instead.
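The HyDE flow described above could look roughly like the sketch below. Everything here is a placeholder assumption: `fake_llm` stands in for a real model call (e.g., GPT-4), `overlap_search` is a toy word-overlap ranking standing in for the vector search, and the dataset titles other than the one quoted above are made up for illustration.

```python
import re


def tokens(text):
    """Lowercased word set of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_search(query, titles, k=3):
    """Toy search: rank titles by Jaccard word overlap with the query."""
    q = tokens(query)

    def score(title):
        t = tokens(title)
        union = q | t
        return len(q & t) / len(union) if union else 0.0

    return sorted(titles, key=score, reverse=True)[:k]


def hyde_search(question, generate_hypothetical, search, k=3):
    """HyDE: search with a generated hypothetical answer, not the question."""
    hypothetical = generate_hypothetical(question)
    return search(hypothetical, k)


def fake_llm(question):
    """Hypothetical stand-in for a model that invents a plausible title."""
    return "Population by age and region"


TITLES = [
    "Household income by municipality",
    "Population by region, age, and gender",
    "Unemployment rate by quarter",
]

question = "How many women aged 30-39 live in the north?"
results = hyde_search(
    question, fake_llm, lambda q, k: overlap_search(q, TITLES, k), k=1
)
print(results)
```

With this toy data the raw question shares no words with any title, while the hypothetical title overlaps heavily with the right dataset, which is the intuition behind the technique.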

In your case, I think it would be very useful to write a script that extracts all text from the HTML before generating embeddings and adding it to your database.
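A minimal sketch of such a script, using only Python’s standard library, is shown below. The class and function names (`TextExtractor`, `html_to_text`) are made up for this example, and skipping `script`, `style`, and `noscript` content is an assumption about what counts as noise on your pages.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and ignores script/style/noscript content."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if not self._skip_depth and data.strip():
            self._parts.append(data.strip())

    def text(self):
        return " ".join(self._parts)


def html_to_text(html):
    """Return the visible text of an HTML string as one line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


page = (
    "<html><head><style>p{color:red}</style>"
    "<script>var x=1;</script></head>"
    "<body><h1>Opening hours</h1><p>We open at 9am.</p></body></html>"
)
print(html_to_text(page))
```

The extracted text can then be chunked and embedded instead of the raw HTML, so tags and scripts don’t end up in the embeddings.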

When it comes to the embedding scoring process, I’m not an expert, but I think it works something like this:

Embeddings are vector representations of strings. Different embedding models will generate different vector values for the same string.
If two strings have similar meanings or content, their vectors will be closer to each other in space compared to two strings that do not. This means that even if the words are different (e.g., “apples are grown during the summer” and “The hottest season is often when pears are cultivated”), the distance between their calculated embedding vectors will still reflect the fact that the information is similar.

As for how the length (or rather the number of tokens) of a string affects the expected vector similarity, I really can’t say. But if you’re actually chunking your text before adding it to the database, I can’t see why you would even have some embeddings with “10 words” and some with “1000 words”?
I would suggest googling “cosine vector similarity” for more information on the underlying math.
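For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. A small sketch in plain Python (the example vectors are arbitrary and not real embeddings):

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```

The score depends only on the angle between the vectors, not on their magnitudes, so a vector and a scaled copy of it score approximately 1.0, while perpendicular vectors score 0.0.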

Hope my answer could be of some help!

Reversehobo · November 15, 2023, 3:07pm

I would also suggest looking at OpenAI’s cookbooks if you haven’t already! For more information, guides, and tips on RAG, I have found James Briggs’ YouTube videos and notebooks very useful.

bruno.vaz · November 15, 2023, 3:10pm

@Reversehobo thanks a lot for your answer!! I will check the material you have sent

githubv321 · October 27, 2024, 4:00pm

Hi!
I am working on a document retrieval task for semantic search. The dataset consists of polymer products/materials/grades. I have created a description for every product/grade by concatenating its attributes, such as the industry and subindustry it is used in, its application areas, its features, and so on. I am using ColBERT to embed the documents and the query. On top of that, I am using cross-encoders for reranking (in fact, I have tried different bi-encoders as well).

The challenge I am facing is choosing a score threshold for deciding which documents are relevant. If my query is “water management materials” vs. “I am looking for water management materials”, the scores vary drastically, even though they should be very close because both queries mean the same thing. Similarly, if I search for something that’s not relevant, I get scores close to those of queries whose results are relevant.

I am not able to see any convergence of scores for relevant vs irrelevant query results. Would appreciate any help!
