Hi there,
we are developing a chatbot to answer questions about our technical datasheets.
We use the same technique listed here.
We have a huge number of documents that are chunked and then converted into embeddings. The results are good, but sometimes we get wrong or incomplete answers.
This is how we have divided our docs (a rough sketch of the scheme follows the list):
each section is identified by a title and subtitles: the title + subtitles are prepended at the top of each chunk with unique separators (e.g. TITLE ▬ SUBTITLE1 ▬ SUBTITLE2 ▬ \n context). This way ChatGPT should “understand” that it is a reference to a manual section (right?)
each section is limited to a maximum of 500 tokens (too much, too little?): if a section overflows 500 tokens, it is split into parts (so the title could be TITLE ▬ SUBTITLE1 ▬ Part 1 ▬ etc.)
if there are tables, we try to convert them into JSON
we don't escape any special characters; we pass the text as is
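For reference, here is a minimal sketch of the chunking scheme above (the `build_chunks` helper, the header budget, and the 500-token limit are illustrative assumptions; it requires tiktoken):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by ada-002 / gpt-3.5-turbo
MAX_TOKENS = 500  # the per-chunk budget discussed in this thread

def build_chunks(title: str, subtitles: list[str], body: str) -> list[str]:
    """Prefix every chunk with TITLE ▬ SUBTITLE ... and split bodies that overflow MAX_TOKENS."""
    header = " ▬ ".join([title, *subtitles])
    # Reserve room for the header plus a possible " ▬ Part N" suffix.
    reserved = len(enc.encode(header + " ▬ Part 99 ▬ \n"))
    budget = MAX_TOKENS - reserved

    body_tokens = enc.encode(body)
    parts = [body_tokens[i:i + budget] for i in range(0, len(body_tokens), budget)]
    chunks = []
    for n, part in enumerate(parts, start=1):
        suffix = f" ▬ Part {n}" if len(parts) > 1 else ""
        chunks.append(f"{header}{suffix} ▬ \n{enc.decode(part)}")
    return chunks

# Example: a long section gets split into "Part 1", "Part 2", ...
print(len(build_chunks("Power Supply", ["Absolute Maximum Ratings"], "voltage limits " * 400)))
```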
So the question is: what are the best practices for pre-processing documents?
Are we missing something? Is it better to use larger sections with more than 500 tokens?
Is it necessary to include the title in every chunk (yes, I suppose)?
Dividing sections into smaller parts, like you mentioned, is a good approach to ensure that each section stays within the token limit. However, 500 tokens is less than 10% of the limit, so this would be a good place to adjust. How long is the largest section? If all of your sections would fit into a single embedding when the limit is increased to some n tokens, then I would recommend testing your results with that.
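A quick way to sanity-check that (a sketch; the `sections` dict here is just a toy placeholder for your parsed datasheet sections):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by ada-002 / gpt-3.5-turbo

sections = {
    "Electrical characteristics": "Supply voltage: 2.7 V to 5.5 V ...",
    "Pinout": "Pin 1: VCC, Pin 2: GND ...",
}

# Measure each section's length in tokens to see whether whole sections
# would fit into a single embedding without splitting.
lengths = {name: len(enc.encode(text)) for name, text in sections.items()}
print("largest section:", max(lengths.values()), "tokens")
```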
Converting tables into JSON is a good practice as well, as it allows the model to understand the structured data and potentially provide more accurate answers.
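As an illustration (a sketch with a made-up datasheet table, not a general parser), one simple way to turn a table into JSON that can live inside a chunk:

```python
import json

# A made-up datasheet table in pipe-separated form (illustrative only).
table = """Parameter | Min | Max | Unit
Supply voltage | 2.7 | 5.5 | V
Operating temperature | -40 | 85 | °C"""

# First line is the header; each following line becomes one JSON record.
header, *rows = [line.split(" | ") for line in table.splitlines()]
records = [dict(zip(header, row)) for row in rows]
print(json.dumps(records, ensure_ascii=False, indent=2))
```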
Including the title in the chunks is a good idea; I wouldn't change that unless the title is somehow contextually irrelevant.
It’s also important to consider how you’re using the retrieved chunks. Depending on your use case, there might be additional strategies you can consider in both your embedding pre-processing and retrieval processes. For example, if you’re not retrieving the results that you’d expect from your current search queries, you could try something like HyDE.
Yes, this was one of our doubts. We have sections of even more than 7,000 tokens, so we must chunk sections in some cases, but we could try to increase the limit to the maximum (4k with ada, if I remember right) and see the difference.
Hi @wfhbrian,
I'm trying to understand how to use HyDE in our use case.
I'm not quite sure I have understood: is the hypothetical answer generated directly by GPT itself? I mean, do I ask GPT a question and then use that answer to perform an embedding search on my data?
In my understanding, the hypothetical document could be generated from my dataset, but in all the examples I've found it seems they are generated from public data on the web: in that case, how could they contain relevant data specific to our use case?
HyDE is all about generating a hypothetical answer to the question being asked. You then use that answer to create an embedding and search against your corpus.
You can pair it with lots of other techniques (such as slot filling, seeding, etc.) and it's very effective at getting accurate answers back (compared to just searching on what the user has inputted).
Hi @lids, I understand that the answer is hypothetical, but I'm not quite sure how to ask the question to get the hypothetical answer: am I supposed to ask GPT-3.5-turbo directly, or query my corpus?
Send it to GPT; see the HyDE paper for example prompts.
Then create an embedding from that answer, compare it to the corpus embeddings, and retrieve. This performs significantly better than passing the question in.
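For anyone trying to wire this up, here is a minimal sketch of that flow (the prompt wording, model names, and `hyde_search` helper are assumptions, and `corpus_embeddings` is assumed to be a pre-computed NumPy array of your chunk embeddings):

```python
import numpy as np
from openai import OpenAI  # openai>=1.0 client

client = OpenAI()

def hyde_search(question: str, corpus_embeddings: np.ndarray, top_k: int = 5) -> np.ndarray:
    # 1) Ask the chat model for a hypothetical answer (it may be wrong; that is expected).
    hypothetical = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user",
                   "content": f"Write a short passage that answers this question:\n{question}"}],
    ).choices[0].message.content

    # 2) Embed the hypothetical answer instead of the raw question.
    emb = np.array(client.embeddings.create(
        model="text-embedding-ada-002", input=hypothetical).data[0].embedding)

    # 3) Cosine similarity against the pre-computed chunk embeddings, best matches first.
    sims = corpus_embeddings @ emb / (
        np.linalg.norm(corpus_embeddings, axis=1) * np.linalg.norm(emb))
    return np.argsort(-sims)[:top_k]
```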
Just want to add some detail here for anyone who hasn’t read the HyDE paper yet.
The reason this method works better than just searching on the embedded prompt comes down to the structure of each.
Even if the hypothetical answer is factually incorrect, because of the way GPT-based LLMs work, a plausible-but-wrong answer will still look like the correct answer. The theory is that it will have similar structure and inter-connectedness between tokens, which the embedding models pick up on, making the embeddings more proximal.
Then, after pulling the relevant chunks from your vector DB, the prompt is re-run with that data and the model can generate a plausible answer which has the added benefit of being more correct.
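Putting that last step into code, a sketch of the final call (the system prompt and the `answer_with_context` helper are illustrative, not the method from the paper):

```python
from openai import OpenAI  # openai>=1.0 client

client = OpenAI()

def answer_with_context(question: str, retrieved_chunks: list[str]) -> str:
    # Re-run the user's question with the retrieved chunks as grounding context.
    context = "\n\n".join(retrieved_chunks)
    messages = [
        {"role": "system",
         "content": "Answer using only the provided datasheet excerpts. "
                    "If the answer is not in them, say you don't know."},
        {"role": "user", "content": f"Excerpts:\n{context}\n\nQuestion: {question}"},
    ]
    resp = client.chat.completions.create(model="gpt-3.5-turbo", messages=messages)
    return resp.choices[0].message.content
```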