Hi @akbayt, as Curt said, embedding is the way to go to make sure the model knows what to answer. Fine-tuning is optional and might help you train the model in how to answer. Feel free to reach out to me if you need help with document processing for meaningful embeddings.
Any idea on the limit of data size? That is, at what document size does embedding become THE way to go?
Then, for embeddings: segmenting Word documents is not a problem. The problem is PDF (at least for me), where you don’t have an Object Model inside… The only thing I found is the ASPOSE library (I think), where you can get the paragraphs… but that costs quite a bit. Any advice?
You can start by examining an open-source project like EmbedChain.
For PDFs, it uses a library to extract the text inside them.
For general chunk size, my pipeline embeds dynamically from the original source after “getting close” with an initial embedding match, to maximize cosine similarity with the incoming tokens.
But without any fanciness, try embedding at the “thought” level: 3 or 4 paragraphs or so, whatever size is likely to contain an entire thought without being overly fragmented and “scatterbrained” when fed back into the LLM.
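A minimal sketch of that paragraph-level chunking (the `chunk_by_paragraphs` name is mine, and it assumes paragraphs are separated by blank lines):

```python
def chunk_by_paragraphs(text, paras_per_chunk=4):
    """Group consecutive paragraphs so each chunk holds roughly one 'thought'."""
    # Assumption: paragraphs are separated by blank lines ("\n\n").
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks = []
    for i in range(0, len(paragraphs), paras_per_chunk):
        chunks.append("\n\n".join(paragraphs[i:i + paras_per_chunk]))
    return chunks
```

Each chunk then gets embedded as one unit, so a retrieval hit brings back a whole thought rather than a fragment.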
While I agree that fine-tuning may not be the best approach for answering questions over documents, sampling data through semantic search doesn’t always provide meaningful chunks, particularly for knowledge distributed across the document. Some chunks may appear irrelevant despite being necessary to answer the question. I am exploring whether fine-tuning can be useful in such cases and, if so, how to convert the document into a training set. Should we create question/answer pairs for the entire document manually?
I might know who this is coming from… here is the update: I have a working algorithm that cuts the text precisely “at thought”. This way the chunks contain one idea at a time and are perfect for embedding, where the goal is to get a “ready to go” context item directly from the database without further processing by an additional model, which reduces operational costs.
So you have some incoming question, the LLM “makes up” an answer, you correlate this answer with your data, and then you feed your real data back into the LLM to answer the question correctly.
This might be better suited for you than a fine-tune, but if you’re in a super-detailed niche area, then maybe the fine-tune would perform better.
So in this context, your pipeline would have:
Question → Fine-Tune → Rough Answer → Cross-Correlate with your data → Retrieved chunks → (Prompt + LLM) → True Answer
The difference is that the native LLM might provide a good enough Rough Answer to correlate with. Just depends on the domain, and how much you need to teach the LLM to provide decent Rough Answers.
So the non-fine-tuned LLM route is here (and about 10x cheaper!):
Question → Raw LLM → Rough Answer → Cross-Correlate with your data → Retrieved chunks → (Prompt + LLM) → True Answer
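The “cross-correlate” step of that flow can be sketched as HyDE-style retrieval: embed the rough answer and rank stored chunks by cosine similarity against it. The `embed()` below is a toy character-frequency stand-in for a real embedding model, just to make the sketch runnable; all names are illustrative:

```python
import numpy as np

def embed(text):
    # Toy stand-in for a real embedding model: a normalized
    # character-frequency vector, only so the pipeline runs end to end.
    v = np.zeros(26)
    for ch in text.lower():
        if ch.isalpha():
            v[ord(ch) - ord("a")] += 1
    n = np.linalg.norm(v)
    return v / n if n else v

def hyde_retrieve(question, rough_answer, chunks, top_k=2):
    """Rank stored chunks by cosine similarity to the LLM's rough answer."""
    query_vec = embed(rough_answer)
    scored = sorted(chunks, key=lambda c: float(embed(c) @ query_vec), reverse=True)
    return scored[:top_k]
```

The retrieved chunks then go into the final prompt, together with the original question, to produce the True Answer.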
Rather, in this case, one would fine-tune on question/answer pairs inspired by the text (at this volume, synthesized by AI), while still including as many verbatim passages as possible in the assistant answers, so the model can replay the documentation when a question calls for it.
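For illustration, such synthesized pairs can be serialized into the chat-format JSONL that OpenAI’s fine-tuning endpoint accepts; the `make_finetune_rows` helper name is mine:

```python
import json

def make_finetune_rows(qa_pairs):
    """Serialize (question, answer) pairs into chat-format JSONL rows.

    Answers should keep verbatim passages from the source document so the
    fine-tuned model can replay the documentation when asked.
    """
    rows = []
    for question, answer in qa_pairs:
        rows.append(json.dumps({
            "messages": [
                {"role": "user", "content": question},
                {"role": "assistant", "content": answer},
            ]
        }))
    return "\n".join(rows)
```

One JSON object per line is the expected file layout for upload.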
A language model’s ability comes from its weights, and the re-weighting done by a fine-tune gives it new output patterns for given inputs.
Sounds interesting. You are right. The solution might be a combination of the methods.
I am very skeptical about the fine-tuning. I assume OpenAI is training an additional layer (like PEFT), which doesn’t give the model much chance to reason over the newly learned data. It will inevitably answer with a mixture of information from its memory. Combining all the sources (fine-tuning + context + HyDE) must be the way to go.
(By the way, I am trying to find a good way to do document QA. There is no concrete use case. And one might say there is no rule of thumb for all types of data… and I would agree with that smart person.)
Yeah, when going with HyDE, you would only need to teach it things with a fine-tune for obscure areas the native LLM might be bad at.
Some examples might be obscure local laws, specific policies you have that the native LLM has no shot at getting close to, etc.
Another approach, for these severely “insulated” situations (insulated between Question and True Answer), is to embed the incoming question Q0, find the nearest embedded question in your collection, Q1, and then return your answer A1 through a simple lookup on Q1.
When this “proxy answer” A1 doesn’t completely answer Q0, you could HyDE the A1 into your broader data set (along with the original Q0) to get even more context, and feed this back into the LLM to form the True Answer.