Fine-tuning with a massive number of documents

How do you fine-tune the model if you have 10,000 documents?

  1. First, the gpt-3.5-turbo model: is it possible to fine-tune it?
  2. The goal is to feed the documents to a new gpt-3.5-turbo model and then be able to ask questions and get answers.
  3. So if 1 and 2 are feasible, how do you feed the docs? By using the fine-tuning API?

You can’t fine-tune turbo. You should really look into embeddings.


ok, good to know.
How about answers to points #2 and #3?

Follow the pattern found here.


Now that the turbo model can be fine-tuned, the question is valid again.

Does anyone have an answer?

You can now fine-tune turbo, but if you have 10,000 documents, I would still use embeddings. Much cheaper, and less prone to hallucinations.
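The retrieval core of the embeddings approach can be sketched in a few lines. This is a minimal sketch assuming you have already computed embedding vectors for your chunks (e.g. via an embeddings API); the function names are mine, not from any library.

```python
import numpy as np

def cosine_similarity(a, b):
    # cosine similarity between two embedding vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k_chunks(query_vec, chunk_vecs, chunks, k=3):
    # rank stored chunks by similarity to the query embedding,
    # return the k most similar ones to feed into the prompt
    scores = [cosine_similarity(query_vec, v) for v in chunk_vecs]
    order = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)
    return [chunks[i] for i in order[:k]]
```

The retrieved chunks are then pasted into the prompt as context, so the model answers from your documents instead of from a fine-tune.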


Hi @akbayt, as Curt said, embeddings are the way to go here to make sure the model knows what to answer. Fine-tuning is optional and might help you train the model on how to answer. Feel free to reach out to me if you need help with document processing for meaningful embeddings.

  1. Any idea on the limit of data size? That is, at what document size does embedding become THE way to go?
  2. Then, for embeddings: segmenting Word documents is not a problem; the problem is PDFs (at least for me), where you don’t have an object model inside… The only thing I found is the ASPOSE library (I think), where you can get the paragraphs… but that costs quite a bit. Any advice?

You can start by examining an open-source project like EmbedChain

For PDFs, it uses a library to extract the text inside them.

For general chunk size: in my pipeline I re-embed dynamically from the original source, after “getting close” off an initial embedding match, to maximize cosine similarity with the incoming tokens.

But without any fanciness, try embedding at the “thought” level, so 3 or 4 paragraphs or so. Whatever size is likely to contain an entire thought without being overly fragmented and “scatterbrained” when fed back into the LLM.
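A simple version of that “thought-level” chunking can be sketched as grouping a few paragraphs per chunk. This is a minimal sketch under the assumption that paragraphs are separated by blank lines; the function name is mine.

```python
def chunk_by_paragraphs(text, paras_per_chunk=3):
    # split on blank lines, then group a few paragraphs per chunk
    # so each chunk is likely to hold one complete "thought"
    paras = [p.strip() for p in text.split("\n\n") if p.strip()]
    return ["\n\n".join(paras[i:i + paras_per_chunk])
            for i in range(0, len(paras), paras_per_chunk)]
```

Each chunk is then embedded and stored alongside its vector for later retrieval.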


@curt.kennedy, @sergeliatko, thanks for the quick reply.

While I agree that fine-tuning may not be the best approach for question answering over documents, sampling data through semantic search doesn’t always provide meaningful chunks, particularly for distributed knowledge. Some chunks may appear irrelevant, despite being necessary to answer the question. I am exploring whether fine-tuning can be useful in such cases and, if so, how to convert the document into a training set. Should we create question/answer pairs for the entire document manually?

For distributed knowledge, have the LLM answer in parallel from multiple hits, then condense and refine the multiple answers into one or two final answers, and pick the best one.

Don’t forget about keyword searches too. You can run both keyword and semantic at the same time and pick the best of both.
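Running keyword and semantic search together can be sketched as blending two scores per document. This is a minimal sketch assuming the semantic scores are precomputed (e.g. cosine similarities from embeddings); the scoring functions and the `alpha` blend weight are my own illustrative choices.

```python
def keyword_score(query, doc):
    # fraction of query terms that appear in the document
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def hybrid_rank(query, docs, semantic_scores, alpha=0.5):
    # blend a precomputed semantic score with keyword overlap,
    # then return document indices from best to worst
    combined = [alpha * semantic_scores[i] + (1 - alpha) * keyword_score(query, d)
                for i, d in enumerate(docs)]
    return sorted(range(len(docs)), key=lambda i: combined[i], reverse=True)
```

In practice you would replace `keyword_score` with something like BM25, but the “run both, pick the best of both” idea is the same.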

I’m just not sure how fine-tuning will help figure this out. What do you intuitively think fine-tuning will do here?


I might know who this is coming from… Here is the update: I have a working algorithm that cuts the text precisely “at the thought”. This way the chunks contain one idea at a time and are perfect for embedding, where the goal is to get a “ready to go” context item directly from the database without further processing by an additional model, which reduces operational costs.


Exactly, that’s probably the trickiest part, and without knowing the knowledge domain it is difficult to figure out the approach to take to reduce this risk.

What are the docs about and what is the primary goal of the solution you’re building?

I’m wondering if the model itself could either help with, or completely be in charge of deciding the best method to choose… Seems within the realms of what GPT-4 could do.


Any way you can share this? Or give a general outline?

I do not :slight_smile: That was exactly what I was wondering: whether the model can deduce such relations from question-answer training. I plan to test it later, and if I get any results, I will drop them here.


This sounds similar to HyDE, BTW.

So you have some incoming question, then the LLM “makes up” an answer, and then you correlate this answer with your data, then use your real data to feed back into the LLM and answer the question correctly.

This might be better suited for you than a fine-tune, but if you’re in a super-detailed niche area, then maybe the fine-tune would perform better.

So in this context, your pipeline would have:

Question → Fine-Tune → Rough Answer → Cross-Correlate with your data → Retrieved chunks → (Prompt + LLM) → True Answer

The difference is that the native LLM might provide a good enough Rough Answer to correlate with. Just depends on the domain, and how much you need to teach the LLM to provide decent Rough Answers.

So the LLM non-FT route is here (and about 10x cheaper!):

Question → Raw LLM → Rough Answer → Cross-Correlate with your data → Retrieved chunks → (Prompt + LLM) → True Answer
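The non-fine-tuned HyDE pipeline above can be sketched as a chain of callables. This is a minimal sketch: `draft_llm`, `retrieve`, and `answer_llm` are stand-ins for your own LLM calls and retrieval step, not real library functions.

```python
def hyde_pipeline(question, draft_llm, retrieve, answer_llm):
    # 1. Question -> Raw LLM -> Rough Answer: the model "makes up" an answer
    rough_answer = draft_llm(question)
    # 2. Cross-correlate the rough answer with your real data -> retrieved chunks
    chunks = retrieve(rough_answer)
    # 3. Prompt + LLM: feed the real chunks back in to form the True Answer
    prompt = ("Answer the question using only this context:\n"
              + "\n".join(chunks)
              + f"\n\nQ: {question}")
    return answer_llm(prompt)
```

Swapping `draft_llm` for a fine-tuned model gives the fine-tune variant of the pipeline; everything downstream stays the same.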

Rather, in this case, one would fine-tune on both questions and answers inspired by the text (at this quantity, synthesized by AI), while still including as many verbatim passages as possible as assistant answers, in case the questioning requires replaying the documentation.

A language model’s ability comes from its weights, and the re-weighting done by a fine-tune gives it new output patterns for given inputs.

Sounds interesting. You are right. The solution might be a combination of the methods.

I am very skeptical about fine-tuning. I assume OpenAI is training an additional layer (like PEFT), and this doesn’t give the model much chance to reason over the newly learned data. It will inevitably answer with a mixture of info from its memory. Combining all sources (fine-tuning + context + HyDE) must be the way to go.

(By the way, I am trying to find a good way for document QA. There is no concrete case. And one might say, there is no rule of thumb for all types of data… and I would agree with that smart person :slight_smile: )


Yeah, when going with HyDE, you would only need to teach it things with the fine-tune for obscure areas the native LLM might be bad at.

Some examples might be obscure local laws, specific policies you have that the native LLM has no shot at getting close to, etc.

Another approach for these severely “insulated” situations, insulated between Question and True Answer, is to embed the incoming question “Q0”. Then find the nearest embedded question in your collection, “Q1”. Then return its answer “A1” through a simple lookup on Q1.

If A1 is not a total “proxy answer”, i.e. it may not completely answer Q0, you could HyDE the A1 into your broader data set (along with the original Q0) to get even more context, and feed this back into the LLM to form the True Answer.
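The Q0 → Q1 → A1 lookup can be sketched as a nearest-neighbor search over question embeddings. This is a minimal sketch assuming the stored question vectors and their paired answers are precomputed; the function name is mine.

```python
import numpy as np

def answer_by_nearest_question(q0_vec, question_vecs, answers):
    # find the stored question Q1 whose embedding is closest
    # to the incoming Q0, then return its paired answer A1
    sims = [float(np.dot(q0_vec, v) / (np.linalg.norm(q0_vec) * np.linalg.norm(v)))
            for v in question_vecs]
    return answers[int(np.argmax(sims))]
```

For large collections you would swap the linear scan for a vector index, but the lookup logic is the same.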
