Fine-tuning for text classification / finding relevant parts in huge documents

I want to fine-tune a model to find relevant (funny, interesting, shareable …) text snippets within my text documents.
I have 10 documents, each around 90K characters, with manually (human-)selected text snippets, ratings from 1-5, and reasons why each snippet was selected.
My idea was to fine-tune 4o or 4o-mini, but the result didn’t improve the output.
Do you think that it makes sense to use fine tuning here?
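For context, fine-tuning 4o/4o-mini expects chat-format JSONL, one example per line. A minimal sketch of what one training example for this rating task might look like (the snippet, rating, and reason here are made up for illustration, and the exact system prompt is an assumption):

```python
import json

# Hypothetical training example: snippet in, rating + reason out.
example = {
    "messages": [
        {"role": "system",
         "content": "Rate the snippet 1-5 for shareability and explain why."},
        {"role": "user",
         "content": "Snippet: 'The cat filed its own taxes.'"},  # made-up snippet
        {"role": "assistant",
         "content": "Rating: 4. Reason: absurd humor, short, easy to share."},
    ]
}

# Each line of the .jsonl training file is one such JSON object.
print(json.dumps(example))
```

With only 10 documents the resulting example count may simply be too small for fine-tuning to shape the model noticeably.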


I would use RAG with some functions to find the relevant parts of your documents, then feed the model with your updated data to check it.

Check this project on GitHub; it includes a notebook with some tests.


Thanks so much for your fast answer. So you think fine-tuning makes no sense in this case;
instead I should prefilter with RAG and then pass the retrieved results to gpt-4o.
I will give it a try.


You’re welcome. Fine-tuning can help shape the model if needed, but I’d recommend starting small first.

  1. Start by Talking to Your Data: Begin by analyzing smaller subsets of your data to see what insights emerge. This approach is highly effective for understanding patterns and getting a sense of what works before scaling up.
  2. Embed Your Data: Use OpenAI embeddings to create a vector store for all your text snippets. This will enable efficient storage and retrieval of relevant information when processing large documents.
  3. Leverage a RAG Pipeline: Incorporate a RAG (Retrieval-Augmented Generation) pipeline. Use it to retrieve relevant snippets, feed them into the model, and compare results with updated data. Depending on the situation, you can adapt this approach dynamically to refine your outcomes.
  4. Automate and Refine: After gathering results, move toward automation. Use your insights and tools to streamline the process for larger datasets while ensuring precision.
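The retrieval step (points 2-3) can be sketched with plain cosine similarity over embedding vectors. The snippets and 3-d toy vectors below are stand-ins for real OpenAI embedding vectors (which you would get from the embeddings API), so the numbers are illustrative only:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k_snippets(query_vec, snippet_vecs, snippets, k=2):
    """Return the k snippets whose vectors are closest to the query vector."""
    scores = [cosine_similarity(query_vec, v) for v in snippet_vecs]
    ranked = sorted(zip(scores, snippets), key=lambda p: p[0], reverse=True)
    return [s for _, s in ranked[:k]]

# Toy 3-d vectors stand in for real embeddings of the stored snippets.
snippets = ["funny anecdote", "dry legal clause", "shareable one-liner"]
snippet_vecs = [[0.9, 0.1, 0.0],
                [0.0, 0.2, 0.9],
                [0.8, 0.3, 0.1]]
query_vec = [1.0, 0.0, 0.0]  # toy embedding of "find funny snippets"

print(top_k_snippets(query_vec, snippet_vecs, snippets, k=2))
# → ['funny anecdote', 'shareable one-liner']
```

In a real pipeline the top-k snippets would then be placed into the gpt-4o prompt, which is the "feed them into the model" step in point 3.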