Fine-tuning for text classification / finding relevant parts in huge documents

I want to fine-tune a model to find relevant (funny, interesting, shareable …) text snippets within my text documents.
I have 10 documents, each around 90K characters, with manually (human-)selected text snippets, ratings from 1-5, and reasons why each snippet was selected.
My idea was to fine-tune 4o or 4o-mini, but the result didn’t improve the output.
Do you think that it makes sense to use fine tuning here?
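For context, fine-tuning 4o/4o-mini expects chat-format JSONL, one example per line. A minimal sketch of what one training example for this rating task might look like (the snippet, rating, and reason here are made up for illustration, and the exact system prompt is an assumption):

```python
import json

# Hypothetical training example: snippet in, rating + reason out.
example = {
    "messages": [
        {"role": "system",
         "content": "Rate the snippet 1-5 for shareability and explain why."},
        {"role": "user",
         "content": "Snippet: 'The cat filed its own taxes.'"},  # made-up snippet
        {"role": "assistant",
         "content": "Rating: 4. Reason: absurd humor, short, easy to share."},
    ]
}

# Each line of the .jsonl training file is one such JSON object.
print(json.dumps(example))
```

With only 10 documents the resulting example count may simply be too small for fine-tuning to shape the model noticeably.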


I would use RAG with some functions to find the relevant parts of your documents, then feed the model with your updated data to check it.

Check this project on GitHub; it includes a notebook with some tests.


Thanks so much for your fast answer. So you think fine-tuning makes no sense in this case;
instead I should prefilter with RAG and then pass the retrieved results to gpt-4o.
I will give it a try.


You’re welcome. Fine-tuning can help shape the model if needed, but I’d recommend starting small first.

  1. Start by Talking to Your Data: Begin by analyzing smaller subsets of your data to see what insights emerge. This approach is highly effective for understanding patterns and getting a sense of what works before scaling up.
  2. Embed Your Data: Use OpenAI embeddings to create a vector store for all your text snippets. This will enable efficient storage and retrieval of relevant information when processing large documents.
  3. Leverage a RAG Pipeline: Incorporate a RAG (Retrieval-Augmented Generation) pipeline. Use it to retrieve relevant snippets, feed them into the model, and compare results with updated data. Depending on the situation, you can adapt this approach dynamically to refine your outcomes.
  4. Automate and Refine: After gathering results, move toward automation. Use your insights and tools to streamline the process for larger datasets while ensuring precision.
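The retrieval step (points 2-3) can be sketched with plain cosine similarity over embedding vectors. The snippets and 3-d toy vectors below are stand-ins for real OpenAI embedding vectors (which you would get from the embeddings API), so the numbers are illustrative only:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k_snippets(query_vec, snippet_vecs, snippets, k=2):
    """Return the k snippets whose vectors are closest to the query vector."""
    scores = [cosine_similarity(query_vec, v) for v in snippet_vecs]
    ranked = sorted(zip(scores, snippets), key=lambda p: p[0], reverse=True)
    return [s for _, s in ranked[:k]]

# Toy 3-d vectors stand in for real embeddings of the stored snippets.
snippets = ["funny anecdote", "dry legal clause", "shareable one-liner"]
snippet_vecs = [[0.9, 0.1, 0.0],
                [0.0, 0.2, 0.9],
                [0.8, 0.3, 0.1]]
query_vec = [1.0, 0.0, 0.0]  # toy embedding of "find funny snippets"

print(top_k_snippets(query_vec, snippet_vecs, snippets, k=2))
# → ['funny anecdote', 'shareable one-liner']
```

In a real pipeline the top-k snippets would then be placed into the gpt-4o prompt, which is the "feed them into the model" step in point 3.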