Fine-tuning for better extraction

quentinDLF · August 20, 2024, 4:50pm

Hello everyone,

I’m fairly new to the AI field but excited by all the possibilities. It’s nice to see such a helpful community.

I already use the Assistants API with File Search and 4o-mini to extract information from a single file (~4000 characters) provided by a user; there is no link between files. The uploaded files are not uniform (different wording or layout), but they share roughly the same semantics: a list of dated events. Sometimes, however, some events are missing from the extraction, and I don’t understand why, because they follow the same pattern as the ones already found in the same document. I have to tell the assistant two or three more times that information is still missing before I get a complete extraction.

I split my prompt into very simple instructions (“don’t forget any event, verify, etc.”) and I still get incomplete extractions. When I only ask to sort the events chronologically, even in a separate follow-up prompt, the order is still sometimes incorrect.

My question is: would fine-tuning help ensure that all the events are extracted without missing any? Or should I stick with multiple prompts?
I have also read about RAG, but I’m not sure yet which approach would be best.

Thank you. Have a nice day.

sergeliatko · August 20, 2024, 7:15pm

Hi @quentinDLF,

Welcome to the forum. I had a similar problem a couple of years ago, which I solved with a combination of semantic chunking, RAG, and custom data extractors. The whole solution ended up becoming an analysis and data mining framework. Our use case is legal documents, but I think the same approach can easily be applied to yours.

I think I can help you figure this out, and it would make a great example for the new service I’m launching here: https://www.simantiks.com

Can you please share an example of a file you are extracting data from, and the kind of data you need to extract?

The data extractor description should look like this (approximately):

Question: what is the date of the event?

Queries:

event date
on … at
…

Examples:

October 2nd, 2024
mm/dd/yyyy

Here, the question is essentially the instruction for the LLM to parse the input and produce the output; it is also used as the query for RAG.
Queries is a list of words, phrases, and keywords similar to how the information appears in the document being searched. The query vector is shifted by 0.6 towards the centroid of the queries’ vectors to improve RAG precision.
Examples are given to the LLM as examples of the desired output format.
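
To make the vector adjustment a bit more concrete, here is a minimal sketch of one way to read it: the question embedding is moved 60% of the way towards the centroid of the query embeddings by linear interpolation. The names `Extractor`, `centroid`, and `adjust_query_vector` are hypothetical, the interpolation formula is an assumption about what "adjusted by 0.6" means, and the embeddings are plain lists of floats standing in for whatever embedding model is used.

```python
from dataclasses import dataclass, field
from typing import List

Vector = List[float]


@dataclass
class Extractor:
    """Hypothetical container for one data extractor description."""
    question: str                                      # instruction for the LLM, also the RAG query
    queries: List[str] = field(default_factory=list)   # phrases resembling the source text
    examples: List[str] = field(default_factory=list)  # desired output formats


def centroid(vectors: List[Vector]) -> Vector:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    if not vectors:
        raise ValueError("at least one vector is required")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def adjust_query_vector(question_vec: Vector,
                        query_vecs: List[Vector],
                        weight: float = 0.6) -> Vector:
    """Move question_vec towards the centroid of query_vecs.

    Assumed formula: q + weight * (centroid - q).
    weight=0 leaves the question vector unchanged; weight=1 returns the centroid.
    """
    center = centroid(query_vecs)
    return [q + weight * (c - q) for q, c in zip(question_vec, center)]


# Toy 2-dimensional "embeddings" purely for illustration.
extractor = Extractor(
    question="What is the date of the event?",
    queries=["event date", "on ... at"],
    examples=["October 2nd, 2024", "mm/dd/yyyy"],
)
adjusted = adjust_query_vector([1.0, 0.0], [[0.0, 1.0], [0.0, 1.0]])
```

The adjusted vector would then be used for the similarity search instead of the raw question embedding, so that retrieval leans towards text that looks like the listed queries.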

If you can provide several examples of the above, plus the source file or files if you wish, I can run them through my framework and report the results here so that we can continue the discussion.

Of course, the whole thing will be free; I’m going to use it as a marketing case.
