Hello everyone,
I’m fairly new to the AI field but excited by all the possibilities. It’s nice to see such a helpful community.
I already use the Assistants API with File Search and 4o-mini to extract information from one file (~4,000 characters) provided by a user; there is no link between files. The uploaded files are not uniform (different wording or layout) but share roughly the same semantics (a list of events in chronological order). Sometimes, though, events are missing from the extraction, and I don’t understand why, because they follow the same pattern as the ones already found in the same document. I have to tell the assistant two or three more times that information is still missing before I get a complete extraction.
I split my prompt to keep it very simple (“don’t forget any event, verify, etc.”) and I still get incomplete extractions. When I simply ask to sort the events chronologically, even in a separate follow-up prompt, the order is still sometimes incorrect.
My question is: would fine-tuning help ensure that all the events are extracted without missing any? Or should I stick with multiple prompts?
I also read about RAG, but I’m not sure yet which approach would be best.
Thank you. Have a nice day.
Hi @quentinDLF ,
Welcome to the forum. I had a similar problem a couple of years ago, which I solved with a combination of semantic chunking, RAG, and custom data extractors. The whole solution ended up becoming an analysis and data mining framework. Our use case is legal documents, but I think the same approach can easily be applied to yours.
I think I can help you figure this out, and it would make a great example for the new service I’m launching here: https://www.simantiks.com
Can you please share an example of a file you are extracting data from, and what kind of data you need to extract?
The data extractor description should look like this (approximately):
Question: what is the date of the event?
Queries:
Examples:
- October 2nd, 2024
- mm/dd/yyyy
Here, the question is basically the instruction for the LLM to parse the input and produce the output; it is also used as the query to RAG.
Queries is a list of words, sentences, or keywords similar to how the information appears in the searched document. The query vector is shifted by 0.6 towards the center of the query vectors to improve RAG precision.
Examples are provided to the LLM as examples of the desired output format.
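To make the description above more concrete, here is a minimal Python sketch of an extractor description and of the query-vector adjustment. The names `DataExtractor`, `centroid`, and `adjust_query_vector` are hypothetical and are not the framework's actual API. It also assumes that "adjusted by 0.6 towards the center" means a linear interpolation between the question's embedding and the mean of the query embeddings; the real implementation may differ. Embeddings are plain lists of floats here, and the toy vectors are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataExtractor:
    """Hypothetical container for one extractor description."""
    question: str                                       # instruction for the LLM, also the RAG query
    queries: List[str] = field(default_factory=list)    # phrases resembling the source text
    examples: List[str] = field(default_factory=list)   # samples of the desired output format


def centroid(vectors: List[List[float]]) -> List[float]:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def adjust_query_vector(question_vec: List[float],
                        query_vecs: List[List[float]],
                        weight: float = 0.6) -> List[float]:
    """Move question_vec towards the centroid of query_vecs.

    Assumed interpretation: weight=0 keeps the question vector unchanged,
    weight=1 replaces it with the centroid.
    """
    center = centroid(query_vecs)
    return [(1 - weight) * q + weight * c for q, c in zip(question_vec, center)]


# The extractor from the post; the queries were left empty there, so they stay empty here.
date_extractor = DataExtractor(
    question="what is the date of the event?",
    queries=[],
    examples=["October 2nd, 2024", "mm/dd/yyyy"],
)

# Toy 2-dimensional "embeddings", made up for illustration only.
adjusted = adjust_query_vector([1.0, 0.0], [[0.0, 1.0], [0.0, 3.0]])
```

In a real pipeline the vectors would come from whatever embedding model the retrieval step uses, and the adjusted vector would be the one sent to the vector search.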
If you can provide several examples of the above, plus the source file or files if you wish, I can run them through my framework and report the results here so that we can continue the discussion.
Of course, the whole thing will be free; I’m going to use it as a marketing case.