Fine-tuning for better extraction

Hi @quentinDLF,

Welcome to the forum. I had a similar problem a couple of years ago, which I solved with a combination of semantic chunking, RAG, and custom data extractors. The whole solution ended up being an analysis and data mining framework. Our use case is legal documents, but I think the same approach can easily be applied to yours.

I think I can help you figure that out, and it would make a great example for the new service I’m launching here: https://www.simantiks.com

Can you please share an example of a file you are extracting data from, and tell me what kind of data you need to extract?

The data extractor description should look like this (approximately):

Question: what is the date of the event?

Queries:

  • event date
  • on … at

Examples:

  • October 2nd, 2024
  • mm/dd/yyyy

Where the question is basically the instruction for the LLM to parse the input and produce the output; it is also used as the query for RAG.
Queries is a list of words, phrases, or keywords similar to how the answer appears in the document being searched. The query vector is shifted by 0.6 towards the center of the queries' vectors to improve RAG precision.
Examples are given to the LLM as samples of the desired output format.
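
To make the description above more concrete, here is a rough Python sketch. It is not the actual framework code: the `Extractor` class and the `centroid` and `adjust_query` helpers are hypothetical names, the 2-D vectors are toy stand-ins for real embeddings, and reading "adjusted by 0.6" as linear interpolation towards the centroid is an assumption.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Extractor:
    """One extractor description: question, search queries, output examples."""
    question: str
    queries: List[str] = field(default_factory=list)
    examples: List[str] = field(default_factory=list)


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def adjust_query(question_vec, query_vecs, weight=0.6):
    """Move question_vec a fraction `weight` of the way towards the
    centroid of query_vecs (assumed meaning of "adjusted by 0.6")."""
    if not query_vecs:
        # Nothing to pull towards: keep the question vector unchanged.
        return list(question_vec)
    center = centroid(query_vecs)
    return [(1 - weight) * q + weight * c for q, c in zip(question_vec, center)]


# The extractor from the post, written as data.
event_date = Extractor(
    question="What is the date of the event?",
    queries=["event date", "on … at"],
    examples=["October 2nd, 2024", "mm/dd/yyyy"],
)

# Toy vectors; a real setup would embed the question and each query
# with whatever embedding model the RAG index uses.
question_vec = [1.0, 0.0]
query_vecs = [[0.0, 1.0], [0.0, 3.0]]
adjusted = adjust_query(question_vec, query_vecs)
```

With these toy values the centroid of the queries is `[0.0, 2.0]`, so the adjusted vector lands at roughly `[0.4, 1.2]`: 40% of the original question vector plus 60% of the centroid. The adjusted vector would then be used for the similarity search in place of the plain question embedding.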

If you can provide several extractor descriptions like the one above, plus the source file or files if you wish, I can run them through my framework and report the results here so that we can continue the discussion.

Of course, the whole thing will be free; I’m gonna use it as a marketing case.