Fine tuning model for custom entity extraction

Hi everyone, I can’t find any previous post related to my use case, so I’m posting this to get some starting direction.

I have a use case where I need to find specific custom entities in large text files. The files are legal documents and my entities are quite well defined. The raw text files are in the following format


And so on. The questions are my entities, and the base 3.5 model returns decent enough results if the answers are really short (No, Not available, Not applicable). It struggles to extract entities and their answers when the answers are descriptive (for example, if a question is about a legal matter and the answer is Yes, there is usually a page or a few paragraphs of text): the base model either returns only a few lines of text or detects part of the answer as separate questions.

Is fine-tuning a model the right way to go about it, or am I thinking about it wrong? I have explored some custom entity models, including AWS and Azure, but they are quite expensive and just not accurate enough.

Can you explain more what exactly it is you’re trying to do?

It sounds like you’re just trying to conduct some type of fuzzy search on a flat-file-as-database which, if that’s the case, I’d suggest an LLM is not the appropriate tool for the job.

Broadly speaking, it is quite like fuzzy search. I do plan to use the extracted bits to create relevant embeddings, so I am going to use OpenAI anyway.

I may be missing something really obvious, but can you please point out why I shouldn’t be using OpenAI for fuzzy search? I am hoping it does a better job than the other models I mentioned (or libraries like FuzzyWuzzy), because the raw text I plan to pass in for entity extraction is prone to mistakes: the original PDFs have unusual layouts, and text extraction libraries can extract the text in the wrong order because of them.

Using AI isn’t the problem; rather, a large language model isn’t particularly well suited to act as a search engine.

You might try something like breaking up the file into individual Q/A pairs and embedding them into something like Pinecone, then you’d just encode your search string with the same embedding API and surface the near matches in the vector database with semantic search.
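A minimal sketch of that flow, for illustration only. Everything here is an assumption: the `embed` function is a toy word-count stand-in so the example runs offline (in practice you would swap in a call to a real embedding API), the in-memory list stands in for a vector database such as Pinecone, and the sample Q/A pairs are made up.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding API: a sparse word-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if not norm_a or not norm_b:
        return 0.0
    return dot / (norm_a * norm_b)

def build_index(pairs):
    """Embed each Q/A pair once and keep the vector next to the full text."""
    return [(embed(q + " " + a), q, a) for q, a in pairs]

def search(index, query, top_k=1):
    """Embed the query the same way and return the closest stored pairs."""
    query_vec = embed(query)
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    return [(q, a) for _, q, a in ranked[:top_k]]

# Hypothetical Q/A pairs, already split out of the raw document.
QA_PAIRS = [
    ("Is there any pending litigation against the company?",
     "Yes. A claim was filed by a former supplier and is described in the attached schedule."),
    ("Are there any outstanding tax liabilities?", "Not applicable"),
    ("Has the property been surveyed?", "No"),
]

index = build_index(QA_PAIRS)
question, answer = search(index, "pending litigation")[0]
print(question)
print(answer)
```

The point of the design is that each pair is embedded once up front, so a lookup costs one embedding call for the query and returns the stored pair in full rather than a model-generated excerpt.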

But that’s a lot of API calls every time you want to find something (presumably verbatim), because what I suspect you want is to find the exact and complete text of a specific Q/A pair from some probably imperfect search queries.

I just suspect there are other options out there that would allow you to surface the relevant text without resorting to spending money every time you do.
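One such option, as a rough sketch: Python's standard-library `difflib` can rank stored questions against an imperfect query locally, with no API calls. The sample questions and the `find_question` helper are hypothetical, and the 0.5 cutoff is an arbitrary assumption you would need to tune.

```python
import difflib

# Hypothetical questions, already split out of the raw document.
QUESTIONS = [
    "Is there any pending litigation against the company?",
    "Are there any outstanding tax liabilities?",
    "Has the property been surveyed?",
]

def find_question(query, questions, cutoff=0.5):
    """Return the stored question closest to the query, or None if nothing is close."""
    # Compare in lowercase, but return the original text.
    lowered = {q.lower(): q for q in questions}
    matches = difflib.get_close_matches(query.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[matches[0]] if matches else None

# A misspelled query still resolves to the intended question.
print(find_question("Has the proprety been survayed", QUESTIONS))
```

This kind of matching is character-based, so it tolerates typos and extraction noise but not paraphrases; a query worded very differently from the stored question would need a semantic approach instead.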

I’m not saying a GPT can’t do this or even that you couldn’t get it to do it better than some other methods, just that it seems like an absurd amount of overkill to get what you want.

In the end I guess it depends on your budget, how fuzzy the search needs to be, how much effort you want to put into it, how much you already know, and so on.
