Entity Extraction

{"prompt": "I want to order [PRODUCT_B-television]television", "completion": "OrderProduct###"}
{"prompt": "What is the price of [PRODUCT_B-table]table?", "completion": "OrderProduct###"}
{"prompt": "What is the discount on order of 2 [PRODUCT_B-chairs]chairs", "completion": "OrderProduct###"}
{"prompt": "Order one large [PRODUCT_B-book shelf]book shelf", "completion": "OrderProduct###"}

This is the JSONL file I'm using to fine-tune a model to extract entities and produce completion text, but the model is not able to recognize the entities. Can somebody help me extract entities?

You don't need a fine-tuned model for this; a good prompt will do it. But if you want to extract product IDs (which the AI doesn't know), then I don't think you'll be able to do it as easily as you are trying to.

1 Like

Hi @vasundhra362000

These GPT models generally work poorly when the prompt is short. I forget the exact number of tokens, but I think it was around 500, give or take.

This is also true for generating embeddings. If you try to search a database of text using embedding vectors and some basic linear algebra (dot product, cosine similarity, Euclidean distance, etc.) on "shortish" strings, you will get very poor results.
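For reference, the cosine-similarity measure mentioned above is just a normalized dot product, and can be sketched in a few lines of pure Python (the vectors below are made-up toy values, not real embeddings, which typically have over a thousand dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" for illustration only
v1 = [0.1, 0.8, 0.3]
v2 = [0.2, 0.7, 0.4]
print(cosine_similarity(v1, v2))
```

Identical vectors score 1.0 and orthogonal vectors score 0.0, which is why short strings (whose embeddings carry little signal) cluster together and rank poorly.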

In your case @vasundhra362000, you are attempting to fine-tune with very short completions.

However, having said that, @nunodonato is (as usual) correct when he guides folks here away from fine-tuning for these types of tasks. You would be better off using embeddings, but in your case the strings are not long enough to give optimal results.

Hence, if you want an optimal result, you don’t need a GPT-based AI for this, as you can simply store your prompts and completions in a database and use a standard text search (or full text search) to generate the output you seek.
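To make the database suggestion concrete, here is a minimal sketch using SQLite; the table, sample rows, and naive word-matching logic are all invented for illustration, not taken from any particular production design:

```python
import sqlite3

# In-memory table mapping example prompts to completions
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE intents (prompt TEXT, completion TEXT)")
conn.executemany(
    "INSERT INTO intents VALUES (?, ?)",
    [
        ("I want to order television", "OrderProduct"),
        ("What is the price of table?", "OrderProduct"),
    ],
)

def lookup(user_text):
    # Naive keyword match: return the completion of the first row
    # whose stored prompt contains a word from the user text
    for word in user_text.lower().split():
        row = conn.execute(
            "SELECT completion FROM intents WHERE prompt LIKE ?",
            (f"%{word}%",),
        ).fetchone()
        if row:
            return row[0]
    return None

print(lookup("order a television"))  # OrderProduct
```

A real implementation would want stop-word filtering and ranking, but even this crude version returns deterministic results at effectively zero cost per query.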

GPT is a “large language model” and you need a “lot of language (long strings of text)” so to speak to get fine-tuning and embeddings to work with a high degree of accuracy. This “costs money” (not free) and is suboptimal from a systems perspective.

Of course, it's really enjoyable to experiment with the API and GPT models in general, but for your application it's more optimal (and costs less) to write a little bit of code that takes more of an AI "expert system" (rules-based) approach, versus trying to fine-tune a GPT, which is more of an autocompletion / text-prediction engine and not an expert or rules-based system.
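The rules-based approach described above can be sketched as a simple catalogue lookup with regular expressions; the product list and plural handling below are assumptions for illustration, and a real catalogue would come from the asker's own product database:

```python
import re

# Hypothetical product catalogue (assumption for this sketch)
PRODUCTS = {"television", "table", "chair", "book shelf"}

def extract_product(text):
    """Rule-based entity extraction: match known product names."""
    lowered = text.lower()
    # Check longer names first so "book shelf" wins over a shorter match
    for product in sorted(PRODUCTS, key=len, reverse=True):
        if re.search(r"\b" + re.escape(product) + r"s?\b", lowered):
            return product
    return None

print(extract_product("What is the discount on order of 2 chairs"))  # chair
```

Unlike a fine-tuned model, this "expert system" style extractor never hallucinates an entity outside the catalogue, and adding a new product is a one-line change.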


This is just an example. If I use longer text, will entity annotations work with a fine-tuned model? Also, a prompt will only extract predefined entities such as city, person, etc., and those have to be sent in each request; I want to extract custom entities.

Sorry, I cannot predict what will and will not "work fine" when fine-tuning someone else's data. Fine-tuning also requires properly formatted data, including stops, separators, and whitespace, all of which is documented in the OpenAI docs. So the JSONL must validate both as basic JSONL syntax AND against the requirements specified in the OpenAI API docs.
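As one illustration of what "properly formatted" can mean: prompt/completion fine-tuning examples conventionally end the prompt with a fixed separator and the completion with a stop sequence, with a leading space on the completion. The exact `SEPARATOR` and `STOP` strings below are assumed conventions; what matters is picking your own and using them consistently:

```python
import json

SEPARATOR = "\n\n###\n\n"   # assumed separator marking the end of the prompt
STOP = " END"               # assumed stop sequence ending each completion

def make_example(prompt, completion):
    """Build one JSONL training line in prompt/completion format."""
    return json.dumps({
        "prompt": prompt + SEPARATOR,
        "completion": " " + completion + STOP,  # leading space by convention
    })

line = make_example("Order one large book shelf", "OrderProduct")
print(line)
```

At inference time you would then append the same separator to every prompt and pass the stop sequence in the request, so the model knows where a completion ends.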

I’m not sure if the above is a question or not.

If I was developing a production application (such as yours?), I would create a test setup using various methods, including the following:

  • Database search, simple “LIKE” SQL expressions.
  • Database full-text search.

This is because these searches are actual forms of "data extraction" (your words).

GPT models are not "extraction engines"; they are text auto-completion / prediction engines. You cannot "extract" information; you can only fine-tune or train to help the model generate text as it "babbles" along, auto-completing based on user prompts 🙂

Keep in mind that GPT models are, architecturally, decoder-only transformers that predict the next token. Fine-tuning nudges those next-token predictions; it does not add a separate "extraction" stage to the model.

Again, as I stated, from what I know of your use case (so far), using a GPT-based AI model is suboptimal for your application. If I were developing the application component you have described, I would use a database and some text-search method, especially because what you have shown so far are "shortish" strings, and these strings are suboptimal for both fine-tuning and embedding.

If you are simply experimenting and looking to compare, then consider comparing these methods:

  • Database search, simple “LIKE” SQL expressions.
  • Database full-text search.
  • Fine-tuning GPT model(s)
  • Embedding Vectors
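For the full-text option in that comparison, SQLite ships with the FTS5 extension in most standard builds, so a test harness needs no extra infrastructure. A minimal sketch (the virtual table and sample rows are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# FTS5 virtual table: both columns are indexed for full-text search
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(prompt, completion)")
conn.executemany(
    "INSERT INTO docs VALUES (?, ?)",
    [
        ("I want to order television", "OrderProduct"),
        ("What is the price of table?", "OrderProduct"),
    ],
)

# MATCH uses the full-text index; ORDER BY rank sorts by relevance
row = conn.execute(
    "SELECT completion FROM docs WHERE docs MATCH ? ORDER BY rank",
    ("television",),
).fetchone()
print(row[0])  # OrderProduct
```

Unlike a plain `LIKE` scan, FTS5 tokenizes the text, handles multi-word queries, and ranks results, which makes it the stronger baseline of the two database approaches listed above.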

Based on what I have read of your requirement, if you develop well-thought-out code for all of the above, you will find that one of the DB approaches, with text or full-text search, will work the best (and cost the least).

An experienced developer should be able to write an experimental test app for each of the above in a few hours each, so it's easy to compare, which is always a good exercise if you wish to become an expert.


OBTW, I am not just "making this up", @vasundhra362000; I have these components set up already in my "OpenAI Lab", which I run in dev on my desktop.

I have methods to validate the JSONL data as well; it's a good idea for developers to validate the data in their app against both basic JSONL syntax and the OpenAI fine-tuning requirements specified in the API docs.
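Such a validator can be sketched in a few lines of standard-library Python; the required keys and error format here are assumptions, not OpenAI's official checker:

```python
import json

REQUIRED_KEYS = {"prompt", "completion"}

def validate_jsonl(text):
    """Return a list of (line_number, error) for a JSONL string."""
    errors = []
    for i, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as e:
            errors.append((i, f"invalid JSON: {e}"))
            continue
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            errors.append((i, f"missing keys: {sorted(missing)}"))
    return errors

sample = '{"prompt": "hi", "completion": "yes"}\n{bad json}'
print(validate_jsonl(sample))
```

Running this over a training file before uploading catches the smart-quote and missing-key mistakes that otherwise surface only as a failed fine-tune job.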


1 Like