Prompting GPT3.5 for NER data labeling


I’m trying to get ChatGPT to label text data. My text data consists of incomplete sentences and tons of comma deliminated phrases. Basically think of text as lists or short ideas. I’m having trouble getting either consistent answers or the data returning in a clean json format.

Here is an example of one such concept:

Jim Carrey, the mask, child, yellow suit, in a theater.

I want gpt to label the text as following

{“people”: “Jim Carrey”,
“fictional characters”: “the mask”,
“Age terms”: “child”,
“other”: “Yellow”, “suit”, “in”, “a”, “theater”}

Any advice would be really helpful - thank so much

my current template is this:

template = f"“”

You are a data labeler labeling data to be used in token/Named Entity Recognition.

Create a JSON response using GPT-3.5 that categorizes words from the provided text into specific keys.

The keys should include People, Fictional Characters, Age-related Terms, and Other.

The values of these keys should be lists containing relevant words from the input text.

Note: Age related terms should be related to the age of people or animals, specifically.

Age should not include gender or sex terms


input text: Jim Carrey, the mask, child, yellow suit, in a theater.

“people”: “Jim Carrey”,
“fictional characters”: “the mask”,
“Age terms”: “child”,
“other”: “Yellow”, “suit”, “in”, “a”, “theater”



The problem I get with this template is I get inconsistent strings in some categories. For instance I get a lot of gender words in the age category and I get things like hair or clothing terms in the fictional character/people category

Hi Wolfgang - Welcome to the Forum.

Have you considered creating a fine-tuned GPT 3.5 model? This use case seems like a good fit for one.

1 Like

yeah - its a good idea… any experience with it? How many examples should I consider?

Done quite a few of these with really positive experiences.

Data set size depends a bit on the diversity of categories. My initial recommendation would be to start with a few hundreds and see where it takes you. Then make a strategic decision around whether and how to expand and refine your training set.

Are your labels open ended or do you have a complete list of possible labels to choose from? The latter obviously helps to steer a finetuned model a lot.

I’m using regex rn to find people and age terms… so I have some examples of those…

My main concern with finetuning is I have quite a bit of NSFW terms in my main data corpus… I could train on a few hundred s(er)fw examples that have people and age terms… and hope that the model I train for NER after recognizes the NSFW terms as other…

also - any advice on getting outputs as json in 3.5? It works well in 4 with the current template but not with 3.5-turbo-instruct

You might be interested in this,

1 Like

Ok. If you don’t have a closed-ended list of terms that is fine, too. The model should still pick up the overall pattern.

You should include examples of NSFW terms in your training set for the model to understand how to treat these.

In terms of JSON, yes you can instruct the model via fine-tuning to respond in a desired JSON format. Again here, tried and tested and works very well. I agree that in a non-finetuned setting, GPT-4 is inherently better at this but you can definitely get consistent JSON results with a finetuned GPT 3.5.

Finally, ensure your system prompt is specific. If you are for instance worried about the volume of words for a given category, then simply include restrictions in your system prompt in this regard (i.e. no more than X).

1 Like

Thanks @elmstedt - thats more or less what I’m trying to augment. I hope to benefit from finetuning a NER/Token Classifier downstream using this GPTlabeled dataset. So for that, I need a few thousand examples. It looks like my current workflow is something like

  1. Use GPT4 to label a few 100 examples with my specific token/labels
  2. Finetune GPT3.5-instruct
  3. Use Finetuned GPT3.5 to label some 100 examples, check if successful go to 4, otherwise repeat 1-3
  4. Use Finetuned GPT3.5 to label some thousands of examples (with min 1000 examples in people, age, ect categories)
  5. Finetune Roberta/Distilbert originally trained on CoNLL for my purposes
  6. Celebrate :joy:

@jr.2509 thanks for the suggestions so far.

I’m dubious about finetuning on NSFW text as I’ve already been flagged and denied in other examples… when trying to build a nsfw text classifier. Have you had success in finetuning on NSFW text?

1 Like

Unfortunately my data did not involve NSFW terms.

That said, maybe you can just include instructions in your system prompt in this regard, i.e. labeling any NFSW terms as “Other”. It’s a bit trial and error but I could see it working out in practice.

1 Like

one more pointer re JSON format. In your system prompt, include the generic JSON schema that you want the model to respond in addition to including the specific JSONs as example assistant outputs.

1 Like

Here’s the issue I see with this, 100 examples is almost certainly far too few to get very good results.

How many labels do you have?

Just to clarify, the 100 examples are to check if I like how the finetuned gpt3.5 labeled the data for me.

I want it to create 5 or so classes for token classification. If It did the trick, I can try to label a few 1000 and again randomly sample 100 to spot check BEFORE finetuning the NER transformer…

if the finetuned gpt needs more help I can add more examples and refinetune…

But to eventually train the NER transformer I’m aiming for around 1000 examples for each class (Which can occur in as little as 1000 training examples but will more likely be 3k-6k examples)

Does that make sense?

Yeah, that makes more sense, I must have misunderstood your plan.

One reason I sent the link I did, is that you may have some luck finding and modifying one or more of the pre-labeled datasets to use your labels which would be a quick (and cheap) way to get a relatively large dataset to train from.

1 Like

Ahhhh you know I haven’t really considered respec-ing an old dataset… I’ll definitely look into it.

Thanks for the conversation~

Curious to know ballpark costs for the fine tuning!

great workflow. sounds like you have done this. how successful was the project? whats the accuracy like?

I’m actually also looking at solution for NER data labelling right now, and I was thinking about implementing a fine tuned version of the BERT NER model, do you think this could also be a good option ?

context: I want to extract around 10 pairs of entity-value in one particular type of document.