I’m trying to get ChatGPT to label text data. My text data consists of incomplete sentences and tons of comma-delimited phrases. Basically, think of the text as lists or short ideas. I’m having trouble getting either consistent answers or the data returned in a clean JSON format.
Here is an example of one such concept:
Jim Carrey, the mask, child, yellow suit, in a theater.
The problem with this template is that I get inconsistent strings in some categories. For instance, I get a lot of gender words in the age category, and things like hair or clothing terms in the fictional character/people category.
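For reference, the kind of clean output I’m after looks roughly like this (the category names here are illustrative, not my exact template):

```json
{
  "people": ["Jim Carrey"],
  "fictional_characters": ["the mask"],
  "age": ["child"],
  "clothing": ["yellow suit"],
  "location": ["in a theater"]
}
```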
I’ve done quite a few of these with really positive experiences.
Data set size depends a bit on the diversity of categories. My initial recommendation would be to start with a few hundred examples and see where it takes you. Then make a strategic decision around whether and how to expand and refine your training set.
Are your labels open-ended, or do you have a complete list of possible labels to choose from? The latter obviously helps to steer a fine-tuned model a lot.
I’m using regex right now to find people and age terms… so I have some examples of those…
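Something along these lines, with placeholder term lists standing in for my real ones:

```python
import re

# Illustrative only -- my real term lists are much longer; these are placeholders.
PEOPLE_TERMS = r"\b(man|woman|boy|girl|person|people)\b"
AGE_TERMS = r"\b(child|teen(?:ager)?|adult|elderly|young|old)\b"

def find_terms(text: str) -> dict:
    """Return any people/age terms found in a comma-delimited phrase list."""
    return {
        "people": re.findall(PEOPLE_TERMS, text, re.IGNORECASE),
        "age": re.findall(AGE_TERMS, text, re.IGNORECASE),
    }

print(find_terms("Jim Carrey, the mask, child, yellow suit, in a theater"))
# -> {'people': [], 'age': ['child']}
```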
My main concern with fine-tuning is that I have quite a bit of NSFW terms in my main data corpus… I could train on a few hundred safe(r)-for-work examples that have people and age terms… and hope that the model I train for NER afterwards recognizes the NSFW terms as “Other”…
Also - any advice on getting outputs as JSON with 3.5? It works well in 4 with the current template, but not with 3.5-turbo-instruct.
Ok. If you don’t have a closed-ended list of terms that is fine, too. The model should still pick up the overall pattern.
You should include examples of NSFW terms in your training set for the model to understand how to treat these.
In terms of JSON: yes, you can instruct the model via fine-tuning to respond in a desired JSON format. Again, tried and tested here, and it works very well. I agree that in a non-fine-tuned setting GPT-4 is inherently better at this, but you can definitely get consistent JSON results with a fine-tuned GPT-3.5.
Finally, ensure your system prompt is specific. If you are, for instance, worried about the volume of words for a given category, then simply include restrictions in your system prompt in this regard (e.g. no more than X).
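For example, a single training record for the chat fine-tuning endpoint could look like the below (wrapped here for readability; real JSONL is one record per line, and the categories and restrictions are placeholders for your own; the instruct/completions format is analogous with prompt/completion pairs):

```json
{"messages": [
  {"role": "system", "content": "Label each comma-delimited phrase. Respond only with JSON matching the schema {\"people\": [], \"age\": [], \"clothing\": [], \"location\": [], \"other\": []}. No more than 5 terms per category."},
  {"role": "user", "content": "Jim Carrey, the mask, child, yellow suit, in a theater"},
  {"role": "assistant", "content": "{\"people\": [\"Jim Carrey\"], \"age\": [\"child\"], \"clothing\": [\"yellow suit\"], \"location\": [\"in a theater\"], \"other\": [\"the mask\"]}"}
]}
```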
Thanks @anon22939549 - that’s more or less what I’m trying to augment. I hope to benefit from fine-tuning an NER/token classifier downstream using this GPT-labeled dataset. For that, I need a few thousand examples. My current workflow looks something like this:
1. Use GPT-4 to label a few hundred examples with my specific tokens/labels.
2. Fine-tune GPT-3.5-instruct.
3. Use the fine-tuned GPT-3.5 to label some 100 examples and check the output; if successful, go to 4, otherwise repeat 1-3.
4. Use the fine-tuned GPT-3.5 to label some thousands of examples (with a minimum of 1,000 examples in the people, age, etc. categories).
5. Fine-tune RoBERTa/DistilBERT originally trained on CoNLL for my purposes (rough sketch below).
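For step 5, I’m imagining something like this (the label names, file names, and the conversion of GPT labels into BIO tags are all placeholders/assumptions on my side):

```python
# Rough sketch of step 5: fine-tuning a token classifier on the GPT-labeled data.
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          TrainingArguments, Trainer,
                          DataCollatorForTokenClassification)
from datasets import load_dataset

LABELS = ["O", "B-PERSON", "I-PERSON", "B-AGE", "I-AGE", "B-CLOTHING", "I-CLOTHING"]
label2id = {l: i for i, l in enumerate(LABELS)}
id2label = {i: l for l, i in label2id.items()}

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=len(LABELS),
    id2label=id2label, label2id=label2id)

# Assumes a JSON file with "tokens" and "ner_tags" (integer ids) columns,
# produced by converting the fine-tuned GPT-3.5 labels into BIO tags.
dataset = load_dataset("json", data_files={"train": "gpt_labeled_train.json"})

def tokenize_and_align(example):
    enc = tokenizer(example["tokens"], is_split_into_words=True, truncation=True)
    labels = []
    for word_id in enc.word_ids():
        if word_id is None:
            labels.append(-100)  # special tokens are ignored in the loss
        else:
            labels.append(example["ner_tags"][word_id])  # subwords inherit the word's tag
    enc["labels"] = labels
    return enc

tokenized = dataset["train"].map(tokenize_and_align)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ner-model", num_train_epochs=3),
    train_dataset=tokenized,
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
```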
I’m dubious about fine-tuning on NSFW text, as I’ve already been flagged and denied in other cases… when trying to build an NSFW text classifier. Have you had success fine-tuning on NSFW text?
That said, maybe you can just include instructions in your system prompt in this regard, i.e. labeling any NSFW terms as “Other”. It’s a bit of trial and error, but I could see it working out in practice.
One more pointer re: the JSON format. In your system prompt, include the generic JSON schema that you want the model to respond in, in addition to including the specific JSONs as example assistant outputs.
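Concretely, something along these lines (the schema and the worked example are placeholders for your own):

```python
# Sketch: the system prompt carries the generic schema; a worked example is
# supplied as a user/assistant pair. Categories are placeholders.
messages = [
    {"role": "system", "content": (
        "Label the comma-delimited phrases in the input. "
        "Respond only with JSON matching this schema:\n"
        '{"people": [string], "age": [string], "clothing": [string], '
        '"location": [string], "other": [string]}'
    )},
    # One worked example as a user/assistant pair:
    {"role": "user", "content": "Jim Carrey, the mask, child, yellow suit, in a theater"},
    {"role": "assistant", "content":
        '{"people": ["Jim Carrey"], "age": ["child"], "clothing": ["yellow suit"], '
        '"location": ["in a theater"], "other": ["the mask"]}'},
    {"role": "user", "content": "..."},  # the actual text to label goes here
]
```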
Just to clarify, the 100 examples are to check whether I like how the fine-tuned GPT-3.5 labeled the data for me.
I want it to create 5 or so classes for token classification. If it did the trick, I can try to label a few thousand and again randomly sample 100 to spot-check BEFORE fine-tuning the NER transformer…
If the fine-tuned GPT needs more help, I can add more examples and fine-tune again…
But to eventually train the NER transformer, I’m aiming for around 1,000 examples for each class (which could occur in as few as 1,000 training examples but will more likely require 3k-6k examples).
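Roughly how I’d spot-check the per-class counts before committing to the NER fine-tune (file name and field layout are placeholders):

```python
import json
from collections import Counter

# Count labeled terms per category across the GPT-labeled examples.
# Assumes one JSON object per line mapping category -> [terms].
counts = Counter()
with open("gpt_labeled.jsonl") as f:
    for line in f:
        for category, terms in json.loads(line).items():
            counts[category] += len(terms)

print(counts)  # aiming for ~1,000 per class before training the NER model
```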
Yeah, that makes more sense, I must have misunderstood your plan.
One reason I sent the link I did is that you may have some luck finding and modifying one or more of the pre-labeled datasets to use your labels, which would be a quick (and cheap) way to get a relatively large dataset to train from.
I’m actually also looking at a solution for NER data labelling right now, and I was thinking about implementing a fine-tuned version of the BERT NER model. Do you think this could also be a good option?
Context: I want to extract around 10 entity-value pairs from one particular type of document.
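For context, the off-the-shelf usage I’d start from looks like this (using a public CoNLL-trained checkpoint purely as an example); I’d then fine-tune it on my document type and label set before relying on it:

```python
from transformers import pipeline

# Stock CoNLL-trained checkpoint as a starting point; aggregation merges
# subword predictions into whole-entity spans.
ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")

for entity in ner("Jim Carrey wore a yellow suit in a theater."):
    print(entity["entity_group"], "->", entity["word"])
```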