Hi, I am pretty new to GPT-3. I am working on a project to classify thousands of scraped reviews into about 150 categories based on their meaning. Can GPT-3 help me achieve that?
In short, yes. GPT-3 can do that. Here is the classification guide: OpenAI API
The Classifications endpoint (/classifications) provides the ability to leverage a labeled set of examples without fine-tuning and can be used for any text-to-label task. By avoiding fine-tuning, it eliminates the need for hyper-parameter tuning. The endpoint serves as an “autoML” solution that is easy to configure, and adapt to changing label schema. Up to 200 labeled examples or a pre-uploaded file can be provided at query time.
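To make the “examples at query time” idea concrete, here is a minimal sketch of what a request to that endpoint looks like. The review texts and category labels are made up for illustration, and the `openai.Classification.create` call shape shown in the comment follows the (since-deprecated) openai-python client, so treat it as an assumption rather than a current API reference:

```python
# Labeled examples supplied inline at query time (hypothetical reviews/labels).
examples = [
    ["Great battery life, works exactly as described", "Product quality"],
    ["The courier left my package out in the rain", "Delivery"],
    ["Support never answered my refund request", "Customer service"],
]
labels = sorted({label for _, label in examples})
query = "Arrived two weeks late and the box was damaged"

# With the (now-deprecated) openai-python Classification helper, the call was
# roughly:
#   import openai
#   result = openai.Classification.create(
#       search_model="ada", model="curie",
#       examples=examples, labels=labels, query=query,
#   )
#   predicted = result["label"]

print(labels)
```

For a label set as large as 150 classes, a pre-uploaded file is the more practical route than inline examples, since the inline list is capped at 200.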
Agreed. I don’t know much about embeddings, since I haven’t used them. Going to try them myself.
Thank you so much for your reply! sps!!
So with the endpoint’s autoML solution, I can upload 200 labeled examples to train the model and then use the model on my own dataset? I have several million records that I would like to classify into those roughly 150 classes. Another question: those 150 classes are what we think the records belong to. Can OpenAI GPT-3 do some unsupervised machine learning to tell us whether there are other topics we could cluster, besides those 150 classes? I tried embeddings with K-means myself, but the results weren’t that meaningful.
Thank you so much m-a.schenk !
What I am trying to do is classify my multi-million records and find new classes if some records don’t fit those 150. Do you mean the new embeddings function in OpenAI will be available in a few days?
That’s an interesting question, whether we can do some unsupervised learning on our data using the classification endpoint. If my understanding is correct, it’s not possible at the moment because the endpoint needs labeled examples. Maybe an ‘other’ label could be created to group all the data that doesn’t belong to any predefined label, but how that would be achieved, and with what examples, is another interesting question.
I would use embeddings. Get the embedding for each of your n million records, and do the same for each of your 150 classes, then compute similarity scores to measure how close each record is to each class, semantically speaking. You’ll have n million × 150 scores. (That’s a lot, and I am not sure of the cost or speed implications.) Then you can assign each record to the class with the highest similarity score.

If the highest and second-highest similarity scores are quite close, manually review those records or assign them to more than one class, if that works for your use case.

If a record’s highest similarity score is notably lower than the average highest similarity score, that’s a red flag that a new class may need to be created. If there are lots of those, you can probably cluster them into groups based on semantic similarity, then extract some keywords for each cluster to help develop the new classes.
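The steps above can be sketched in a few lines of NumPy. This is a toy sketch: random vectors stand in for the embeddings you would actually fetch from the OpenAI embeddings endpoint, and the 0.05 / 0.2 thresholds are made-up values you would tune on your data:

```python
import numpy as np

def cosine_sim_matrix(A, B):
    """Row-wise cosine similarity between two sets of vectors."""
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A @ B.T

rng = np.random.default_rng(0)
record_emb = rng.normal(size=(5, 8))  # stand-in for embeddings of 5 records
class_emb = rng.normal(size=(3, 8))   # stand-in for embeddings of 3 class descriptions

scores = cosine_sim_matrix(record_emb, class_emb)  # shape (n_records, n_classes)
best = scores.argmax(axis=1)                       # class assignment per record

# Top two scores per record, sorted ascending: [:, -1] is best, [:, -2] runner-up.
top2 = np.sort(scores, axis=1)[:, -2:]

# Flag records whose top two classes are nearly tied (threshold is arbitrary).
needs_review = (top2[:, 1] - top2[:, 0]) < 0.05

# Flag records whose best score is well below the typical best score: these
# are candidates for a new class, and could be clustered as a second step.
avg_best = top2[:, 1].mean()
maybe_new_class = top2[:, 1] < avg_best - 0.2
```

In practice you would batch the embedding requests and may want approximate nearest-neighbor search instead of the full n million × 150 score matrix, but the assignment and red-flag logic stays the same.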
Agreed. This is a great explanation of how to best use embeddings in this scenario, both cost-wise and given the amount of data that needs to be processed.
@lmccallum Thank you so much! I tried K-means on embeddings before, but it wasn’t that helpful. Your solution sounds great. Do you think it is feasible to use GPT-3 / the OpenAI API to solve the problem? Again, thank you so much!
If you know Python:
Here’s an example of classification using embeddings: openai-python/Classification.ipynb at main · openai/openai-python · GitHub
And here’s an example of classification using fine-tuned completions: openai-python/finetuning-classification.ipynb at main · openai/openai-python · GitHub