How Can I Use the OpenAI API to Categorize Large Amounts of Text Data?

I’m working with a large dataset of customer feedback, specifically reasons for subscription cancellation, sourced from a web form. At the core of my enquiry is the desire to structure this set of texts so it becomes easier to grasp and understand.

One way to achieve this is to sort the feedback into a manageable number of categories for easier analysis.

A well-defined set of categories, along with the number of texts falling into each, would probably be a good starting point.

I want the categories themselves to be produced by the AI, simply because it seems to perform that task quite well when I’ve attempted this with smaller sets of texts.

I’ve explored using OpenAI’s GPT in the playground, and it seems promising - when I input a small sample of around 50 texts, it can generate relevant categories and sort the texts accordingly.

However, I’m having trouble scaling this up. I have thousands of texts to categorize and I’m hitting an issue with the API’s input limitation.

My goal is to create a categorization structure that can handle this larger volume of texts. Is there a way to achieve this using the OpenAI API? Or perhaps a workaround for the input size limitation? Any guidance or alternative approach to categorizing this dataset would be appreciated.

Is there a reason you can’t split the input across multiple API calls and merge the results afterward?
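As a sketch of that splitting step (the function name, placeholder data, and the batch size of 50 are my own assumptions, chosen to match the sample size mentioned above):

```python
# Sketch: split the full dataset into batches that fit within the model's
# input limit, then send one API request per batch and merge the replies.

def chunk(texts, batch_size=50):
    """Yield successive batches of at most batch_size texts."""
    for i in range(0, len(texts), batch_size):
        yield texts[i:i + batch_size]

# Placeholder data standing in for the real feedback texts.
texts = [f"feedback {n}" for n in range(1000)]
batches = list(chunk(texts))
# Each batch would then go into its own API call.
```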


I should have explained more clearly. The problem is that the set of categories will change depending on the input. Say I have 1000 texts and the task is to create 10 categories and distribute the texts across them. If I make a request with, say, the first 50 texts, it will produce categories c1, c2 … c10. But these won’t be as relevant for the upcoming requests, which contain different texts.

So as I see it, I can either ask it to produce new categories for each request, which would end up producing up to 1000/50 * 10 = 200 categories (instead of the desired 10). Or I can take the categories produced by the first request and force the subsequent requests to classify the texts into those. But those categories won’t be nearly as relevant for the remaining texts, since they were produced when the model only had access to a small subset of the data.

Or, you can take the categories produced by the first request and then, on the second request, submit that list and ask the model to assign texts that match the existing categories and to create new categories for those that don’t. For the third request you submit the modified list, and so on.
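A minimal sketch of that rolling-category loop. The `categorize` function is a stand-in for the actual API call (which would prompt the model with the batch plus the current category list and parse its reply); here it fakes a result so the loop structure itself is runnable. All names are hypothetical:

```python
def categorize(batch, known_categories):
    """Stand-in for the model call: assign each text to an existing
    category, or report new categories for texts that fit none.
    This fake version lumps everything into one category."""
    assignments = {text: (known_categories[0] if known_categories else "misc")
                   for text in batch}
    new_categories = [] if known_categories else ["misc"]
    return assignments, new_categories

def rolling_categorize(texts, batch_size=50):
    categories = []   # grows as the model proposes new categories
    labeled = {}      # text -> assigned category
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        assignments, new_cats = categorize(batch, categories)
        labeled.update(assignments)
        # keep only genuinely new categories before the next request
        categories.extend(c for c in new_cats if c not in categories)
    return labeled, categories

labeled, cats = rolling_categorize([f"text {n}" for n in range(120)])
```

The key design point is that `categories` is threaded through every request, so later batches are classified against everything seen so far rather than against a list frozen after the first 50 texts.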

But it seems to me the crux of this situation is determining what makes one category distinct from another, and then crafting a prompt that ensures your model follows those specifications. Rather than asking “give me 10 categories”, ask it to create categories based upon x, y and z. At the end, if you have more than 10, consolidate those that can be consolidated.
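The final consolidation pass could be a single extra request over the category list alone (which is small, so input limits are no longer a concern). A sketch of building such a prompt; the example categories and the prompt wording are my own assumptions, not a tested prompt:

```python
# Hypothetical category list accumulated across the batched requests.
categories = ["billing too high", "price increase", "missing feature",
              "switched provider", "moved away", "too expensive"]

# Build a consolidation prompt asking the model to merge near-duplicates.
prompt = (
    "Here are the categories produced so far:\n"
    + "\n".join(f"- {c}" for c in categories)
    + "\n\nMerge any categories that describe the same cancellation reason "
    "and return at most 10 consolidated categories."
)
```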

This definitely looks like a case where a 32K to 100K token context window would come in handy.