API choice for research question

Hi! I have a research project with a quite specific use-case of large language models, and am looking for some advice on exactly what form of the APT API would be most appropriate.

I have several hundred reasonably short (100ish words each) dialogue extracts that I would like to classify into one of 66 categories. I also have a somewhat lengthy (18 pages) document explaining exactly what each of the 66 categories are and how to perform the classification.

It seems that something batch-based would be appropriate since the classification tasks are effectively independent, but I would rather not include the instruction document with every single request. Is there a way of submitting a batch-based job with some prior instructions shared between all of the requests?

All advice welcome, I’m still very new to this.

It is important to understand that almost all language models, and all OpenAI, are stateless input/output machines. You give language such as instructions, include data, behaviors, what to write based on that, and receive your output. The memory of the input is immediately gone.

Your categories are the largest challenge, needing thorough understanding all at once. This can either be providing “prompt”, input context message with all the information required to perform a task, or one can develop just as many example responses as you have data, that you could use to fine-tune an AI model, reducing the amount of actual instructions and category description quality, because some of the decision making could be built-in or inferred.

You would be pushing the model high into its capabilites to have decisions made on such a large classification instruction. In the domain of artificial intelligence, particularly when working with advanced language models such as GPT-3.5 and GPT-4, understanding the concept of ‘tokens’ is crucial. Tokens represent the basic units of processing for these models. They are a way of breaking down input text and generating output. This tokenization involves dividing text into manageable pieces, where each token typically averages around four characters or roughly 1.25 tokens per word for English. This compression ratio can vary with different languages, generally being less efficient for languages with more logographic characters or complex morphology.

AI models like GPT-3.5 and GPT-4 have a defined ‘context window’, which limits the amount of text (in tokens) they can consider at one time. For GPT-3.5, this window is about 16,000 tokens, while the newer GPT-4 Turbo series can handle up to 128,000 tokens. This capacity determines how much information the model can process in one go, including the output it produces (which here seems short).

The cost of using these AI models is directly related to the number of tokens processed. Whether analyzing text, generating content, or conducting classifications, each token that is processed incurs a cost. Therefore, efficient token usage is not only a technical requirement but also a financial consideration. Again, there is no reuse of instructions, so every call is a bill by how much input you place.

I thought I’d ask an AI to make a wild guess about how much this explanation would consume, also having it write code for calculations:

The 18-page document contains approximately 4,950 words. Based on the token calculation, it would require about 6,188 tokens to be processed by the GPT-4 AI language model.

So you can paste plain text here and see what you are working with and if it can be placed in a model all at once, and what that model might be: Tiktokenizer

OpenAI has batch processing jobs where you can submit a special file of all the API calls you want performed, with a 24 hour turnaround, and at 50% the cost. These are independent calls, just automated.

1 Like

Having spent quite some time on classification related tasks (single and multi-classification) involving LLMs, notably OpenAI’s models, my experience is with this amount of categories you are likely going to run into challenges by just using in-context learning approaches.

Fine-tuned models are capable to deal with this amount of categories. Additionally, there are options to use embeddings for classification tasks, which are a much cheaper and faster alternative.

Now before considering either as an option, here comes a key point. In both cases, you need a reasonable amount of training data or reference examples. In one of my fine-tuned models involving multi-classification with up to 100 different categories, I created between 20-50 examples for each classification category. While the fine-tuned model works pretty well, I do not want to downplay the amount of work that went into creating the training data. If this is a one-off task with “just” a few hundred dialogue extracts, then you might end up being faster and more accurate doing a quick manual labelling. However, if you are looking at creating a recurring process, then the amount of work may well be worth it.

I’m all for using LLMs but as also pointed out by @_j , there’s significant complexity involved with this amount of categories - even more so if they are not easily to distinguish.

1 Like

OK, thank you for the swift and informative responses! The total token count of the instructions explaining the 66 categories comes to 11797, so you were not too far off.

There exists some pre-labelled examples of dialogue that also number in the hundreds (in total, not per category), but the labelling process is very time-consuming, so the researchers have contacted me to see if I can help automate the process for the rest of them. It sounds like using fine-tuned models, with effectively using a training data set might help here, but for 66 categories then the training data set would probably need to have >1000 examples minimum?

There is a parallel classification task on the same data set looking to classify the dialogue examples into another mode of categorisation, this time with 5 total categories instead of 66. Again, there are a few hundred pre-labelled instances: I don’t have a prepared document from the researchers explaining the 5 labels, but I should think that they could make one easily enough that would probably be shorter. Would this be a realistic number of categories to successfully classify at this scale of data?

Five is a much more doable task. You could give this a try with the regular gpt-4 or a gpt-4-turbo model without fine-tuning and just some detailed instructions. Otherwise, fine-tuning with the amount of examples you have available should be a good start.

The number of training examples is not an exact science. If dialogues for a given category are extremely similar, then a smaller amount of examples per category could work. On the other hand, if the heterogeneity is high, then you need a more diverse set of training examples by category.

One other point worth noting: I am assuming you are intending to use the OpenAI endpoint for fine-tuning? If you are using Azure OpenAI for fine-tuning, then the costs for fine-tuning are significantly higher as you are required to also pay for hosting the models in addition to their consumption. That’s another important consideration when evaluating whether a fine-tuned model is the right choice.

One more thing that will be relevant: the instructions for the task will be in English, but all the dialogue examples are in Norwegian. Will there be a noticeable difference in non-English classification ability between gpt models?

I think it should not matter but you might want to run a few examples to double check.