It is important to understand that almost all language models, and all OpenAI models, are stateless input/output machines. You give them language as input: instructions, included data, desired behaviors, and what to write based on that, and you receive an output. Then the memory of that input is immediately gone.
Your categories are the biggest challenge, because the model needs to understand all of them at once. You can either provide a "prompt", an input context message with all the information required to perform the task, or you can develop as many example responses as your data allows and use them to fine-tune an AI model, reducing the amount of instructions and category description needed, because some of the decision making becomes built-in or inferred.
You would be pushing the model to the limits of its capabilities to have it make decisions on such a large classification instruction. In the domain of artificial intelligence, particularly when working with advanced language models such as GPT-3.5 and GPT-4, understanding the concept of "tokens" is crucial. Tokens represent the basic units of processing for these models; they are how input text is broken down and how output is generated. Tokenization divides text into manageable pieces, where each token typically averages around four characters, or roughly 1.25 tokens per word for English. This compression ratio varies across languages, generally being less efficient for languages with logographic characters or complex morphology.
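If you want a real count rather than an estimate, the tiktoken library exposes the same tokenizer the OpenAI models use. A minimal sketch, assuming a recent tiktoken install and some text already in a Python string:

```python
import tiktoken

# Use the tokenizer for the target model (GPT-3.5 and GPT-4 share an encoding)
enc = tiktoken.encoding_for_model("gpt-4")

text = "Tokens are the basic units of processing for these models."
tokens = enc.encode(text)

print(f"{len(text.split())} words -> {len(tokens)} tokens")
print(tokens[:10])  # the first few integer token IDs
```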
AI models like GPT-3.5 and GPT-4 have a defined "context window", which limits the amount of text (in tokens) they can consider at one time. For GPT-3.5, this window is about 16,000 tokens, while the newer GPT-4 Turbo series can handle up to 128,000 tokens. This capacity determines how much information the model can process in one go, including the output it produces (which, for a classification task, would be short).
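A rough sketch of checking whether a prompt fits, using the approximate window sizes above; the amount reserved for the reply is just an assumed placeholder:

```python
# Approximate context windows from the figures above (tokens).
# The window must hold the input AND the model's reply.
CONTEXT_WINDOWS = {
    "gpt-3.5-turbo": 16_000,
    "gpt-4-turbo": 128_000,
}

def fits(prompt_tokens: int, model: str, reserved_for_output: int = 1_000) -> bool:
    """Return True if the prompt plus a reserved reply budget fits the window."""
    return prompt_tokens + reserved_for_output <= CONTEXT_WINDOWS[model]

print(fits(6_000, "gpt-3.5-turbo"))  # True: 6,000 + 1,000 <= 16,000
```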
The cost of using these AI models is directly related to the number of tokens processed. Whether analyzing text, generating content, or classifying, every token processed incurs a cost. Efficient token usage is therefore not only a technical requirement but also a financial consideration. Again, there is no reuse of instructions, so every call is billed on the full input you send.
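As a sketch of how that adds up, here is a back-of-the-envelope estimator. The per-million-token prices are placeholders for illustration only, not actual OpenAI pricing, so check the current pricing page before relying on the numbers:

```python
# Placeholder prices (USD per 1M tokens) -- assumed, not real pricing.
PRICE_PER_1M_INPUT = 10.00
PRICE_PER_1M_OUTPUT = 30.00

def estimate_cost(input_tokens: int, output_tokens: int, calls: int = 1) -> float:
    """Every call is billed on its full input, instructions included."""
    per_call = (input_tokens * PRICE_PER_1M_INPUT
                + output_tokens * PRICE_PER_1M_OUTPUT) / 1_000_000
    return per_call * calls

# 1,000 classification calls, each resending ~6,000 tokens of instructions + data
print(f"${estimate_cost(6_000, 50, calls=1_000):.2f}")  # $61.50 at these example prices
```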
I thought I'd ask an AI to make a wild guess about how much this explanation would consume, also having it write code for calculations:
The 18-page document contains approximately 4,950 words. Based on the token calculation, it would require about 6,188 tokens to be processed by the GPT-4 AI language model.
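That guess is just the rule of thumb from earlier applied as arithmetic; something like:

```python
# Rule-of-thumb estimate: ~1.25 tokens per English word
words = 4_950
estimated_tokens = round(words * 1.25)
print(estimated_tokens)  # 6188
```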
You can paste plain text there to see what you are working with, whether it fits in a model all at once, and which model that might be: Tiktokenizer
OpenAI also has batch processing jobs, where you submit a special file of all the API calls you want performed, with a 24-hour turnaround, at 50% of the cost. These are still independent calls, just automated.
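A minimal sketch of what a batch submission looks like with the openai Python SDK; the model name and classification prompt here are assumed examples, not part of the original question:

```python
import json
from openai import OpenAI

client = OpenAI()

# One JSON line per independent request; custom_id lets you match results later.
requests = [
    {
        "custom_id": f"item-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4-turbo",  # assumed model for illustration
            "messages": [
                {"role": "system", "content": "Classify the text into one category."},
                {"role": "user", "content": text},
            ],
        },
    }
    for i, text in enumerate(["first document", "second document"])
]

with open("batch_input.jsonl", "w") as f:
    f.write("\n".join(json.dumps(r) for r in requests))

# Upload the file, then start the batch with the 24-hour completion window.
batch_file = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(batch.id, batch.status)
```

When the batch finishes, you download an output file keyed by those custom_id values, so each result maps back to the item it classified.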