I have a large text (~9k words), and I have to do 4 tasks on this data using prompts. I have developed prompts for all the tasks and they work fine. My concern is that the context window of gpt-3.5-turbo-16k is only 16k tokens, and I am afraid that my request plus response could exceed 16k tokens in some cases. What would be the best practice for such a scenario? I want to send the text data to the API only once; I do not want to send the data 4 times for 4 different prompts. The response from my API is sent back to another application that uses it. I am a bit confused here, because I could make 4 different API calls for this task, but then I would have to send the data four times.
What I want:
- Send the text data once to gpt-3.5-turbo-16k.
- Perform the tasks and return a single response.
Are there any techniques/tips for handling such a scenario?
Sorry if I’m misunderstanding what you said, but the 16k-token window is the prompt + response size combined, so if your prompt is roughly 9k words (~11k tokens) plus some instructions, the response can never be more than about 5k tokens.
Unless the whole text is needed for all 4 tasks, I would suggest using embeddings to embed the text and use only those portions for each task that you need.
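A minimal sketch of that embedding approach, assuming the text is split into chunks that are embedded once and then ranked per task by cosine similarity (function names are illustrative; the vectors themselves would come from an embeddings endpoint such as text-embedding-ada-002):

```python
import math

def chunk_text(text, max_words=300):
    """Split the text into word-bounded chunks small enough to embed."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_chunks(task_vec, chunk_vecs, chunks, k=3):
    """Return the k chunks most similar to the task's embedded instruction."""
    ranked = sorted(zip(chunk_vecs, chunks),
                    key=lambda pair: cosine(task_vec, pair[0]),
                    reverse=True)
    return [chunk for _, chunk in ranked[:k]]
```

Each task then gets only its top-k chunks in the prompt instead of the full 9k words.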
Depending on the complexity of the tasks, getting GPT to do 4 tasks at once might lead to more errors: in my experience, the more different things you ask of it, the more accuracy drops across all 4 tasks. My suggestion would be to use 4 different API calls, but with smaller context windows to minimise the cost at your end.
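A rough sketch of that four-call pattern (the task instructions and `max_tokens` value here are made up, not your actual prompts):

```python
# One request per task; each call carries only that task's instruction.
TASK_PROMPTS = {
    "summary": "Summarize the following text in about 150 words.",
    "outline": "Produce a hierarchical outline of the following text.",
    "keywords": "List search keywords for the following text as a Python list.",
    "categories": "Give four categories that classify the following text.",
}

def build_request(instruction, text, max_tokens=1024):
    """Assemble the payload for one chat-completion call."""
    return {
        "model": "gpt-3.5-turbo-16k",
        "max_tokens": max_tokens,
        "messages": [
            {"role": "system", "content": instruction},
            {"role": "user", "content": text},
        ],
    }

requests = {name: build_request(prompt, "…document text…")
            for name, prompt in TASK_PROMPTS.items()}
# Each payload would then be sent separately, e.g. with the OpenAI SDK:
# openai.ChatCompletion.create(**requests["summary"])
```

The four responses can then be merged into the single response your application expects.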
You got me right. When I ask it to do all 4 tasks in one prompt, the results are not acceptable and the model does not understand the instructions.
I cannot use embeddings, as I need all the textual data for the analysis (I want analytical information beyond just Q&As).
I am sure that 4 API calls are all I need, but I was wondering if I could do this with a single call. Let’s say I send the data once and then do not send it again, so the following calls contain only the instructions for each task; the contextual data would be kept in memory (or some other storage medium).
The quality of inference will drop with longer and longer context contents. Your instructions don’t mean as much when there’s also a massive token dump to consider.
You can make the exact tasks and outputs as clear as possible by spelling them out in a system prompt, for example:
AI instructions: output generation tasks:
- The AI will print a summary of the entire text the user has provided, then
- The AI will print an outline for the article, then
- The AI will print metadata search keywords for the article as a python list, then
- The AI will print four categories for classifying the text.
And then the user data in the user message.
To really reinforce the instruction at the top, another system message can be repeated at the bottom, or the instruction injected after the demarcated user text.
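As a sketch, that message layout might look like this (the demarcation markers are illustrative; the instruction text is the example above):

```python
INSTRUCTIONS = """AI instructions: output generation tasks:
- The AI will print a summary of the entire text the user has provided, then
- The AI will print an outline for the article, then
- The AI will print metadata search keywords for the article as a python list, then
- The AI will print four categories for classifying the text."""

def build_messages(document_text):
    """Instructions on top, demarcated user text, instructions repeated at the bottom."""
    return [
        {"role": "system", "content": INSTRUCTIONS},
        {"role": "user", "content": f"TEXT START\n{document_text}\nTEXT END"},
        {"role": "system", "content": INSTRUCTIONS},  # reinforcement after the token dump
    ]
```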
Also consider that the generated output itself adds tokens that affect or inform the remaining generation. Process the tasks in an order that maximizes comprehension.
As to the initial question about “exceeding”: max_tokens is the portion of the context length that is reserved specifically for generating an output; the input must then fit into the remaining space. You can use a local tokenizer to measure the size of the input and reserve the maximum output permissible, although a task whose full output won’t fit, such as “improve the grammar of these 12k tokens”, must necessarily be truncated.
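A small helper for that budgeting, assuming tiktoken is available for exact counts, with a crude character-based fallback when it isn’t (the 50-token safety margin is an arbitrary choice):

```python
def count_tokens(text, model="gpt-3.5-turbo-16k"):
    """Measure input size with a local tokenizer; fall back to a rough estimate."""
    try:
        import tiktoken
        return len(tiktoken.encoding_for_model(model).encode(text))
    except Exception:
        return len(text) // 4  # crude heuristic: ~4 characters per token

def output_budget(prompt, context_limit=16384, safety_margin=50):
    """max_tokens that can be reserved for the reply after counting the prompt."""
    remaining = context_limit - count_tokens(prompt) - safety_margin
    if remaining <= 0:
        raise ValueError("prompt alone exceeds the context window")
    return remaining
```

Passing the result as `max_tokens` reserves the largest output the window still allows for that prompt.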