We would like to use the gpt-3.5-turbo model up to its maximum token length and then, if a prompt or response goes over that limit, have the “open ai system” automatically switch to gpt-3.5-turbo-16k. There are a lot of users whose sessions never reach the gpt-3.5-turbo max token length. Plus, this would allow us to pay the gpt-3.5-turbo price for the first 4,000 tokens and the gpt-3.5-turbo-16k price for anything after. We’ve implemented this in our app, but it would be super nice if it just happened with the API.
This is an interesting—if impractical—idea given how LLMs actually work.
The only way this could realistically work would be to re-send the entire context, including the interrupted response, to the new model. That would get costly; you’d be better off using the larger model to begin with.
That said, you could do some cool things with dynamic model selection and memory retrieval where you might choose the 16k model if the embeddings you retrieved happened to need more tokens than would be practical with the 8k model.
I see a good opportunity for a server cluster acting as a load balancer/proxy in front of the models. It could do jobs like counting the tokens of incoming prompts (which is perfectly possible without calling the model), and also group parallel requests, sum up their tokens, and respond with a token-limit-reached error before the model is ever hit.
I bet it would be cheaper for OpenAI as well, if something like this isn’t in place already; those server clusters don’t need fancy TPUs.
Also, from my own experience (on Azure, not OpenAI): sending 1MB+ prompts to a model multiple times should not lead to a “the model’s capacity ends at 8k, you sent a million” error coming from the model itself. Just measure the number of characters or limit the upload/request size up front.
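As a sketch of that character-based pre-flight check: the ~4 characters per token ratio below is only a rough rule of thumb for English text, and the model limits and function names are illustrative; a real proxy would run the actual tokenizer instead.

```python
# Rough pre-flight check a proxy could do without ever calling the model.
# ASSUMPTION: ~4 characters per token (rule of thumb, not exact);
# a production proxy would use the real tokenizer.

MODEL_LIMITS = {
    "gpt-3.5-turbo": 4096,
    "gpt-3.5-turbo-16k": 16384,
}

def estimate_tokens(text: str) -> int:
    """Cheap token estimate from character count."""
    return max(1, len(text) // 4)

def fits(prompt: str, model: str, max_tokens: int) -> bool:
    """True if the input estimate plus reserved output fits the context window."""
    return estimate_tokens(prompt) + max_tokens <= MODEL_LIMITS[model]
```

A proxy could reject oversized requests with this check alone, long before any GPU is involved.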
For the response side, I suppose there would be a lot more involved. Much more than you would be willing to pay for that service.
The AI model will not magically exceed its requested context length. The user’s input tells us exactly the size requirement: input tokens plus reserved output (max_tokens). The only little tweak is that the routing has to happen after tokenization, or be informed by a separate tokenization pass performed at the edge.
The trick for OpenAI would be to offer the auto-routing service themselves at the same billing: getting each input to the model with the smallest context length that fits.
You can know the output length when receiving the request, because the person making the API call has specified the output reservation they want for forming the answer via the “max_tokens” parameter.
It is rather that your chatbot needs an interface or intelligence to set this when calling.
The API really needs an endpoint alias and redocumentation of max_tokens as “context_output_reservation”.
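The routing rule described above (smallest context that fits input plus reserved output) can be sketched in a few lines. The context limits are the published gpt-3.5 tiers; the function name and shape are hypothetical, not a real API:

```python
# Hypothetical "smallest context that fits" router.
# TIERS holds the published gpt-3.5 context sizes; pick_model and its
# signature are illustrative, not part of any real API.

TIERS = [("gpt-3.5-turbo", 4096), ("gpt-3.5-turbo-16k", 16384)]

def pick_model(input_tokens: int, max_tokens: int) -> str:
    # max_tokens acting as the "context_output_reservation" discussed above
    needed = input_tokens + max_tokens
    for model, limit in TIERS:
        if needed <= limit:
            return model
    raise ValueError(f"request needs {needed} tokens; no tier is large enough")
```

With this, a 3,000-token input reserving 500 output tokens stays on the cheap model, while the same input reserving 2,000 tokens gets bumped to 16k.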
That’s a good idea. We use an open source reverse proxy that we modified to “sign” the body/payload of the message so only requests from our app can use our API key. (There’s a ton of API key abuse out there.)
We could modify this code to do what you’re talking about. Thanks.
I’m not getting how you picture such a “token adaptive model” working.
Do you want a model selector, operating per the current API specifications, that upgrades to the larger model when the user specifies more input tokens plus output reservation than the default context length allows?
Or do you instead picture specifying “infinity” max tokens, with some smartness happening when the base model ends with a length finish reason? By then it would already have run the small model up to its max just to discover that. And you’d need a completion engine that can be reloaded and do proper continuations; otherwise you get ChatGPT’s poor-quality “continue” button being auto-pressed. Reloading another instance with your first 4k is probably not going to save compute resources over just starting with the large model.
Why would it be more expensive? Unless I’m misunderstanding how we get charged: if you get back an error saying you’ve exceeded the max token length, you are not charged, or are only charged for the prompt you sent, not the potential response you never received.
Basic MVP Workflow
Use 8k turbo model until you get a token max error
Take the prompts and responses from that request and send to 16k turbo model
Use 16k turbo model until you get a token max error
Take the prompts and responses from that request and send to a 32k model (if you have access)
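The steps above can be sketched as a retry loop. The exception handling here is illustrative: it assumes the client surfaces a context-length error whose message can be recognized, and the exact exception types, messages, and model names (including gpt-4-32k as the third tier) are assumptions, not confirmed API behavior:

```python
# Sketch of the MVP escalation workflow above.
# ASSUMPTIONS: call_api raises an error mentioning "context length" on
# overflow, and the escalation tiers below are available to the caller.

ESCALATION = ["gpt-3.5-turbo", "gpt-3.5-turbo-16k", "gpt-4-32k"]

def chat_with_fallback(messages, call_api):
    """call_api(model, messages) returns the reply text, or raises on overflow."""
    last_error = None
    for model in ESCALATION:
        try:
            return call_api(model, messages)
        except RuntimeError as err:  # stand-in for the API's overflow error
            if "context length" not in str(err):
                raise  # unrelated error: don't swallow it
            last_error = err  # re-send the whole conversation to the next tier
    raise last_error
```

Note that each escalation re-sends the full conversation, so the switch costs one failed round trip plus a fresh charge at the bigger tier.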
Yes. Does anyone see any issues with the “MVP Workflow” besides the short delay at each model switch? Obviously it would be nice if OpenAI could handle this, but until then, do you think the idea could work?
Besides that point, which is what makes the workflow implementable, I see accepting 1MB prompts on an 8k-token model as a bug (if OpenAI behaves like Azure does here; I haven’t tested, since it would charge me to do so).
I mean, on gpt-3.5, say you try to summarize 3,500 1MB+ files in parallel (the possible maximum): that could lead to 3,500 failed requests carrying 3.5 billion tokens.
Which would cost something like $7k per minute.
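As a back-of-envelope check of that figure, assuming roughly 1M tokens per 1MB+ file and a $0.002 per 1K-token input price (both rough assumptions, not exact pricing):

```python
# Back-of-envelope check of the "$7k" figure above.
# ASSUMPTIONS: ~1M tokens per 1MB+ file, $0.002 per 1K input tokens.
requests = 3500
tokens_per_request = 1_000_000
price_per_1k_tokens = 0.002

total_tokens = requests * tokens_per_request      # 3.5 billion tokens
cost = total_tokens / 1000 * price_per_1k_tokens  # ~$7,000
```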
The problem is that API users cannot suppress the special tokens that are added to delineate roles, nor can they turn off the unseen “assistant” prompt and carriage return that is created before the AI answers.
This means that, for us, there can be no true completion of another model’s truncated output, picking up exactly where it left off, without those interruptions and token interjections breaking coherence.
Since you probably maxed out the smaller model’s context with your own inputs (by underestimating max_tokens) rather than with what the AI generated, you can just ask again.
Because you can’t switch models midstream, it’s just not possible.
While you can absolutely know the response token limit ahead of time, you cannot know how many tokens the model would want to use for a response until it gets there.
So, the only way to have the 16k-model continue from where the 4k-model ended would be to take the inputs and outputs from the 4k-model and use them as inputs to the 16k-model.
Now, if you mean something different, like starting out with the 4k-model and then, when the context length reaches 4k tokens, continuing the chat with the 16k-model… well, that’s easy enough to do and something anyone could have done at any time via the API, so I assumed that’s not what you meant.