I think that’s what he meant.

Imagine the following system prompt:

If the given user input has the words “ice cream” in it, answer with a 4000-word story about ice cream; otherwise answer with the letter A.

I mean that’s a primitive one, but I guess you get the point.

You can’t really know the output length of the model. All you can know is what would be the maximum.

You can know the output length limit when receiving, because the person making the API request has already specified the output reservation they want for forming the answer by providing the “max_tokens” parameter.

It is rather that your chatbot needs an interface or intelligence to set this when calling.

The API really needs an endpoint alias and redocumentation of max_tokens as “context_output_reservation”.

But that would require a minimum length.

“Extract all words that start with the letter A from the user input” will have a dynamic output length.

You can request 1,000 tokens even if the answer turns out much shorter.

The max_tokens parameter just sets the maximum.
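
For illustration, here is a minimal call with the openai Python package (the v0.x client and model name are assumptions; the prompt is a placeholder) showing that max_tokens is only a ceiling on the reply, not a target length:

```python
import openai  # assumes the v0.x openai package

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",  # placeholder model name
    messages=[{"role": "user", "content": "Name one flavor of ice cream."}],
    max_tokens=1000,        # upper bound on the reply; the model may stop much earlier
)
print(response["usage"]["completion_tokens"])  # typically far fewer than 1000
```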

That’s a good idea. We use an open-source reverse proxy that we modified to “sign” the body/payload of the message so only requests from our app can use our API key. (There’s a ton of API key abuse out there.)
We could modify this code to do what you’re talking about. Thanks.

Here’s the repository if anyone is interested. We haven’t updated the docs with how to use the “secret”, but it’s pretty straightforward. GitHub - bickster/openai-forward: 🚀 OpenAI API Reverse Proxy · ChatGPT API Proxy
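
For anyone curious what body signing can look like in general, here is a minimal sketch; this is not the repository’s actual implementation, just the usual HMAC-SHA256 pattern with a shared secret and a made-up header name:

```python
import hashlib
import hmac

SHARED_SECRET = b"replace-with-your-secret"  # assumed to be shared between the app and the proxy

def sign_body(body: bytes) -> str:
    """App side: compute an HMAC-SHA256 digest over the raw request body."""
    return hmac.new(SHARED_SECRET, body, hashlib.sha256).hexdigest()

def verify_body(body: bytes, signature_header: str) -> bool:
    """Proxy side: recompute and compare before forwarding the request to the OpenAI API."""
    return hmac.compare_digest(sign_body(body), signature_header)

# The app would send the digest in a custom header (the name "X-Body-Signature" is made up),
# and the proxy rejects any request whose signature fails verification.
```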

I’m not getting how you picture such a “token-adaptive model” working.

You want a model selector that operates per the current API specifications and upgrades to the larger model when the user mis-specifies, with more input tokens and more output reservation than the default context length?

Or rather, do you see specifying “infinite” max tokens and then some smartness happening when the base model ends with a “length” finish reason? By then it would already have run the small model up to its maximum just to discover that. And you’d need a completion engine that can be reloaded and do proper completions; otherwise you get ChatGPT’s poor-quality “continue” button being auto-pressed. And reloading another instance with your first 4k is probably not going to save compute resources over just starting with the large model.

Why would it be more expensive, unless I’m misunderstanding how we get charged? If you get back an error saying you’ve exceeded the maximum token length, you are not charged, or you are only charged for the prompt you sent, not the potential response you never received.

Basic MVP Workflow

  • Use 8k turbo model until you get a token max error
  • Take the prompts and responses from that request and send to 16k turbo model
  • Use 16k turbo model until you get a token max error
  • Take the prompts and responses from that request and send to 32k turbo model (if you have access)
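
Here is a rough, error-driven sketch of that workflow using the v0.x openai Python package; the model names, access to them, and the exact error code checked are all assumptions:

```python
import openai

# Escalation order; exact model names and availability are assumptions.
MODELS = ["gpt-3.5-turbo", "gpt-3.5-turbo-16k", "gpt-4-32k"]

def chat_with_escalation(messages):
    """Try the smallest model first; on a context-length error, resend the same
    conversation to the next larger model."""
    last_error = None
    for model in MODELS:
        try:
            return openai.ChatCompletion.create(model=model, messages=messages)
        except openai.error.InvalidRequestError as e:
            code = getattr(e, "code", None)
            # Heuristic check: context overflows come back as invalid requests.
            if code == "context_length_exceeded" or "context length" in str(e).lower():
                last_error = e
                continue  # escalate to the next larger model
            raise
    raise last_error
```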

Check my response to @elmstedt; I added an MVP workflow. I’m sure there are optimizations to make, but it’s an idea to get a non-scalable solution going…

You can calculate the tokens prior to sending.
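
For example, with the tiktoken package (the per-message overhead below is only an approximation of the chat format’s extra tokens):

```python
import tiktoken

def count_prompt_tokens(messages, model="gpt-3.5-turbo"):
    """Rough token count for a chat payload before sending it."""
    enc = tiktoken.encoding_for_model(model)
    total = 3  # approximate priming tokens for the assistant's reply
    for m in messages:
        total += len(enc.encode(m["content"])) + 4  # ~4 tokens of role/formatting overhead per message
    return total

print(count_prompt_tokens([{"role": "user", "content": "Summarize the attached report."}]))
```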


You can only calculate the prompts or payload you are sending. You don’t know the response length ahead of time. Right?

You can assume it, depending on the prompt and given that the model answers as desired.
If your prompt is good and you use the model for data analysis, you have a good chance.


A good way of describing it:

  • you don’t know at what length the AI will be inspired to write a response, yet
  • you are required to specify ahead of time the amount of the model’s context length that is set aside as the portion that will be exclusively used for generating that response.

Yes, we understand the idea behind max_tokens.
But what does this have to do with the topic?

Yes. Does anyone see any issues with the “MVP Workflow” besides there being a short delay between each model switch? Obviously, it would be nice if OpenAI could handle this, but until then, do you think the idea could work?

https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them

Besides that, which is what makes it possible to implement that workflow, I see accepting 1 MB prompts on an 8k-token model as a bug (if it does that like it does on Azure; I haven’t tested it, since it would charge me to do so).

I mean, on gpt-3.5, let’s say you are trying to summarize 3,500 files of 1 MB+ in parallel, which is the possible maximum. That might lead to 3,500 failed requests with 3.5 billion tokens, which would cost something like $7k per minute.

That might also lead to some sort of DDoS.
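
For reference, here is the arithmetic behind that figure; the token count per file is implied by the numbers above, and the per-1K price is an assumption:

```python
# Back-of-the-envelope check of the figures above.
requests = 3500
tokens_per_request = 1_000_000   # implied by 3.5 billion tokens across 3,500 files
price_per_1k = 0.002             # assumed gpt-3.5 price in USD per 1K tokens

total_tokens = requests * tokens_per_request   # 3,500,000,000 tokens
cost = total_tokens / 1000 * price_per_1k      # 7,000.0 USD
print(total_tokens, cost)
```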

The problem is that API users cannot suppress the special tokens that are added to delineate roles, nor can they turn off the unseen “assistant” prompt and carriage return that is created before the AI answers.

This means that for us, there can be no true completion of another model’s truncated output, where you can continue where you left off - without those interruptions and token interjections that break coherency.

Since you probably maxed out the context of the smaller model with your own inputs (underestimating max_tokens) and not so much with what the AI generated… you can just ask again.


Because you can’t switch models midstream, it’s just not possible.

While you can absolutely know the response token limit ahead of time, you cannot know how many tokens the model would want to use for a response until it gets there.

So, the only way to have the 16k-model continue from where the 4k-model ended would be to take the inputs and outputs from the 4k-model and use them as inputs to the 16k-model.

Now, if you mean something different, like starting out using the 4k-model, then when the context length reaches 4k tokens continuing the chat with the 16k-model… Well, that’s easy enough to do and something anyone could have done at any time via the API, so I assumed that’s not what you meant.
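
That proactive variant could look something like this; the threshold, the per-message overhead estimate, and the model names are assumptions:

```python
import tiktoken

ENC = tiktoken.encoding_for_model("gpt-3.5-turbo")

def pick_model(messages, reply_reservation=1000, small_context=4096):
    """Switch to the larger model once prompt tokens plus the reply reservation
    would no longer fit the smaller model's context window."""
    prompt_tokens = sum(len(ENC.encode(m["content"])) + 4 for m in messages)  # +4 ~ per-message overhead
    if prompt_tokens + reply_reservation <= small_context:
        return "gpt-3.5-turbo"
    return "gpt-3.5-turbo-16k"  # assumed name of the larger-context model
```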


That’s what @elmstedt said.
And I totally agree, although I am sure that with a few million in investment and some time there is still a way.

Or a little less than $1 mil: a “continue as assistant” flag that doesn’t include the final role closure or additional injection.

I am generally not happy with long output anyway. That’s why I am using multiple workers/agents to get results based on weighted criteria and subcriteria, letting the model give me a confidence value and a score, and then working with standard deviation and variance at the software level to enhance the results.
Also, letting another agent find the possible source, from two given sources (chunks of software, yay), that fits the output is a way to get better results.

Like in partial credit assignment or psychological assessments.

But it has a price.


Yes. Anyone can do what I’m suggesting and could implement it to save some $$. My goal with this topic was to:

  1. Bring the idea of an adaptive model to the community for discussion and hope that OpenAI feels it’s a feature they should/could add.
  2. Propose a simple solution to accomplish the goal of optimizing token spend.