But that would require a minimum length.

Extracting all words that start with the letter A from the user input will produce output of dynamic length.

You can request 1,000 tokens regardless of how long you actually want the response to be.

The max_tokens parameter just sets the maximum.

That’s a good idea. We use an open-source reverse proxy that we modified to “sign” the body/payload of the message so only requests from our app can use our API key. (There’s a ton of API key abuse out there.)
We could modify this code to do what you’re talking about. Thanks.

Here’s the repository if anyone is interested. We haven’t updated the docs with how to use the “secret”, but it’s pretty straightforward. GitHub - bickster/openai-forward: 🚀 OpenAI API Reverse Proxy · ChatGPT API Proxy
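For anyone curious how that kind of body signing can work in principle, here’s a rough sketch (the shared secret, function names, and scheme are illustrative assumptions, not necessarily what openai-forward actually does): the client computes an HMAC over the raw request body with a shared secret, and the proxy recomputes it and compares before forwarding.

```python
# Illustrative HMAC body-signing scheme (hypothetical; not the actual
# openai-forward implementation).
import hmac
import hashlib

SHARED_SECRET = b"replace-with-a-long-random-secret"

def sign_body(body: bytes) -> str:
    """Client side: hex HMAC-SHA256 signature over the raw request body."""
    return hmac.new(SHARED_SECRET, body, hashlib.sha256).hexdigest()

def verify_body(body: bytes, signature_header: str) -> bool:
    """Proxy side: recompute the signature and compare in constant time."""
    return hmac.compare_digest(sign_body(body), signature_header)
```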

I’m not getting how you picture such a “token adaptive model” working.

You want a model selector that operates per the current API specifications and upgrades to the larger model when the user mis-specifies, sending more input tokens plus output reservation than the default model’s context length allows?

Or do you instead see specifying “infinity” max tokens and then some smartness happening when the base model ends with a length message? By that point it would already have run the small model up to its max just to discover that. And you’d need a completion engine that can be reloaded and do proper completions; otherwise you get ChatGPT’s poor-quality “continue” button being auto-pressed. And reloading another instance with your first 4k is probably not going to save compute resources over just using the large model from the start.

Why would it be more expensive? Unless I’m misunderstanding how we get charged: if you get back an error that says you’ve exceeded the max token length, you are either not charged or only charged for the prompt you sent, not for the potential response you never received.

Basic MVP Workflow

  • Use the 8k turbo model until you get a max-token error
  • Take the prompts and responses from that request and send them to the 16k turbo model
  • Use the 16k turbo model until you get a max-token error
  • Take the prompts and responses from that request and send them to the 32k turbo model (if you have access)
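Something like this, as a minimal sketch of that loop (assuming the pre-1.0 openai Python package and that an oversized request raises an InvalidRequestError mentioning the context length; the model names and error matching are illustrative):

```python
# Try the smallest model first; step up the ladder on a context-length error.
import openai

MODEL_LADDER = ["gpt-3.5-turbo", "gpt-3.5-turbo-16k", "gpt-4-32k"]

def chat_with_escalation(messages, max_tokens=1024):
    last_error = None
    for model in MODEL_LADDER:
        try:
            return openai.ChatCompletion.create(
                model=model,
                messages=messages,
                max_tokens=max_tokens,
            )
        except openai.error.InvalidRequestError as e:
            if "context length" in str(e).lower():
                last_error = e   # too big for this model, try the next one
                continue
            raise                # anything else is a real error
    raise last_error
```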

Check my response to @elmstedt; I added an MVP workflow. I’m sure there are optimizations to make, but it’s an idea to get a non-scalable solution going…

You can calculate the tokens prior to sending.
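For example, with the tiktoken library (note this counts the raw text only; chat requests add a few formatting tokens per message on top):

```python
# Count prompt tokens locally before sending.
import tiktoken

def count_tokens(text: str, model: str = "gpt-3.5-turbo") -> int:
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text))

print(count_tokens("Extract all words that start with the letter A."))
```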

You can only calculate the prompts or payload you’re sending. You don’t know the response length ahead of time. Right?

You can assume it, depending on the prompt and given that the model answers as desired.
If your prompt is good and you use the model for data analysis, you have a good chance.

A good way of describing it:

  • you don’t know at what length the AI will be inspired to write a response, yet,
  • you are required to specify ahead of time the amount of the model’s context length that is set aside as the portion that will be exclusively used for generating that response.
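In other words, the prompt and the reservation have to fit in the window together. As a quick illustration (the numbers are made up):

```python
# The request only fits if prompt tokens plus the reserved max_tokens
# stay within the model's context window.
context_window = 8192   # e.g. an 8k model
prompt_tokens = 7000    # counted ahead of time (illustrative number)
max_tokens = 1500       # the reservation for the reply

print(prompt_tokens + max_tokens <= context_window)  # False: 8500 > 8192, request rejected
```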

Yes, we understand the idea behind max_tokens.
But what does this have to do with the topic?

Yes. Does anyone see any issues with the “MVP Workflow”, besides the short delay between each model switch? Obviously, it would be nice if OpenAI could handle this, but until then, do you think the idea could work?

https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them

Besides that, which makes it possible to implement that workflow, I see accepting 1 MB prompts on an 8k-token model as a bug (if it behaves like it does on Azure; I haven’t tested, since it would charge me to do so).

I mean on gpt-3.5: let’s say you are trying to summarize 3,500 files of 1 MB+ in parallel, which is the possible maximum. That might lead to 3,500 failed requests carrying 3.5 billion tokens,
which would cost something like $7k per minute.
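The rough math behind those numbers, assuming about 1M tokens per oversized request and a gpt-3.5-turbo price of roughly $0.002 per 1K tokens (both assumptions):

```python
# Back-of-the-envelope cost if 3,500 oversized requests were accepted and billed.
requests = 3500
tokens_per_request = 1_000_000     # assumed ~1M tokens per 1 MB+ file
price_per_1k_tokens = 0.002        # assumed gpt-3.5-turbo price in USD

total_tokens = requests * tokens_per_request       # 3.5 billion tokens
print(total_tokens / 1000 * price_per_1k_tokens)   # 7000.0 -> roughly $7k
```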

That might also lead to some sort of DDoS.

The problem is that API users cannot suppress the special tokens that are added to delineate roles, nor can they turn off the unseen “assistant” prompt and carriage return that is created before the AI answers.

This means that for us, there can be no true completion of another model’s truncated output, where you can continue where you left off - without those interruptions and token interjections that break coherency.
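To make that concrete, here is roughly what the raw prompt looks like after the chat messages are framed for the model (an approximation of the ChatML layout shown on the tiktokenizer page linked later in the thread; the exact tokens may differ):

```python
# Approximate framing of a "continue" attempt. The closing <|im_end|> after the
# truncated reply and the forced fresh assistant header are exactly the
# interjections that break a seamless continuation.
raw_prompt = (
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\nPlease summarize the report.<|im_end|>\n"
    "<|im_start|>assistant\n(first model's reply, cut off mid-sentence)<|im_end|>\n"
    "<|im_start|>user\nContinue.<|im_end|>\n"
    "<|im_start|>assistant\n"
)
print(raw_prompt)
```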

Since you probably maxed out the context of the smaller model with your own inputs, underestimating max_tokens, and not so much what the AI generated…you can just ask again.

Because you can’t switch models midstream, it’s just not possible.

While you can absolutely know the response token limit ahead of time, you cannot know how many tokens the model would want to use for a response until it gets there.

So, the only way to have the 16k-model continue from where the 4k-model ended would be to take the inputs and outputs from the 4k-model and use them as inputs to the 16k-model.

Now, if you mean something different, like starting out using the 4k-model, then when the context length reaches 4k tokens continuing the chat with the 16k-model… Well, that’s easy enough to do and something anyone could have done at any time via the API, so I assumed that’s not what you meant.
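So the hand-off is literally just resending the history, something like this (a sketch assuming the pre-1.0 openai package, with messages_4k and response_4k carried over from the smaller model’s call; the continuation wording is an arbitrary choice):

```python
# Hand the truncated conversation to the larger model by resending everything,
# plus the partial reply and an explicit request to continue.
import openai

# messages_4k / response_4k are assumed to come from the earlier 4k-model call.
truncated_reply = response_4k["choices"][0]["message"]["content"]

messages_16k = messages_4k + [
    {"role": "assistant", "content": truncated_reply},
    {"role": "user", "content": "Please continue exactly where you left off."},
]

response_16k = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-16k",
    messages=messages_16k,
    max_tokens=2048,
)
```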

That’s what @elmstedt said.
And I totally agree. Although I am sure that with a few million in investment and some time, there is still a way.

Or a little less than $1 mil: a “continue as assistant” flag that doesn’t include the final role closure or additional injection.

I am generally not happy with long output anyway. That’s why I am using multiple workers/agents to get results based on weighted criteria and subcriteria, letting the model give me a confidence value and a score, and working with standard deviation and variance at the software level to enhance the results.
Also, letting another agent find which of two given sources (chunks of software, yay) best fits the output is a way to get better results.

Like in partial credit assignment or psychological assessments.
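One way to read that setup, purely as a sketch (the scores, confidences, weighting, and threshold below are all made up for illustration): run several scorer agents, weight their scores by their self-reported confidence, and use the spread to decide whether to trust the result.

```python
# Hypothetical aggregation of several agent runs, each returning (score, confidence).
from statistics import pstdev

agent_results = [(0.82, 0.9), (0.78, 0.7), (0.91, 0.6)]  # (score, confidence), illustrative

weighted = sum(s * c for s, c in agent_results) / sum(c for _, c in agent_results)
spread = pstdev(s for s, _ in agent_results)

print(round(weighted, 3), round(spread, 3))
if spread > 0.1:   # arbitrary threshold: high disagreement -> re-run or review
    print("Agents disagree; collect more runs or escalate for review.")
```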

But it has a price.

Yes. Anyone can do what I’m suggesting and could implement it to save some $$. My goal with this topic was to:

  1. Bring the idea of an adaptive model to the community for discussion and hope that OpenAI feels it’s a feature they should/could add.
  2. Propose a simple solution to accomplish the goal of optimizing token spend.

Lovely idea but ultimately doesn’t make sense. Here’s why:

When you send off your payload to the OpenAI API, it first estimates the required context length by counting the input tokens as well as the “max_tokens” parameter you’ve supplied. (Additionally, you may notice “hidden” tokens which you don’t see; these format the layout of the request for the raw model itself. They look like ‘<|im_start|>’ etc. and can be seen here: https://tiktokenizer.vercel.app/)

If you exceed the expected block size (a.k.a. context / sequence length / T value, etc.) for that model (8k, 16k, 32k), it will reject your request with a useful error. Note that the request itself is rejected; it’s never actually sent to the model.

The workaround is to count your token + max_tokens usage in advance and determine which model to select prior to sending off the payload (which you’ve said you already do, great work!).
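A sketch of that pre-flight selection (the context sizes, the shared tiktoken encoding, and ignoring the few extra formatting tokens per message are all simplifying assumptions):

```python
# Pick the smallest model whose context window fits prompt tokens + max_tokens.
import tiktoken

CONTEXT_WINDOWS = {          # illustrative; adjust to the models you have access to
    "gpt-3.5-turbo": 4096,
    "gpt-3.5-turbo-16k": 16384,
    "gpt-4-32k": 32768,
}

def pick_model(messages, max_tokens):
    enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
    prompt_tokens = sum(len(enc.encode(m["content"])) for m in messages)
    for model, window in CONTEXT_WINDOWS.items():
        if prompt_tokens + max_tokens <= window:
            return model
    raise ValueError("Prompt too large for any available model")
```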

Now the explanation:

The reason your idea makes no sense from a technical perspective is that once computation starts on one model, it’s not transferrable to another model. They are independent programs - they’re specialised at their specific context length.

What your requirement states is that you wish to run the computation and have an alternative model complete the calculation, but this doesn’t make sense, as the two models speak a completely separate language from each other.

The cheapest option overall is to continue what you’re already doing. I think the only way it would make sense is to have a flag on the API to allow the higher-context model to be selected if required, as users will be expecting certain responses which either model will provide. For example, the latest version of these models had a few hiccups regarding the new function-calling semantics, and the models became erratic as a result.

If the solution were implemented exactly as you stated (definitely makes sense to me from a non-technical perspective), I can only imagine you would be charged more than what would be reasonably expected if the 16k model were chosen in the first place.

Hopefully that makes sense?

Makes sense.

Here’s why I still think it makes sense for most users.

  • A lot of users don’t use or need the full 8k to get what they want, so why pay for a 16k model?
  • I don’t think the computation has to be transferable. You just have to send the entire history to the new model. I would expect the response to be very similar and to solve the problem the prompt is asking for, regardless of the model’s context size. Especially with a 16k-context prompt you are past “few-shot” and into “many-shot” territory. At this point you are effectively fine-tuning the model for what you want.