Request: Query for a model's max tokens

When working with the OpenAI models endpoint, it would be quite nice to be able to directly query a model's maximum number of tokens.

This would avoid hard coding each model's max token value to compare against my own tokenized version of a user's input prior to submission, which I do to prevent users from submitting prompts to OpenAI that exceed the model's context length.
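
To make the intent concrete, here is a minimal sketch of that client-side check, assuming the tiktoken package and a hand-maintained limit for the model (the helper name and the reply reserve are illustrative only):

import tiktoken

# Hand-maintained limit; exactly the hard coding I would like to avoid.
MODEL_MAX_TOKENS = {"gpt-3.5-turbo": 4097}

def fits_in_model(model: str, text: str, reserved_for_reply: int = 500) -> bool:
    """Return True if the tokenized prompt still leaves room for a reply."""
    encoding = tiktoken.encoding_for_model(model)
    prompt_tokens = len(encoding.encode(text))
    return prompt_tokens + reserved_for_reply <= MODEL_MAX_TOKENS[model]

user_input = "example prompt " * 2000
if not fits_in_model("gpt-3.5-turbo", user_input):
    print("Prompt too long for the selected model")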

Edit -
I just queried 'gpt-3.5-turbo' with an input of 5975 tokens, intending to produce an error stating that I had exceeded the documented limit of 4096 tokens. Instead, the error revealed I had been awarded an extra token, since the model's maximum context length is apparently 4097. Thank you OpenAI :smiley:

message: "This model's maximum context length is 4097 tokens. However, your messages resulted in 5975 tokens. Please reduce the length of the messages."
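
For what it's worth, that message is regular enough to parse programmatically; a minimal sketch, assuming the wording stays in this form:

import re

error_message = (
    "This model's maximum context length is 4097 tokens. "
    "However, your messages resulted in 5975 tokens. "
    "Please reduce the length of the messages."
)

match = re.search(r"maximum context length is (\d+) tokens", error_message)
if match:
    max_context = int(match.group(1))
    print(max_context)  # 4097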

Maybe. But the token limit is fixed for each model, so if you know which model you are using, you know how many tokens it accepts. Hard coding the model size causes no real software offense.

There is no offence, but there are nuisances with a hard-coding-based approach:

  1. It would be nice to query an endpoint after selecting a model from /models, to know ahead of time what that model's input limits are. The application can then tokenize the user's input to determine the number of tokens consumed before sending off queries, leading to an overall reduction in bandwidth.

  2. Dynamic model selection. Let's say a user inputs text whose token count is larger than the default model's limit, but you know other models support the same features with longer context lengths. Your application could select another compatible model on the fly from the list of available models, even if that model isn't in your hard-coded list (see the sketch after this list).

  3. You can already make a bogus query to an endpoint to get the limit: just send a number of tokens that would overflow the largest model, and the error message will state how many tokens that model can accept.
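
A minimal sketch of point 2, assuming a hand-made table of context lengths and a token count produced by the application's own tokenizer (the table values and helper names here are illustrative only):

# Illustrative context lengths; ideally these would come from the API instead.
CHAT_MODELS = {
    "gpt-3.5-turbo": 4097,
    "gpt-3.5-turbo-16k": 16385,
}

def pick_model(prompt_tokens: int, reserved_for_reply: int = 500) -> str:
    """Pick the smallest compatible model whose context fits prompt plus reply."""
    needed = prompt_tokens + reserved_for_reply
    for model, context in sorted(CHAT_MODELS.items(), key=lambda kv: kv[1]):
        if needed <= context:
            return model
    raise ValueError("No available model can fit this prompt")

print(pick_model(3000))  # gpt-3.5-turbo
print(pick_model(6000))  # gpt-3.5-turbo-16k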

Edit-

Going even deeper, models such as text-davinci-edit-001 do not have their token limit stated anywhere, though it could be assumed to be the same as davinci-3.

If the token limits are fixed per model, I think all three of your scenarios are still satisfied by knowing each model's token length.

If you come across a model that you are not familiar with and don't know its length, then according to your 2nd scenario you have bigger problems, like not knowing what features it has. Which, as you say, you would otherwise "know".

We code against fixed APIs and query for variability. Consider the token length part of the fixed API, which, to all intents and purposes, it is.

Finally, I have been watching feature suggestions come and go in this forum. Very few if any are responded to by anyone with the influence to address them. And I can’t say that any have resulted in an actual change. So you may also consider the pragmatism of what I suggest.

I hope this helps :four_leaf_clover::+1:t3:

We are trying to build a data pipeline where the OpenAI API would be used to generate embeddings for our data analysis. In our case, as said in the first comment, we would switch between models depending on our incoming data, and I see OpenAI updating the models and changing the token limits, so hard coding the limit would be troublesome.

Such an API would be highly useful.

I haven't seen OpenAI randomly changing the context window length / number of max tokens on the API side, yet. Such a change would likely have a massive impact on many deployed solutions and would be an unexpected move.
You should be able to create a config file with the model specifications, and if this highly unlikely event should occur, at least you can make your changes very easily.

Also, note the 'edit' in OP's post, where they trigger an error message from the API that reports the max tokens for the particular model. You can run something like this regularly, extract the values, and you will know when you need to update your app's config.
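
Something along those lines could run on a schedule; a sketch only, assuming the openai v1 Python SDK and that the error wording keeps its current form (the exception class and message format may differ between SDK and API versions):

import re
from openai import OpenAI, BadRequestError

client = OpenAI()

def probe_max_context(model: str) -> int | None:
    """Deliberately over-ask so the API error reveals the model's context length."""
    try:
        client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": "hi"}],
            max_tokens=10_000_000,  # absurdly large on purpose
        )
    except BadRequestError as err:
        match = re.search(r"maximum context length is (\d+) tokens", str(err))
        if match:
            return int(match.group(1))
    return None

print(probe_max_context("gpt-3.5-turbo"))

Comparing the probed value against the stored config value tells you when the config needs updating.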

But, as I said, this is not a likely scenario. You can focus your efforts on other parts of your project.

+1 for this feature – it’s pretty amazing that I can’t just ask for llm.max_tokens – wait, I can but it doesn’t mean what you think it means :slight_smile:

In this case the model to use is a side-effect of user preferences and the exact data that’s being operated on, so it can’t simply be hard-coded. (And yes, I want the down-stream code to “just work” in future when GPT5 or whatever comes out and that is added to the config.)

Here is chat model data, programmatically extracted from error messages for the models listed by the models endpoint (including probing of fine-tuned models).

Unfortunately, the error reports for 3.5 models differ from those for 4.0 given the same inputs. Getting the API to report the maximum output length restriction on the newer 3.5 models requires an input + max_tokens API call that GPT-4-turbo models would complete instead of erroring out on.

It is just a short list of fixed data, so I just edited the three models affected for this list (instead of writing and publishing even more-informed code for you to bang on the API yourself.)

model_context = {
  "gpt-3.5-turbo": {
    "context": 16385,
    "max_out": 4096
  },
  "gpt-3.5-turbo-0125": {
    "context": 16385,
    "max_out": 4096
  },
  "gpt-3.5-turbo-0301": {
    "context": 4097,
    "max_out": 4097
  },
  "gpt-3.5-turbo-0613": {
    "context": 4097,
    "max_out": 4097
  },
  "gpt-3.5-turbo-1106": {
    "context": 16385,
    "max_out": 4096
  },
  "gpt-3.5-turbo-16k": {
    "context": 16385,
    "max_out": 16385
  },
  "gpt-3.5-turbo-16k-0613": {
    "context": 16385,
    "max_out": 16385
  },
  "gpt-4": {
    "context": 8192,
    "max_out": 8192
  },
  "gpt-4-0125-preview": {
    "context": 128000,
    "max_out": 4096
  },
  "gpt-4-0314": {
    "context": 8192,
    "max_out": 8192
  },
  "gpt-4-0613": {
    "context": 8192,
    "max_out": 8192
  },
  "gpt-4-1106-preview": {
    "context": 128000,
    "max_out": 4096
  },
  "gpt-4-1106-vision-preview": {
    "context": 128000,
    "max_out": 4096
  },
  "gpt-4-32k": {
    "context": 32768,
    "max_out": 32768
  },
  "gpt-4-32k-0314": {
    "context": 32768,
    "max_out": 32768
  },
  "gpt-4-32k-0613": {
    "context": 32768,
    "max_out": 32768
  },
  "gpt-4-turbo-preview": {
    "context": 128000,
    "max_out": 4096
  },
  "gpt-4-vision-preview": {
    "context": 128000,
    "max_out": 4096
  }
}

from which you can extract:

>>> model_context["gpt-4"]["context"]
8192

On models where the context length equals the max output, there is no artificial restriction on output, but you can't actually set max_tokens that high. Instead you'd budget something like 8192 - (max_tokens=2000) = 6192 maximum input tokens you can send, including overhead.
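
For example, a small helper for that budgeting arithmetic, using the model_context table above (the helper name and the output reserve are illustrative):

def max_input_tokens(model: str, max_tokens: int) -> int:
    """Largest prompt (including message overhead) that still leaves room for max_tokens of output."""
    info = model_context[model]
    reply = min(max_tokens, info["max_out"])
    return info["context"] - reply

print(max_input_tokens("gpt-4", 2000))               # 8192 - 2000 = 6192
print(max_input_tokens("gpt-4-0125-preview", 2000))  # 128000 - 2000 = 126000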