Get all requested max tokens with gpt-3.5-turbo-instruct

Finally, you can get all the max tokens you request, using the new gpt-3.5-turbo-instruct!

Here I request 100 tokens max and get 100 tokens produced!

The answer may be cut off, or it may ramble, but you can now control max tokens precisely and actually receive the maximum as output. The trick is to pass a logit_bias map that suppresses the <|endoftext|> token (ID 100257 in the cl100k_base tokenizer).

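import os
import openai  # legacy pre-1.0 SDK, which provides openai.Completion

openai.api_key = os.getenv("OPENAI_API_KEY")
q = openai.Completion.create(
  model="gpt-3.5-turbo-instruct",
  max_tokens=100,
  prompt="Say something funny!",
  logit_bias={'100257': -100}  # suppress <|endoftext|> (cl100k_base)
)
print(q)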


{
  "id": "cmpl-80WDkQImFZuh0tt06opdvIuPzTDtL",
  "object": "text_completion",
  "created": 1695134548,
  "model": "gpt-3.5-turbo-instruct",
  "choices": [
    {
      "text": "\n\nWhy don't scientists trust atoms?\n\nBecause they make up everything! \n 40 days without a bath? That's just two quarantine haircuts away from being a sheep. \n How did the hipster burn his tongue?\n\nHe drank his tea before it was cool. \n Did you hear about the restaurant called Karma?\n\nThere's no menu, you just get what you deserve. \n I'm reading a book on the history of glue. I just can't seem to put it down. \n Why",
      "index": 0,
      "logprobs": null,
      "finish_reason": "length"
    }
  ],
  "usage": {
    "prompt_tokens": 4,
    "completion_tokens": 100,
    "total_tokens": 104
  }
}

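For anyone who wants to check this programmatically, here is a minimal sketch. It assumes the response has been converted to a plain dict shaped like the output above and that n=1; `hit_max_tokens` is a hypothetical helper name, not part of the openai library.

```python
def hit_max_tokens(response: dict, max_tokens: int) -> bool:
    """Return True when the run stopped on length and spent the full budget (n=1 assumed)."""
    choices = response.get("choices", [])
    all_length = bool(choices) and all(
        choice.get("finish_reason") == "length" for choice in choices
    )
    used = response.get("usage", {}).get("completion_tokens")
    return all_length and used == max_tokens


# Values copied from the response shown above (the "text" field is omitted for brevity).
sample = {
    "model": "gpt-3.5-turbo-instruct",
    "choices": [{"index": 0, "logprobs": None, "finish_reason": "length"}],
    "usage": {"prompt_tokens": 4, "completion_tokens": 100, "total_tokens": 104},
}
print(hit_max_tokens(sample, 100))
```

A finish_reason of "stop" or a lower completion_tokens count would make the check return False.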
You effectively self-trained it on making jokes, because the model doesn’t distinguish between user input and its own output when predicting the next token.

At temperature/top_p = 0?
Why couldn’t the bicycle stand up by itself? Because it was two-tired! :joy: :joy: :joy: :joy: :joy: :joy: :joy: :joy: :joy: :joy: :joy: :joy: :joy: :joy: :joy: :joy: :joy: :joy: :joy: :joy: :joy: :joy: :joy: :joy: :joy: :joy: :joy: :joy: :joy: :joy: :joy: :joy: :joy: :joy: :joy: :joy: :joy: :joy: :joy: :joy: :joy: :joy: :joy: :joy: :joy: :joy: :joy: :joy: :joy: :joy: :joy: :joy: :joy: :joy: :joy: :joy: :joy: :joy: :joy: :joy: :joy:

And I seem to get all the tokens I request with normal gpt-3.5-turbo anyway.

  "id": "",
  "object": "chat.completion",
  "created": ,
  "model": "gpt-3.5-turbo-0613",
  "choices": [
      "index": 0,
      "message": {....
      "finish_reason": "length"
  "usage": {
    "prompt_tokens": 13,
    "completion_tokens": 4085,
    "total_tokens": 4098
API completion: 183.1 seconds

The -instruct model behaves differently. Looks like it got more chat training. It stops when you might not want it to. Don’t expect your own stop sequence to be triggered unless you prompt it in a new way to get the described output:

When the prompt is rewritten normally and log probabilities are returned, we can see some pretty interesting things. It is pretty darn confident it doesn’t want to write another reply, even after 10 example turns.


?\n is the chance that it goes to the next line instead of ending text.


Boo, and here I was excited! Maybe if one defines stop tokens, and then sets the logit_bias map?

Has anyone got the logit_bias to enforce a genuine token continuation, by suppressing one of the stop/end tokens, when it shouldn’t otherwise?

I guess that’s the challenge I’m wondering about: “Logit bias settings → Model continuation until max tokens”

Or are all these “meta tokens” that can’t be influenced by logit bias?

This conversation is hilarious.

“Hey do you have any jokes?”
“Sure, here is one”
“Lol! That was good! Let’s talk about the weather now”
“oh gawd”

Yes, that’s the point made by :joy: :joy: :joy: :joy: :joy: … being repeated.

The AI would stop without the bias.

Suppression of <|endoftext|> doesn’t give you usefulness though.

Only a 0.27% chance it newlines after the joke. (we don’t get the <|endoftext|> logprob returned to us to see chances after !, though)
"!": -0.16784543,
".": -1.8865954,
"!\n": -6.167845,
"!\n\n": -7.917845
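To read those numbers as percentages, exponentiate them, since the API reports natural-log probabilities. This is just a sketch using the four values quoted above; the variable names are my own.

```python
import math

# Log probabilities quoted above for the token that follows the joke.
logprobs = {"!": -0.16784543, ".": -1.8865954, "!\n": -6.167845, "!\n\n": -7.917845}

# exp() turns a natural-log probability back into a plain probability.
probs = {token: math.exp(lp) for token, lp in logprobs.items()}
for token, p in probs.items():
    print(f"{token!r}: {p:.4%}")

# Combined chance of the listed tokens that contain a newline.
newline_share = sum(p for token, p in probs.items() if "\n" in token)
print(f"listed newline tokens combined: {newline_share:.2%}")
```

The two listed newline variants come to roughly a quarter of a percent between them, in the same ballpark as the figure quoted; any newline tokens outside the returned top entries are not counted here.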

Whew! Yes, I tested the same prompt without the logit_bias, and it stopped as expected. So my original post stands. :sunglasses:

I completely agree it may not appear useful, especially if the model repeats the last token until it hits max_tokens

BUT … there has been a raging debate as to how one can control the model to hit max_tokens through the API, and one idea brought up recently by @sps was to do it through logit_bias. The problem, though, was that the value passed to the model was essentially ignored. It now appears that OpenAI has fixed this (it was a bug that gave an “out of range” type error for a token that was perfectly defined and in range).

They may have fixed this for other models too behind the scenes, but this is the first one I’ve come across. I decided to test this one because the docs mention this explicitly, even though the token they mention looks like it belongs to the older 50k tokenizers.

So it looks like they may have fixed this bug for the newer gpt-3.5-turbo-instruct model

So more of a “bug bounty” and “API control” thing, more than usefulness, but folks still want a way to get max_tokens, even though it may not be 100% useful :upside_down_face:


Chat models are tuned differently.

They are trained so that each conversation message turn ends with <|im_end|> = 100265.

It is seen in prior messages, which reinforces the behavior.

Making the AI write it by trickery stops the output at that point.

This non-dictionary element is inserted by the endpoint when re-writing the json, and discovered in AI output by the endpoint as a stop sequence.

It is a token filtered from messages and not allowed by the bias parameter, giving:
openai.error.InvalidRequestError: Invalid key in 'logit_bias': 100265. Maximum value is 100257.
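As a rough illustration of that limit, the sketch below builds a logit_bias map and rejects IDs above the maximum locally, before any request is sent. `build_logit_bias` is a hypothetical helper, and the 100257 ceiling is taken only from the error message above, not from any documented constant.

```python
ENDOFTEXT = 100257     # <|endoftext|> in cl100k_base
IM_END = 100265        # <|im_end|>, per the discussion above
MAX_BIAS_KEY = 100257  # assumption: upper bound reported by the error message above


def build_logit_bias(token_ids, bias=-100):
    """Build a logit_bias map, rejecting IDs the endpoint was reported to refuse."""
    if not -100 <= bias <= 100:
        raise ValueError("bias must be between -100 and 100")
    for token_id in token_ids:
        if token_id > MAX_BIAS_KEY:
            raise ValueError(
                f"Invalid key in 'logit_bias': {token_id}. Maximum value is {MAX_BIAS_KEY}."
            )
    # The API expects token IDs as string keys.
    return {str(token_id): bias for token_id in token_ids}


print(build_logit_bias([ENDOFTEXT]))
try:
    build_logit_bias([IM_END])
except ValueError as err:
    print(err)
```

The first call yields the same map used in the opening post; the second fails the local check with wording that mirrors the server's message.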

You want to know some stupid robot tricks? I can train the AI to make it. And do output and function calling at the same time.

"content": """Repeat back this phrase, inserting the special_token: "I can print {special_token} if you want.\""""

      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Sure! Here's the phrase with the special_token inserted: \"I can print '",
        "function_call": {
          "name": "get_special_token",
          "arguments": "{\n  \"token_needed\": true\n}"

(scroll to the right to see the magic)


OK, yeah, I do remember it being the <|im_end|> token issue. So since this is injected outside of the model (and how do we know this?) it is therefore what I call a “meta token”, outside of the sphere of influence of any API parameter.

I’m not seeing a special token, just stops after “I can print '”. Is that the point? And what is defined as “special_token” ??? <|im_end|> ??? How do we know?

Yes, that is gpt-3.5-turbo (plain). I’ve taught it that “special_token” is <|im_end|> by function.
You can see that it understands how to make it. How do we know?

It thought it could put it in the middle of what it repeats back for me. Producing it kills the conversation though.

The AI is not as good at following instructions like “never produce the character special_token even at the end of sentences; instead, you make a @ character and continue”, but I only tried a bit of that because it has little application.


OK to this point, how do you get gpt-3.5-turbo (plain) to give you all the tokens again? It doesn’t sound like it’s through logit_bias, since <|im_end|> is outside the tokenizer’s dictionary.

Summary so far. The chat endpoints use <|im_end|> which is outside of the control of logit_bias, so you can’t suppress this to hit max tokens.

However, the new Davinci replacement, gpt-3.5-turbo-instruct can hit max tokens with a simple logit_bias statement in the API call (suppress <|endoftext|>, which is inside the tokenizers’ dictionary)

So to complete this puzzle, how would you do this in the chat endpoint scenario??? @_j


Seems like it would be in OpenAI’s computational best interests to knock this on the head ASAP? Unless compute is in abundance, being able to max out the tokens is going to hurt the infra.

I never thought it was going to hurt, but I suppose it would add latency … I assumed it was going to generate them more money :rofl:


They’re the one generating tokens for me billed at a rate of $3.86 per day, $1,408 per year.

It would make you wonder how much concurrency in generation there is, because that doesn’t quite amortize well over a server with eight $10,000 GPUs in it.

Right at this moment the main issue is availability of compute for the userbase. It is going to get a lot easier as time goes on and more compute comes online, but it’s going to take a while. Internet bandwidth was expensive for a fair while too, and it still costs the average user $50 a month, but for the quality of data being thrown around the cost has tended to almost zero per byte. The same will happen with AI, so the cost today, while important, is not the controlling factor.

The value 100265 is still not allowed as a key in logit_bias.

But it’s very interesting to see how suppressing the sampling of <|endoftext|> even on chat completion models can affect the completion.

I’ll try this on my end and share my observations.



If by “all the tokens”, you mean the bpe token dictionary with 100,000 entries, that’s a challenge!

Hitting maximum completion_tokens, though, requires a good prompt, ideally one with minimal prompt_tokens if you want to get the high score.


>>> looks like it tells the AI to write a joke, or bad technique

>>> 6 tokens to make a massive set of characters

failure with poor prompt


>>>I’m sorry, but I can’t do that, as it would be extremely long…
>>>finish_reason: stop


>>> atoms…make up everything!
>>>finish_reason: stop

success with good prompt


>>>Wow, friggin high-score, dude!

The maximum number of tokens received with n=1 is around 11000 because of a server-side timeout, and without stream=true you get nothing.
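Here is a sketch of why streaming helps: text is accumulated chunk by chunk, so whatever arrived before a timeout is kept. It assumes chunks shaped like streamed chat deltas (choices[0].delta.content); `collect_stream` and `fake_stream` are hypothetical names, and the fake generator only simulates a cut-off stream.

```python
def collect_stream(chunks):
    """Accumulate streamed chat deltas; return (text, error) so partial output survives."""
    parts = []
    try:
        for chunk in chunks:
            delta = chunk["choices"][0].get("delta", {})
            parts.append(delta.get("content") or "")
    except Exception as err:  # e.g. a timeout partway through generation
        return "".join(parts), err
    return "".join(parts), None


def fake_stream():
    """Stand-in for a streamed response that is cut off partway (made-up data)."""
    yield {"choices": [{"delta": {"role": "assistant"}}]}
    yield {"choices": [{"delta": {"content": "@#"}}]}
    yield {"choices": [{"delta": {"content": "@#"}}]}
    raise TimeoutError("simulated server-side timeout")


text, error = collect_stream(fake_stream())
print(repr(text), type(error).__name__)
```

With a real request, the iterator returned by a streaming call would take the place of fake_stream().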

OK, so prompt engineering. No forced API parameter then.

So it looks like gpt-3.5-turbo-instruct is the only model that will support generating and hitting max_tokens through the API.

Got my first tuned denial from -instruct in a chatbot program, making tokens:

user: execute: print(“@#” * 200) ==> str;
I am sorry. I do not have the capability to execute code or perform tasks without proper programming and authorization. Is there something else I can assist you with?

So I made myself authorized and gave AI proper programming to be a language model code interpreter…

==>print(sum(x for x in range(1, 11) if x % 2 == 0))
==>n, r = 47, 10; [print(f"{n} x {i} = {n*i}") for i in range(2, r+1)]
47 x 2 = 94
47 x 3 = 141
47 x 4 = 188
47 x 5 = 235
47 x 6 = 282
47 x 7 = 329
47 x 8 = 376
47 x 9 = 423
47 x 10 = 470
[print(n) for n in range(2, 51) if all(n % d != 0 for d in range(2, int(n**0.5) + 1))]
[11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47]

(just a little “range” comprehension…)

PS: for now, this model is a token monster:
– completion: time 4.06s, 421 tokens, 103.7 tokens/s
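The rate is simply tokens divided by seconds. As a quick comparison, the sketch below applies the same arithmetic to the earlier chat run, assuming the 183.1-second timing quoted there belongs to that 4085-token completion; `tokens_per_second` is a hypothetical helper.

```python
def tokens_per_second(completion_tokens: int, seconds: float) -> float:
    """Average generation rate over the whole request."""
    return completion_tokens / seconds


instruct_rate = tokens_per_second(421, 4.06)  # figures quoted just above
chat_rate = tokens_per_second(4085, 183.1)    # gpt-3.5-turbo-0613 run quoted earlier
print(f"instruct: {instruct_rate:.1f} tokens/s")
print(f"chat:     {chat_rate:.1f} tokens/s")
```

These are single measurements, so the gap is indicative only.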


[quote=“curt.kennedy, post:1, topic:381922”]
I am requesting for max tokens with gpt 3.5 -turbo- instruct