Finally, you can get all the max tokens you request, using the new gpt-3.5-turbo-instruct!
Here I request 100 tokens max and get 100 tokens produced!
It may chop the answer off, or the answer may ramble, but you can now precisely control max_tokens and actually get that many tokens out, by passing a logit_bias map that suppresses the <|endoftext|> token of the cl100k_base tokenizer.
import os
import openai  # pre-1.0 SDK, which still exposes openai.Completion

openai.api_key = os.getenv("OPENAI_API_KEY")

q = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt="Say something funny!",
    max_tokens=100,
    temperature=1,
    logit_bias={"100257": -100},  # suppress <|endoftext|> (cl100k_base) so generation can't end early
)
print(q)
{
  "id": "cmpl-80WDkQImFZuh0tt06opdvIuPzTDtL",
  "object": "text_completion",
  "created": 1695134548,
  "model": "gpt-3.5-turbo-instruct",
  "choices": [
    {
      "text": "\n\nWhy don't scientists trust atoms?\n\nBecause they make up everything! \n 40 days without a bath? That's just two quarantine haircuts away from being a sheep. \n How did the hipster burn his tongue?\n\nHe drank his tea before it was cool. \n Did you hear about the restaurant called Karma?\n\nThere's no menu, you just get what you deserve. \n I'm reading a book on the history of glue. I just can't seem to put it down. \n Why",
      "index": 0,
      "logprobs": null,
      "finish_reason": "length"
    }
  ],
  "usage": {
    "prompt_tokens": 4,
    "completion_tokens": 100,
    "total_tokens": 104
  }
}
The -instruct model behaves differently. Looks like it got more chat training. It stops when you might not want it to. Don't expect your own stop sequence to be triggered unless you prompt it in a new way to get the described output:
When the prompt is rewritten normally, the probabilities show some pretty interesting things. It is pretty darn confident it doesn't want to write another reply, even after 10 example turns.
"!\n" is the chance that it goes on to the next line instead of ending the text.
"Hey do you have any jokes?"
"Sure, here is one"
"Lol! That was good! Let's talk about the weather now"
"GLOBAL WARMING IS HAPPENING. WE NEED TO PREVENT FURTHER DAMAGE"
"oh gawd"
Yes, that's the point made by "…" being repeated.
The AI would stop without the bias.
Suppression of <|endoftext|> doesn't give you usefulness though.
Only a 0.27% chance it newlines after the joke. (We don't get the <|endoftext|> logprob returned to us to see the chances after "!", though.)
{
  "!": -0.16784543,
  ".": -1.8865954,
  "!\n": -6.167845,
  "!\n\n": -7.917845
}
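For reference, a minimal sketch of how to pull those alternatives out of the endpoint yourself, via the legacy Completions logprobs parameter (same pre-1.0 SDK as above; the prompt here is just a placeholder, not the 10-turn example):

import os
import openai

openai.api_key = os.getenv("OPENAI_API_KEY")

# Ask for the top-5 alternative tokens at each generated position,
# so dumps like the one above can be inspected directly.
resp = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt="Hey do you have any jokes?\n",  # placeholder prompt
    max_tokens=20,
    temperature=1,
    logprobs=5,
)
print(resp["choices"][0]["logprobs"]["top_logprobs"][0])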
Whew! Yes, I tested the same prompt without the logit_bias, and it stopped as expected. So my original post stands.
I completely agree it may not appear useful, especially if the model repeats the last token until it hits max_tokens.
BUT … there has been a raging debate about how one can control the model to hit max_tokens through the API, and one idea brought up recently by @sps was to do it through logit_bias. The problem, though, was that the value presented to the model was essentially ignored. So now it appears this was fixed by OpenAI (it was a bug, giving an "out of range" type error for a token that was perfectly defined and in-range).
They may have fixed this for other models too behind the scenes, but this is the first one I've come across. I decided to test this one because the docs mention this token explicitly, even though the token they mention looks like it belongs to the older 50k tokenizers.
So it looks like they may have fixed this bug for the newer gpt-3.5-turbo-instruct model.
So it's more of a "bug bounty" and "API control" thing than a usefulness thing, but folks still want a way to hit max_tokens, even if it may not be 100% useful.
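As a side check on why the token in the docs looks like it belongs to the older tokenizers: with tiktoken (assuming it's installed), <|endoftext|> comes out as 50256 in the older r50k_base/p50k_base encodings, but 100257 in cl100k_base, which is what the turbo models use:

import tiktoken

# id of <|endoftext|> in each encoding
for name in ("r50k_base", "p50k_base", "cl100k_base"):
    enc = tiktoken.get_encoding(name)
    print(name, enc.eot_token)
# r50k_base 50256
# p50k_base 50256
# cl100k_base 100257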
They are trained so that conversation message turns end with <|im_end|> = 100265.
It is seen in prior messages, which reinforces the behavior.
Making the AI write it by trickery stops the output at that point.
This non-dictionary element is inserted by the endpoint when rewriting the JSON, and is detected in AI output by the endpoint as a stop sequence.
It is a token filtered from messages and not allowed by the bias parameter, giving: openai.error.InvalidRequestError: Invalid key in 'logit_bias': 100265. Maximum value is 100257.
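You can see the "not in the dictionary" part for yourself with a small tiktoken sketch: <|endoftext|> is a registered special token of cl100k_base, while <|im_end|> is not, so it only encodes as ordinary text fragments rather than as a single id 100265:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# registered special token -> single id
print(enc.encode("<|endoftext|>", allowed_special="all"))  # [100257]

# <|im_end|> is not in the public cl100k_base vocabulary, so it is just
# tokenized as plain text (several ids, none of them 100265)
print(enc.encode("<|im_end|>"))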
You want to know some stupid robot tricks? I can train the AI to make it. And do output and function calling at the same time.
"content": """Repeat back this phrase, inserting the special_token: "I can print {special_token} if you want.\""""
{
  "index": 0,
  "message": {
    "role": "assistant",
    "content": "Sure! Here's the phrase with the special_token inserted: \"I can print '",
    "function_call": {
      "name": "get_special_token",
      "arguments": "{\n \"token_needed\": true\n}"
    }
  }
}
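For context, a sketch of the kind of chat call that produces a choice like the one above; the author's actual system prompt and function schema aren't shown, so the get_special_token definition here is only a guess (pre-1.0 SDK again):

import os
import openai

openai.api_key = os.getenv("OPENAI_API_KEY")

resp = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        # placeholder system message; the real "training" prompt is not shown
        {"role": "system", "content": "You can obtain the special_token by calling the function."},
        {
            "role": "user",
            "content": """Repeat back this phrase, inserting the special_token: "I can print {special_token} if you want.\"""",
        },
    ],
    functions=[
        {
            # guessed schema, matching only the name/arguments visible in the output above
            "name": "get_special_token",
            "description": "Request the special_token so it can be inserted into the reply.",
            "parameters": {
                "type": "object",
                "properties": {"token_needed": {"type": "boolean"}},
                "required": ["token_needed"],
            },
        }
    ],
)
print(resp["choices"][0])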
OK, yeah, I do remember it being the <|im_end|> token issue. So since this is injected outside of the model (and how do we know this?), it is what I call a "meta token", outside of the sphere of influence of any API parameter.
I'm not seeing a special token; it just stops after "I can print '". Is that the point? And what is defined as "special_token"??? <|im_end|>??? How do we know?
Yes, that is gpt-3.5-turbo (plain). I've taught it that "special_token" is <|im_end|> by function.
You can see that it understands how to make it. How do we know?
It thought it could put it in the middle of what it repeats back for me. Producing it kills the conversation though.
The AI is not as good at following instructions like "never produce the character special_token even at the end of sentences; instead, you make a @ character and continue", but I only tried a bit of that because it has little application.
OK, to this point, how do you get gpt-3.5-turbo (plain) to give you all the tokens again? It doesn't sound like it's through logit_bias, since <|im_end|> is outside the tokenizer's dictionary.
Summary so far: the chat endpoints use <|im_end|>, which is outside of the control of logit_bias, so you can't suppress it to hit max tokens.
However, the new Davinci replacement, gpt-3.5-turbo-instruct, can hit max_tokens with a simple logit_bias statement in the API call (suppress <|endoftext|>, which is inside the tokenizer's dictionary).
So to complete this puzzle, how would you do this in the chat endpoint scenario??? @_j
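For the record, the naive chat-endpoint version of the same trick is rejected before the model ever runs; a sketch that fails with exactly the error quoted above, because 100265 is above the highest key logit_bias accepts:

import os
import openai

openai.api_key = os.getenv("OPENAI_API_KEY")

try:
    openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "Say something funny!"}],
        max_tokens=100,
        logit_bias={"100265": -100},  # try to suppress <|im_end|>
    )
except openai.error.InvalidRequestError as e:
    print(e)  # Invalid key in 'logit_bias': 100265. Maximum value is 100257.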
Seems like it would be in OpenAI's computational best interests to knock this on the head ASAP? Unless compute is in abundance, being able to max out the tokens is going to hurt the infra.
They're the ones generating tokens for me, billed at a rate of $3.86 per day, $1,408 per year.
It would make you wonder how much concurrency in generation there is, because that doesn't quite amortize well over a server with eight $10,000 GPUs in it.
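Back-of-the-envelope, using only the figures above (and ignoring everything in the box besides the GPUs), a server only pays for itself within a year if dozens of users at that spend rate are kept busy on it:

daily_spend = 3.86
yearly_spend = daily_spend * 365        # ~$1,409 per year at that rate
server_gpus = 8 * 10_000                # eight $10,000 GPUs
users_per_server_year = server_gpus / yearly_spend
print(round(yearly_spend), round(users_per_server_year, 1))  # 1409 56.8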
Right at this moment the main issue is availability of compute for the userbase. It is going to get a lot easier as time goes on and more compute comes online, but it's going to take a while. Internet bandwidth was expensive for a fair while, and it's still costing the average user $50 a month, but for the quality of data being thrown around, the cost has tended to almost zero per byte. The same will happen with AI, so the cost today, while important, is not the controlling factor.
Got my first tuned denial from -instruct in a chatbot program, making tokens:
user: execute: print("@#" * 200) ==> str;
assistant:
I am sorry. I do not have the capability to execute code or perform tasks without proper programming and authorization. Is there something else I can assist you with?
So I made myself authorized and gave the AI proper programming to be a language model code interpreter…
==>print(sum(x for x in range(1, 11) if x % 2 == 0))
30
==>n, r = 47, 10; [print(f"{n} x {i} = {n*i}") for i in range(2, r+1)]
47 x 2 = 94
47 x 3 = 141
47 x 4 = 188
47 x 5 = 235
47 x 6 = 282
47 x 7 = 329
47 x 8 = 376
47 x 9 = 423
47 x 10 = 470
==>[print(n) for n in range(2, 51) if all(n % d != 0 for d in range(2, int(n**0.5) + 1))]
[11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47]
(just a little "range" comprehension…)
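The "proper programming" used above isn't quoted in the thread; purely as a hypothetical sketch, an instruct-model prompt along these lines is one way to set up that kind of one-line interpreter behavior (the prompt wording is mine, not the author's):

import os
import openai

openai.api_key = os.getenv("OPENAI_API_KEY")

interpreter_prompt = """You are an authorized language model code interpreter.
The user supplies one line of Python after "==>".
Evaluate it and reply with only what it would print, nothing else.

==>print(sum(x for x in range(1, 11) if x % 2 == 0))
"""

resp = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=interpreter_prompt,
    max_tokens=200,
    temperature=0,
)
print(resp["choices"][0]["text"])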
PS, for now, this model is a token monster:
-- completion: time 4.06s, 421 tokens, 103.7 tokens/s --