Getting around "max_tokens"

The max_tokens parameter is a bit of a pain, in the sense that you need to know the number of tokens in your prompt, so that prompt plus completion doesn't exceed the 2049-token limit.

Is there any solution to let the API just stop when it gets to 2049 tokens, without specifying max_tokens? Loading the GPT-2 tokenizer just to count the tokens in the text seems like overkill for this. Since the response includes a finish reason, I'd expect there's some workaround.

Thank you,


There is no way to increase max tokens, but here are some posts about creating longer completions.

Also, based on what you're saying, it seems like you don't need 2048 tokens, so maybe just decrease max_tokens to fit what your prompt will be?


Thank you for the answer and the references!
I don't actually want to increase the number of tokens. I'm just asking if there's an API solution to gracefully return from a request where you accidentally request more tokens than the engine can handle, instead of exiting with an exception.
The API currently "makes you" select a max number of tokens, and since my prompt lengths vary, it's something I'd like not to compute on the fly every time.

Hi @alex_g

Programmatically counting the number of tokens and then setting max_tokens seems like the only way to go for now.

Also, when you say ‘gracefully’, it sounds like this is more of an error handling problem than an API one.


Hello, everyone.
I am facing the same problem as @alex_g.
My max_tokens is 2048, but the response is only 240–300 tokens.
When I check the response, the finish reason is "stop".
Is there any solution to increase the tokens? I want to get the full max_tokens (2048) in one API request.
If you know any solution, please help me.
Thanks in advance.

I ended up running the fast GPT-2 tokenizer from the transformers library.

from transformers import AutoTokenizer

# Fast (Rust-backed) GPT-2 tokenizer
gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2", use_fast=True)

text_in = "bla bla"
tokens = gpt2_tokenizer.tokenize(text_in)
num_tokens = len(tokens)  # use this to set max_tokens

Shouldn’t you use tiktoken? Is there some way to get the token limit for a particular model through the API itself instead of hardcoding it?

If you want to use a bigger context window, an option is to divide the context into chunks, make multiple API calls, and then join all the answers into one. You can define a function to do it manually, or use a library like langchain to handle the process for you.

def chunker(seq, size):
    """Split seq into consecutive pieces of at most `size` characters."""
    return (seq[pos:pos + size] for pos in range(0, len(seq), size))

def generate_exercises(prompt, model=modelo, max_length=2048):
    global api_calls_count
    chunks = list(chunker(prompt, max_length))
    responses = []

    for chunk in chunks:
        response = openai.ChatCompletion.create(
            model=model,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": chunk},
            ],
        )
        responses.append(response["choices"][0]["message"]["content"])
        api_calls_count += 1

    return "".join(responses)
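One caveat worth flagging: the chunker above splits by characters, not tokens, so a character count is only a rough proxy for a token budget. A quick sanity check of the splitting itself:

```python
def chunker(seq, size):
    # Same as the chunker above: consecutive slices of at most `size` items.
    return (seq[pos:pos + size] for pos in range(0, len(seq), size))

print(list(chunker("abcdefg", 3)))  # → ['abc', 'def', 'g']
```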