ChatGPT API maximum token limit

The ChatGPT API documentation says to send back the previous conversation to make it context aware. This works fine for short conversations, but when my conversations get longer I hit the maximum token (4096) error. If that is the limit, how can I still make it context aware despite the length of the messages?
I have seen other ChatGPT clones using the API, and when I tested them with long messages they were context aware. How are they doing it?


They / we use various methods to truncate, summarize, and otherwise ensure the token count stays below the limit.

FYI, chat completions from the API include the token usage numbers, and you can track these in your app as your chat session progresses.

I update and store the token usage numbers in a DB with each API call.
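For example, a minimal sketch of that kind of tracking, assuming the chat completion response is the usual dict with a `usage` field. The sqlite schema and function names here are illustrative, not anything OpenAI or a particular library provides:

```python
import sqlite3

def open_usage_db(path=":memory:"):
    """Create (or open) a tiny log of per-call token usage."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS usage_log ("
        "conversation_id TEXT, prompt_tokens INT, "
        "completion_tokens INT, total_tokens INT)"
    )
    return db

def record_usage(db, conversation_id, response):
    """Store the usage numbers the API returned with this reply."""
    u = response["usage"]
    db.execute(
        "INSERT INTO usage_log VALUES (?, ?, ?, ?)",
        (conversation_id, u["prompt_tokens"],
         u["completion_tokens"], u["total_tokens"]),
    )
    db.commit()

def latest_total(db, conversation_id):
    """Most recent total_tokens for a conversation (0 if none yet)."""
    row = db.execute(
        "SELECT total_tokens FROM usage_log WHERE conversation_id = ? "
        "ORDER BY rowid DESC LIMIT 1",
        (conversation_id,),
    ).fetchone()
    return row[0] if row else 0
```

Before the next call, `latest_total` plus an estimate of the new message gives you the number to compare against the limit.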



Care to share your tricks / methods about how you summarize the messages/convos? :slight_smile:

Hi @tzekid

I am running a bit behind on finalizing a new chatbot based on the just-released chat API method, due to family priorities and “honey do” tasks around the home, so I am not 100% finished yet.

What I have done so far is create two DB tables: one for the conversations and the other for the chat messages and replies.

When the API returns a reply, it provides a full token usage count, and I store this in the DB. Then, when I continue the conversation, I take the token count from the DB and add my estimate of the tokens in the new messages I am sending to the API.

If the total estimated token count is greater than the 4K permitted, I have a number of strategies to consider and test, but I have not had time to fully code and test them yet:

Potential Pruning Strategies

  • Delete “role: system” messages, since they are weak and the conversation is already ongoing (so far I have not had great results with the system role anyway, but I have not fully tested it either).

  • Truncate the messages starting with the oldest (brute force).

  • Use max_tokens and summarize stored messages using a different model.

  • Remove prior “role: assistant” messages.
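The brute-force option above could be sketched like this, assuming a rough 4-characters-per-token estimate (a real app would more likely use a proper tokenizer such as tiktoken). This keeps any system messages and drops the oldest of the rest until the estimate fits:

```python
def estimate_tokens(message):
    """Very rough token estimate: ~4 characters per token."""
    return max(1, len(message["content"]) // 4)

def prune_oldest(messages, budget):
    """Drop the oldest non-system messages until the estimate fits the budget."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    while rest and sum(map(estimate_tokens, system + rest)) > budget:
        rest.pop(0)  # oldest first
    return system + rest
```

The budget would be the context limit minus whatever you want to reserve for the model’s reply.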

What do you think? Any other ideas?

After all, this is one of the most interesting parts of the new chat completion coding challenge, at least in my view, and a good topic for a developers’ forum :+1:




Thanks for the great response! :slight_smile:

I only started playing with the API last night. The only solution I’ve currently spun up is a very simple “summarize convo” function that simply calls the gpt-3.5-turbo API. The resulting convo then has only one system message and one assistant response (containing the convo summary so far).
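A summarize-and-replace step like that might look something like this sketch. The model call is injected as a plain callable so the logic is testable offline; `call_model` is a hypothetical wrapper around the chat completion API, not a real function:

```python
def compress_history(messages, call_model):
    """Replace the whole conversation with one system message holding a summary.

    call_model: callable taking a prompt string and returning the model's text.
    """
    transcript = "\n".join(f'{m["role"]}: {m["content"]}' for m in messages)
    summary = call_model(
        "Summarize this conversation, keeping any facts needed to continue it:\n"
        + transcript
    )
    return [
        {"role": "system",
         "content": "Summary of the conversation so far: " + summary},
    ]
```

In a real app `call_model` would wrap a gpt-3.5-turbo chat completion request; the summarizing prompt itself is the main knob for the compression/accuracy trade-off.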

The problem right now is that a lot of accuracy is lost (it’s worse than ChatGPT’s implementation), which is really annoying for my use case (assisted foreign-language learning).

I’m going to try summarizing the conversation only when it hits, or is about to hit, the 4k total token limit. That’s a great idea. And/or I could try summarizing everything except the last user and last assistant responses.

By the way, wouldn’t it make more sense to remove the prior “role: user” messages? Depending on the use case, with a simple “role: system” prompt, GPT should be able to infer the user’s message or context in its answers, right?

Cheers! :slight_smile:


I’m not sure. It’s something to test, of course.

Keep in mind that generative AI does not really “infer”, so I am not sure what you mean by that. Generative AI is just a fancy autocompletion engine that generates text based on probability. I don’t think that puts it in the “inference” category; it’s just autocompleting, like the annoying text autocompletion engine in your favorite text editor.

Some would even argue that generative AI is not really AI at all.

Some view generative AI as more of a “babbler” than a bona-fide AI :slight_smile:

When I work with all OpenAI models, I only view them as text prediction engines and nothing more.




Thank you guys!

I think checking the available tokens and truncating messages starting from the oldest is the best option. (I tried something like this, but I was only sending the last message along with the current one; it didn’t work out, as it wasn’t always context aware and would still exceed the limit.)
One scenario to take into account: what if the total estimated token count for the current prompt alone is already close to 4k? In that case there may be no room left to add any of the previous conversation, not even the most recent message.

And for summarizing I see two problems:

  1. we may have to send another API request (which may hurt response speed)
  2. the summary could still exceed the limit at some point
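The “prompt alone is already near 4k” scenario above boils down to a budget check before adding any history. A minimal sketch, where the 500-token reserve for the reply is an arbitrary assumption and 4,096 is the gpt-3.5-turbo context limit discussed in this thread:

```python
CONTEXT_LIMIT = 4096  # total context window (prompt + completion)

def history_budget(current_prompt_tokens, reply_reserve=500):
    """Tokens left for prior messages after the new prompt and a reply reserve.

    Returns 0 when the current prompt is already so large that no
    previous conversation can be included at all.
    """
    return max(0, CONTEXT_LIMIT - current_prompt_tokens - reply_reserve)
```

When this returns 0, the app has to fall back to sending the new prompt alone (or summarizing it too).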

I will try the first one and share the results…


After working on this for hours today, I am seeing fatal errors for exceeding 4096 tokens far too often. I am going to experiment with stripping out all assistant replies from the API calls.

This 4096 token limit is VERY restrictive!



I’ve had some success with summarizing in my tests yesterday.

Three things that I noticed:

  1. You need to tweak your “summarizing prompt” to get the compression / accuracy ratio you want.
  2. (In my testing) For longer conversations I used roughly 10% fewer tokens in total compared to sending the whole conversation every time (even though you technically make double the API calls).
  3. You need to tweak the token cut-off point (i.e. when you’re going to summarize the convo) to your use case. For me, I found that for casual conversations I can basically always summarize, and there is next to no loss in response accuracy for quite a while. But for generating and iterating on text (e.g. generating an email or a snippet of code) I need to be smarter about when and how I summarize the convo.

@ruby_coder by “infer” I mean it in the more casual sense, i.e. that the model sometimes repeats the things it finds important in its response, e.g. “Give me 5 facts about Africa” → “Here are 5 facts about Africa…” :slight_smile:

Right now I’m playing around with the “temperature” and “top_p” parameters, trying to figure out what effects they have on the output :man_shrugging:



FYI, in addition to storing conversation history in a local DB, langchain does one additional optimization that’s worth mentioning: it summarizes past history.
So the prompt I am supplying is actually a summary of the past history.
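The same running-summary idea can be sketched without langchain: fold each new exchange into a stored summary, so the prompt you send stays roughly constant in size. Here `summarize` stands in for a model call and is a hypothetical callable, not a library function:

```python
def update_summary(summary, user_msg, assistant_msg, summarize):
    """Fold the latest user/assistant exchange into a running summary.

    summarize: callable taking a prompt string and returning the model's text.
    """
    return summarize(
        f"Previous summary: {summary}\n"
        f"New exchange:\n"
        f"user: {user_msg}\n"
        f"assistant: {assistant_msg}\n"
        "Write an updated summary that keeps all important facts."
    )
```

Each API call then needs only the current summary plus the newest message, instead of the full transcript.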


Hey! After spending hours on this, I simply removed the max token parameter from my request inputs, and it seems to allow for “unlimited” tokens. :joy:

It does not. There is still a default value underneath (4096 minus the prompt tokens) :slight_smile:


Apologies, you’re right… I just realized I had my max token limit set wrong before, which led to very short exchanges… false alarm!


Why doesn’t OpenAI publish some high-level guidance on how they accomplish this with ChatGPT? I’m not even asking for code, just some guidance or pseudo-code. What does the word “open” in OpenAI even mean if they won’t share any insight into what they’re building?


In general, one way to reduce “system” context (for cost, traffic, or limit reasons) is to use embeddings to identify the most relevant sections of the context and use only those in the ChatCompletion query.
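A minimal sketch of that selection step, using plain cosine similarity over precomputed vectors; in practice the vectors would come from an embeddings endpoint, and the section texts and `k` here are just illustrative:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_sections(query_vec, sections, k=2):
    """sections: list of (text, vector) pairs; returns the k most relevant texts."""
    ranked = sorted(sections, key=lambda s: cosine(query_vec, s[1]), reverse=True)
    return [text for text, _ in ranked[:k]]
```

Only the texts returned by `top_sections` then go into the system/context portion of the ChatCompletion request, which keeps the prompt well under the token limit.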

More info in this interesting notebook: