ChatGPT api maximum token

Zeki · March 4, 2023, 12:57am

The ChatGPT API Documentation says send back the previous conversation to make it context aware, this works fine for short form conversations but when my conversations are longer I get the maximum token is 4096 error. if this is the case how can I still make it context aware despite of the messages length?
I have seen other ChatGPT clones using the api and I have tested them with long message and they are context aware, how are they doing it?

ruby_coder · March 4, 2023, 1:03am

They / we use various methods to truncate, summarize and otherwise insure the tokens count is below the limit.

FYI, chat completions from the API contain the token usage numbers and you can track this in your app as your chat session progresses.

I update and store the token usage numbers in a DB with each API call.

tzekid · March 4, 2023, 10:04am

Care to share your tricks / methods about how you summarize the messages/convos?

ruby_coder · March 4, 2023, 10:30am

Hi @tzekid

I am running a bit behind of finalizing a new chatbot based on the new chat API method due to family priorities and “honey do” tasks around the home, so am not 100% finished yet with a new chatbot based on the just released chat API method.

What I have done so far is to create two DB tables, one is for the conversations and the other is for the chat messages and reply.

When the API returns from a reply, it provides a full token usage count and I store this in the DB. Then, when I continue to the conversation, I take the token count in the DB and add my token estimate of my new messages sent to the API.

If the total estimated token count is greater than the 4K permitted, I have a number of strategies to consider and test, but I have not had time yet to fully code and test:

Potential Pruning Strategies

Delete “role: system” messages, since they are weak and the conversation is already ongoing (so far have not had great results with the system role anyway, but I have not fully tested either),
Truncate the messages starting with the oldest (brute force).
Use max_tokens and summarize stored messages using a different model.
Remove prior “role: assistant” messages.

What do you think? Any other ideas?

After all, this is one of the most interesting part of the new chat completion coding challenge, at least in my view, and is a good topic in a forum for developers

Thanks!

tzekid · March 4, 2023, 12:05pm

Thanks for the great response!

I only started playing with the API last night. The only solution I’ve currently spun up is a very simple “summarize convo” function that simply calls the GPT-3.5-Turbo API. The resulting convo then only has one system and one assistant response (with the convo summary so far).

The problem right now is that a lot of accuracy is lost (it’s worse than ChatGPT’s implementation), which is really annoying for my use-case (assisted foreign language learning).

I’m going to try to summarize the conversation only if it hits/might hit the 4k total token limit. That’s a great idea. And/or I could try to summarize everything but the last user and last assistant response.

By the way, wouldn’t it make more sense to remove the prior “role: user” messages? Depending on the use-case, with a simple “role: system” prompt, GPT should infer the user’s message or context in its answers, right?

Cheers!

ruby_coder · March 4, 2023, 12:10pm

I’m not sure. It’s something to test, of course.

Keep in mind that generative AI does not really “infer”, so not sure what you mean about that. Generative AI is just a fancy, autocompletion engine which generates text based on probability. I don’t think that puts it in the “inference” category, it’s just autocompleting like your annoying text autocompletion engine in your favorite text editor.

Some would even argue that generative AI is not really AI at all.

Some view generative AI as more of a “babbler” than a bona-fide AI

When I work with all OpenAI models, I only view them as text prediction engines and nothing more.

HTH

Zeki · March 4, 2023, 1:00pm

Thanks you guys!

I think checking available token and truncating the message starting from the oldest one is the best option (somehow I tried it but I was just sending only the last message with the current one - It didn’t work out tho[not always context aware and would still pass the limit]).
in this case, one scenario to take into account would be;- for example what if the total estimated token count for current prompt is almost like around 4k that means it might not gonna allow you to add any previous conversation(even the last(recent) one).

and for summarizing I see two problems:-

we may have to send another API request (may alter the response speed)
still it could pass the limit at some point

I will try the first one and share the result…

ruby_coder · March 5, 2023, 12:21pm

After working on this for hours today, easily seeing fatal errors for exceeding 4096 way too much. I am going to experiment with stripping out all assistant replies from the API.

This 4096 token limit is VERY restrictive!

tzekid · March 6, 2023, 9:09pm

I’ve had some success with summarizing in my tests yesterday.

Three things that I noticed:

You need to tweak your “summarizing prompt” to get the compression / accuracy ratio you want.
(In my testing) For longer conversations I used ca. 10ish % less tokens in total compared to sending the whole conversation the whole time (even if you technically have double the API calls).
You need to tweak the token cut-off point (i.e. when you’re going to summarize the convo) to your use-case. For me, I found that for casual conversations I can basically always summarize and there’s next to no loss in response accuracy for quite a while — but for generating and iterating on text (e.g. generate an email, or generate a snippet of code) I need to be smarter about when / how I summarize the convo.

@ruby_coder with “infer” I mean that in the more casual way, i.e. that the model sometimes repeats the things it finds important in its response. E.g. “Give me 5 facts about Africa → Here are 5 facts about Africa…”

Right now I’m playing around with the “temperature” and “top_p” parameters and try to figure out what kind of effects they have on the output

Zeki · March 8, 2023, 3:35pm

sanjaymk · March 9, 2023, 6:50pm

fyi, in addition to storing conv history in a local db, langchain does one additional optimization thats worth mentioning. It summarizes past history:
https://langchain.readthedocs.io/en/latest/modules/memory/types/summary_buffer.html
So, the prompt I am supplying is actually a summary of past history.

OrkestraOnline · March 13, 2023, 6:06pm

Hey ! After spending hours on this I simply removed the max token from my request inputs and it seems to allow for “unlimited” tokens.

AgusPG · March 13, 2023, 6:15pm

It does not. There is still a default value underneath (4096 - prompt tokens)

OrkestraOnline · March 13, 2023, 6:48pm

Apologies you’re right… I just realized I had my max token limit set wrong before, which led to very short echanges… false alert !

PixelPenguin · March 26, 2023, 4:19pm

Why doesn’t Open AI publish some high level guidance on how they accomplish this with ChatGPT? I’m not even asking for code, but just some guidance or pseudo-code. Wtf does the word “open” in Open AI mean if they don’t want to share literally any insight into what they’re building.

ahrvoje · July 14, 2023, 8:04am

In general, one way to reduce ‘system’ context (for cost, traffic or limit reasons) is to use embeddings to identify the most relevant sections of the context and use only those for ChatCompletion query.

More info in this interesting notebook:

github.com

openai/openai-cookbook/blob/main/examples/Question_answering_using_embeddings.ipynb

{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "3b0435cb",
   "metadata": {},
   "source": [
    "# Question answering using embeddings-based search\n",
    "\n",
    "GPT excels at answering questions, but only on topics it remembers from its training data.\n",
    "\n",
    "What should you do if you want GPT to answer questions about unfamiliar topics? E.g.,\n",
    "- Recent events after Sep 2021\n",
    "- Your non-public documents\n",
    "- Information from past conversations\n",
    "- etc.\n",
    "\n",
    "This notebook demonstrates a two-step Search-Ask method for enabling GPT to answer questions using a library of reference text.\n",
    "\n",

This file has been truncated. show original

Topic		Replies	Views
I wish that when using the GPT API, it would be possible to have a contextual conversation like chatGPT API	14	7125	December 18, 2023
Need more than a 4097 token call from chat gpt api API	7	3222	November 28, 2023
Getting ChatGPT to Remember Previous Chat Messages Prompting	37	68982	January 29, 2024
ChatGPT seems to exceed the token limit and not lose memory API	3	3722	July 9, 2024
Maintain the context within the 4096 max tokens API	2	2334	February 16, 2024

ChatGPT api maximum token

Potential Pruning Strategies

Related topics