The ChatGPT API documentation says to send back the previous conversation to make it context-aware. This works fine for short conversations, but when my conversations get longer I hit the maximum-token (4096) error. If that is the case, how can I still make it context-aware regardless of the length of the messages?
I have seen other ChatGPT clones using the API, and when I tested them with long messages they stayed context-aware. How are they doing it?
They / we use various methods to truncate, summarize and otherwise ensure the token count stays below the limit.
FYI, chat completions from the API contain the token usage numbers and you can track this in your app as your chat session progresses.
I update and store the token usage numbers in a DB with each API call.
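For illustration, a minimal sketch of that bookkeeping, assuming the openai Python 0.x client; the `db.save_usage()` helper and its column names are placeholders for whatever persistence layer you use:

```python
import openai  # openai-python 0.x style client (openai.api_key set elsewhere)

def chat_and_track(messages, conversation_id, db):
    """Send a chat request and record the token usage the API reports.
    `db` is a hypothetical persistence layer with a save_usage() method."""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=messages,
    )
    usage = response["usage"]  # prompt_tokens, completion_tokens, total_tokens
    db.save_usage(
        conversation_id=conversation_id,
        prompt_tokens=usage["prompt_tokens"],
        completion_tokens=usage["completion_tokens"],
        total_tokens=usage["total_tokens"],
    )
    return response["choices"][0]["message"]
```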
Care to share your tricks / methods about how you summarize the messages/convos?
Hi @tzekid
I am running a bit behind on finalizing a new chatbot based on the just-released chat API method, due to family priorities and "honey do" tasks around the home, so I am not 100% finished yet.
What I have done so far is create two DB tables: one for the conversations and the other for the chat messages and replies.
When the API returns a reply, it provides a full token usage count, and I store this in the DB. Then, when I continue the conversation, I take the token count from the DB and add my token estimate for the new messages I am about to send to the API.
If the total estimated token count is greater than the 4K permitted, I have a number of strategies to consider and test, but I have not had time yet to fully code and test:
Potential Pruning Strategies
- Delete "role: system" messages, since they are weak and the conversation is already ongoing (so far I have not had great results with the system role anyway, but I have not fully tested it either).
- Truncate the messages starting with the oldest (brute force); see the sketch below.
- Use max_tokens and summarize stored messages using a different model.
- Remove prior "role: assistant" messages.
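For what it's worth, here is a minimal Python sketch of the brute-force truncation idea; the 4096 budget, the 500-token reply reserve, and the per-message token estimates are all assumptions to tune:

```python
def prune_messages(messages, token_counts, budget=4096, reserve=500):
    """Drop the oldest non-system messages until the running total fits.
    `token_counts` is a parallel list of per-message token estimates;
    `budget` and `reserve` (room left for the reply) are values to tune."""
    pruned, counts = list(messages), list(token_counts)
    while sum(counts) > budget - reserve and len(pruned) > 1:
        for i, msg in enumerate(pruned):
            if msg["role"] != "system":   # keep the system prompt, drop the oldest turn
                del pruned[i]
                del counts[i]
                break
        else:
            break  # only the system prompt is left; nothing more to drop
    return pruned
```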
What do you think? Any other ideas?
After all, this is one of the most interesting parts of the new chat completion coding challenge, at least in my view, and it is a good topic in a forum for developers.
Thanks!
Thanks for the great response!
I only started playing with the API last night. The only solution I've currently spun up is a very simple "summarize convo" function that simply calls the GPT-3.5-Turbo API. The resulting convo then only has one system message and one assistant response (with the convo summary so far).
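Roughly, that summarize step looks like this (a sketch, not my exact code; the prompt wording and the replacement system message are just placeholders):

```python
import openai

SUMMARY_PROMPT = (
    "Summarize the conversation below. Keep names, facts and any "
    "instructions the user gave. Be concise."
)  # illustrative wording only

def summarize_conversation(messages):
    """Collapse a message list into one system message plus one summary message."""
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in messages)
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": SUMMARY_PROMPT},
            {"role": "user", "content": transcript},
        ],
    )
    summary = response["choices"][0]["message"]["content"]
    return [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "assistant", "content": f"Summary of the conversation so far: {summary}"},
    ]
```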
The problem right now is that a lot of accuracy is lost (it's worse than ChatGPT's implementation), which is really annoying for my use-case (assisted foreign language learning).
I'm going to try to summarize the conversation only if it hits/might hit the 4k total token limit. That's a great idea. And/or I could try to summarize everything but the last user and last assistant response.
By the way, wouldn't it make more sense to remove the prior "role: user" messages? Depending on the use-case, with a simple "role: system" prompt, GPT should infer the user's message or context in its answers, right?
Cheers!
I'm not sure. It's something to test, of course.
Keep in mind that generative AI does not really "infer", so I'm not sure what you mean by that. Generative AI is just a fancy autocompletion engine which generates text based on probability. I don't think that puts it in the "inference" category; it's just autocompleting, like the annoying text autocompletion engine in your favorite text editor.
Some would even argue that generative AI is not really AI at all.
Some view generative AI as more of a "babbler" than a bona fide AI.
When I work with all OpenAI models, I only view them as text prediction engines and nothing more.
HTH
Thanks, you guys!
I think checking the available tokens and truncating the messages starting from the oldest one is the best option (I tried something like that, but I was only sending the last message along with the current one; it didn't work out, though: it wasn't always context-aware and could still exceed the limit).
One scenario to take into account: what if the total estimated token count for the current prompt alone is already close to 4k? In that case you might not be able to add any previous conversation at all, not even the most recent message (see the sketch at the end of this post).
As for summarizing, I see two problems:
- we may have to send another API request (which may slow down the response), and
- it could still exceed the limit at some point.
I will try the first one and share the result…
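A minimal sketch of that budget check, assuming tiktoken for the per-message estimate (the reply reserve and the +4 per-message overhead are rough assumptions):

```python
import tiktoken

ENC = tiktoken.encoding_for_model("gpt-3.5-turbo")

def estimate_tokens(message):
    # rough per-message estimate; the real chat format adds a few tokens of overhead
    return len(ENC.encode(message["content"])) + 4

def fit_history(history, new_message, limit=4096, reply_reserve=512):
    """Prepend as many recent history messages as fit next to the new prompt.
    If the new prompt alone is already close to the limit, no history fits at all."""
    budget = limit - reply_reserve - estimate_tokens(new_message)
    kept = []
    for msg in reversed(history):   # walk backwards from the newest message
        cost = estimate_tokens(msg)
        if cost > budget:
            break
        kept.insert(0, msg)         # re-insert in chronological order
        budget -= cost
    return kept + [new_message]
```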
After working on this for hours today, I am seeing fatal errors for exceeding 4096 tokens far too often. I am going to experiment with stripping out all assistant replies from the message history I send back to the API.
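In other words, something like this simple filter over the stored message list before sending it back (just a sketch of the idea):

```python
def strip_assistant_replies(messages):
    """Keep system and user turns only; drop all prior assistant replies."""
    return [m for m in messages if m["role"] != "assistant"]
```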
This 4096 token limit is VERY restrictive!
I've had some success with summarizing in my tests yesterday.
Three things that I noticed:
- You need to tweak your "summarizing prompt" to get the compression / accuracy ratio you want.
- (In my testing) For longer conversations I used roughly 10% fewer tokens in total compared to sending the whole conversation every time (even though you technically have double the API calls).
- You need to tweak the token cut-off point (i.e. when you're going to summarize the convo) to your use-case. For me, I found that for casual conversations I can basically always summarize and there's next to no loss in response accuracy for quite a while, but for generating and iterating on text (e.g. generate an email, or generate a snippet of code) I need to be smarter about when / how I summarize the convo (see the sketch below).
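As a sketch, the cut-off logic looks roughly like this; it reuses the `summarize_conversation()` sketch from earlier in the thread, and the threshold is just an assumed value to tune:

```python
SUMMARIZE_THRESHOLD = 3000  # assumed cut-off in tokens; tune per use case

def maybe_summarize(messages, tracked_token_count):
    """Once the tracked token count crosses the threshold, summarize everything
    except the most recent user/assistant exchange, which is kept verbatim."""
    if tracked_token_count < SUMMARIZE_THRESHOLD or len(messages) <= 2:
        return messages
    head, tail = messages[:-2], messages[-2:]
    return summarize_conversation(head) + tail  # reuses the earlier sketch
```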
@ruby_coder With "infer" I mean that in the more casual way, i.e. that the model sometimes repeats the things it finds important in its response. E.g. "Give me 5 facts about Africa" → "Here are 5 facts about Africa…"
Right now I'm playing around with the "temperature" and "top_p" parameters and trying to figure out what kind of effects they have on the output.
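For example (the values are just illustrative; the API docs suggest altering only one of the two at a time):

```python
import openai

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Suggest a name for a language-learning app."}],
    temperature=0.2,  # lower = more deterministic, higher (up to 2) = more varied
    top_p=1.0,        # nucleus sampling; docs suggest tweaking this OR temperature, not both
)
print(response["choices"][0]["message"]["content"])
```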
FYI, in addition to storing conversation history in a local DB, LangChain does one additional optimization that's worth mentioning. It summarizes past history:
https://langchain.readthedocs.io/en/latest/modules/memory/types/summary_buffer.html
So, the prompt I am supplying is actually a summary of past history.
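A minimal sketch of that memory type, using the LangChain 0.0.x-era API that the linked page documents (class locations have moved in newer releases):

```python
# LangChain 0.0.x-era imports (the version the linked docs describe);
# newer releases have moved/renamed these classes.
from langchain.chains import ConversationChain
from langchain.llms import OpenAI
from langchain.memory import ConversationSummaryBufferMemory

llm = OpenAI(temperature=0)
memory = ConversationSummaryBufferMemory(llm=llm, max_token_limit=1000)
chain = ConversationChain(llm=llm, memory=memory)

chain.predict(input="Hi, I'm practicing my French today.")
chain.predict(input="Can you quiz me on food vocabulary?")
# Turns that overflow max_token_limit are folded into a running summary,
# so the prompt sent to the model stays bounded.
```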
Hey! After spending hours on this, I simply removed the max token value from my request inputs and it seems to allow for "unlimited" tokens.
It does not. There is still a default value underneath (4096 - prompt tokens)
Apologies, you're right… I just realized I had my max token limit set wrong before, which led to very short exchanges… false alarm!
Why doesn't OpenAI publish some high-level guidance on how they accomplish this with ChatGPT? I'm not even asking for code, just some guidance or pseudo-code. Wtf does the word "open" in OpenAI mean if they don't want to share literally any insight into what they're building?
In general, one way to reduce "system" context (for cost, traffic or limit reasons) is to use embeddings to identify the most relevant sections of the context and use only those in the ChatCompletion query.
More info in this interesting notebook:
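A rough sketch of that embeddings-based selection (the model name and the `top_k_sections()` helper are illustrative, not taken from the notebook):

```python
import numpy as np
import openai  # openai-python 0.x style client

def embed(texts):
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
    return [np.array(d["embedding"]) for d in resp["data"]]

def top_k_sections(question, sections, k=3):
    """Return the k context sections most similar to the question."""
    q_vec = embed([question])[0]
    scores = []
    for vec in embed(sections):
        scores.append(float(np.dot(q_vec, vec) / (np.linalg.norm(q_vec) * np.linalg.norm(vec))))
    ranked = sorted(zip(scores, sections), reverse=True)
    return [text for _, text in ranked[:k]]

# Only the selected sections are pasted into the system message of the
# ChatCompletion request, instead of the entire knowledge base.
```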