Retain past responses in memory without sending them again at every API request

Hello,

I’m trying to use the API for gpt-3.5-turbo and gpt-4 to elicit a number of responses based on a (very long) initial set of instructions followed by individual sentences. The instructions explain to the model what to do with the input sentences. The interaction would look as follows:

  • Very long initial instruction detailing what to do with each input + Input 1
  • → GPT output
  • Input 2
  • → GPT output
  • Input 3
  • → GPT output
    and so on.

In the user interface it’s very easy to do this. However, in the Python API, every time I make a call to openai.ChatCompletion.create it seems to require the full chat history in order to remember things. This obviously eats up a lot of tokens (and, most importantly, money!). The number of tokens is not an issue, as I’m still below the 8k limit for GPT-4, but having to pay to send the full set of instructions as input tokens with every API request (which isn’t required when accessing through the UI) is annoying me.

So my question is: is there a way to use the Python API interactively, a bit like the UI, so that a GPT model remembers previous answers and instructions without having to send the full set of instructions with every single input request, thus wasting unnecessary money?
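
To make the question concrete, here is a minimal sketch of the pattern I mean, using the openai.ChatCompletion.create call mentioned above (the instruction text and input sentences are placeholders):

```python
import openai

INSTRUCTIONS = "Very long initial instruction detailing what to do with each input..."

messages = [{"role": "system", "content": INSTRUCTIONS}]

for sentence in ["Input 1", "Input 2", "Input 3"]:
    messages.append({"role": "user", "content": sentence})
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=messages,  # the full history, instructions included, on every call
    )
    reply = response["choices"][0]["message"]["content"]
    messages.append({"role": "assistant", "content": reply})
    # every iteration re-sends (and re-bills) INSTRUCTIONS plus all prior turns
```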

1 Like

Hello. This is how it works, i.e. you need to append the history to keep context. ChatGPT is likely doing this on the backend too, along with perhaps summarizing very old messages in the chat chain.

So, the best way to battle costs is to be lean and mean with your prompts… keep editing until you get down to the bare minimum you need to get your task done.

1 Like

So no way to do it without having to send the previous chat?

No way.
Think about the logic.

If my app doesn’t require the chat history, then what? Should the model still try to remember it, as if auto-remembered chat history were a default feature? If that happened, the current ChatGPT API workload could easily be 10x higher.

— No, impossible.
— Every time you call the API there is no chat-session ID, as far as I know.
— Even if there were a session ID, where would the chat history be stored? For 30 days? 60 days?
— Would it cost extra money to store the chat history?
— I would still need to opt out of the chat-history feature for the use cases that don’t need it…

So I guess this is what OpenAI is doing now: no automatic chat history, and an 8k token limit per call:
= use it at your own extra cost (not everybody else’s), by making the calls and recording your own chat-session history;
= keep the limit at 8k, not 80k, so you still get a good response time.

I would opt for a higher token limit, but not for chat history being on by default, because not every scenario needs the chat history.

If you are using langchain, consider using map-reduce as the chain type, since it summarizes everything before sending it to OpenAI for chat completion. If you use stuff, it takes everything as-is, so the chat history keeps appending and grows over time; after a while the model won’t remember the earlier parts.

OpenAI requires the prompt and chat history to predict the next token. The model is not made to remember anything; it only completes a prompt based on what the user sends. That’s why we save the history outside OpenAI, summarize it together with instructions, and then send it to OpenAI. It will act as if it is remembering, but in reality it only ever sees what you are sending in the current request.
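
For illustration, here is a minimal sketch of that summarize-outside-then-send idea using langchain’s summary memory (assuming the classic 0.0.x-era imports; newer langchain versions have reorganized these modules):

```python
from langchain.chat_models import ChatOpenAI
from langchain.chains import ConversationChain
from langchain.memory import ConversationSummaryMemory

llm = ChatOpenAI(model_name="gpt-3.5-turbo")

# Unlike "stuff" (raw history appended forever), a summary memory keeps a
# rolling compressed summary and re-sends only that with each new message.
chain = ConversationChain(llm=llm, memory=ConversationSummaryMemory(llm=llm))

print(chain.predict(input="Input 1"))
print(chain.predict(input="Input 2"))  # the model sees a summary of turn 1, not the raw text
```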

Sending the entire chat history every time makes sense in order to provide the conversation context to GPT. But why must I pay for the previous messages again and again? IMO, GPT should only count tokens for the new (last) message.

Citing ChatGPT as an example (“ChatGPT is likely doing this on the backend too”) doesn’t make sense, because ChatGPT doesn’t pay for tokens the way we do as API consumers.

You misunderstand. The previous responses are added to the current prompt as part of the entire question. For this reason, including 3 previous responses of 500 tokens each, plus a current prompt of 500 tokens, results in 2000 tokens needing to be evaluated, the exact same cost and processing power as submitting a fresh 2000-token prompt.

If you’d like to avoid this, summarize the former communications and include the summary (requiring far fewer tokens) with your current prompt, rather than supplying the past messages verbatim.
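
A minimal sketch of that approach, assuming the legacy openai Python SDK (< 1.0) the original post uses; the prompts and inputs are illustrative:

```python
import openai

def summarize(history):
    """Ask a cheaper model to compress the conversation so far."""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Summarize this conversation concisely."},
            {"role": "user", "content": "\n".join(history)},
        ],
    )
    return response["choices"][0]["message"]["content"]

past_turns = [
    "user: Input 1", "assistant: ...",
    "user: Input 2", "assistant: ...",
]
summary = summarize(past_turns)

# The summary (far fewer tokens) stands in for the raw history.
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": f"Conversation so far: {summary}"},
        {"role": "user", "content": "Input 3"},
    ],
)
print(response["choices"][0]["message"]["content"])
```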

It’s because there is no temporary storage inside the model that would keep the conversation around to be continued at some random time in the future.
All the model can do is process inputs into outputs, immediately.

That’s where @DevGirl’s explanation comes in.

1 Like

I really do understand that. My concern is all about paying for the same message more than once.

Let’s imagine a chat of:
user: A
assistant: B

In the first round, I’ll pay for A as input and B as output.

user: C
assistant: D

At this point I won’t pay for only C and D, but for A and B too.

Why is that?

Think of it like a detective solving a mystery.

Each conversation with OpenAI’s API is like a new clue in the detective’s casebook. The detective (the AI) needs to see all the previous clues (the entire message history) to understand the current context and solve the case (respond accurately).

If you only show the detective the latest clue without the context of the previous ones, they might jump to the wrong conclusion or miss the bigger picture. So, while it feels like you’re paying for the same clues again, you’re actually ensuring that the detective stays on the right track.

I do understand the importance of sending the entire chat history.

My concern is only about why I should pay again for a message that I already paid for in its own turn. Imagine a chat with 10 user messages and 10 assistant messages: the first message is paid for 10 times.
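
To make the arithmetic concrete, assume (purely for illustration) that every message is 100 tokens:

```python
# Input tokens billed across a 10-turn chat when the full history is re-sent
# on each call, with every message assumed to be 100 tokens (illustrative).
tokens_per_message = 100
total_input = 0
for turn in range(1, 11):
    history_messages = 2 * (turn - 1) + 1  # prior user/assistant pairs + the new user message
    total_input += history_messages * tokens_per_message

print(total_input)  # 10000 input tokens, vs. 1000 if only new messages were billed
```

The first user message appears in all 10 calls, so it is billed 10 times, which is exactly the pattern I’m describing.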

That’s because every time a new message is added to the conversation, OpenAI has to run the entire conversation through the model. As the number of tokens increases, the cost of inference increases.

If OpenAI decided not to charge for the entire history, it would have to calculate an average cost for every inference: one chat might have two messages, one might have 8, and one might have more than 16.

So they would have to charge more money even when your chat only has two messages.

So, the billing method they currently use actually gives us more flexibility.

You have to consider your average cost per conversation and how you can optimize it.

Optimization techniques include offloading summarization to a smaller model.

Example: if you are using GPT-4, then after every four messages you can use GPT-3.5-turbo to create a summary of the conversation, and use that summary instead of sending all previous messages. If done carefully, it can reduce the cost substantially.
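
A rough sketch of that rolling-summary pattern, again assuming the legacy openai SDK; the four-message threshold and prompts are illustrative:

```python
import openai

SYSTEM = {"role": "system", "content": "You are a helpful assistant."}
messages = [SYSTEM]

def compress(history):
    """Fold old turns into one summary using the cheaper model."""
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in history)
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Summarize the conversation below concisely."},
            {"role": "user", "content": transcript},
        ],
    )
    return resp["choices"][0]["message"]["content"]

for turn, user_input in enumerate(["Input 1", "Input 2", "Input 3", "Input 4", "Input 5"], start=1):
    messages.append({"role": "user", "content": user_input})
    resp = openai.ChatCompletion.create(model="gpt-4", messages=messages)
    messages.append({"role": "assistant", "content": resp["choices"][0]["message"]["content"]})

    if turn % 4 == 0:
        # Every four exchanges, collapse the history into a short summary
        # so subsequent GPT-4 calls are billed for far fewer input tokens.
        summary = compress(messages[1:])
        messages = [SYSTEM, {"role": "system", "content": f"Summary of the chat so far: {summary}"}]
```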

2 Likes