I like the Responses API, though it is not really production-grade yet. I understand that, when previous_response_id is given, all of the previous input is charged again. Fair enough.
Now, what I don’t understand is how the input tokens are calculated. I need to know this so that I can budget users’ conversations accordingly.
In the following example, from the 2nd row onwards, the exact same system prompt was used, with every other parameter being EXACTLY the same. The output varies a bit, and that's OK. I am just trying to figure out the proportional increase: it doesn't make any mathematical sense to me, even if I apply a percentage for every message sent!
For the 1st row below, the figure is very low, as it was the 1st message and there was no "previous response".
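For concreteness, here is a minimal sketch of the kind of per-turn loop behind these numbers (the model name and prompts are placeholders, not my exact code); the open question is what the logged usage.input_tokens should add up to:

```python
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = "…the system prompt…"       # placeholder
USER_TURNS = ["question 1", "question 2"]    # placeholder conversation

previous_id = None
for turn, user_text in enumerate(USER_TURNS, start=1):
    response = client.responses.create(
        model="gpt-4o",                       # assumed model
        instructions=SYSTEM_PROMPT,           # or a system-role message in the input list
        input=[{"role": "user", "content": user_text}],
        previous_response_id=previous_id,     # None on turn 1, then chained
    )
    print(turn, response.usage.input_tokens, response.usage.output_tokens)
    previous_id = response.id
```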
Are you passing the "system" prompt through a role message, or using the instructions parameter?
Usually you don't need more than one system (developer) prompt, sent as a role message on the very first turn; it is preserved in the following turns.
Note, though, that the instructions parameter is volatile (it applies only to the request it is sent with) and is not carried over, even when using previous_response_id.
It seems that, in addition to the previous conversation, you may be adding a new system prompt of roughly 100 to 200 tokens on every turn (through a role message or instructions).
You are sending the same system message and/or user messages each time: the actual result is 137 tokens of text if it is a single message, or 134 tokens of input if there are two messages (each message container takes 4 tokens, and the final prompt for the AI to write takes 3 tokens).
Since the initial input is 148, and each following turn adds about that much again for its newest prompt, I'd conclude there is only one prompt message per turn: no constant system message in "instructions", or one sent only as the first chat turn of input.
Since you say you are sending the "exact same system prompt" each time, I conclude you are using the input messages incorrectly. You should not keep sending a system/developer message when using the server-side state, unless you want a chat history full of duplicated system messages, one for every user input!
Either send the "system" message in the input list only once, when starting a new session, or use the "instructions" API parameter, which inserts a system message prefix before the chat history on every request.
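A minimal sketch of both options with the Responses API (the model name and prompt text are placeholders):

```python
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o"  # assumed model name

# Option A: send the system/developer message once, on the first turn only.
# The server-side state behind previous_response_id carries it forward.
first = client.responses.create(
    model=MODEL,
    input=[
        {"role": "developer", "content": "You are a terse assistant."},
        {"role": "user", "content": "First question"},
    ],
)
second = client.responses.create(
    model=MODEL,
    input=[{"role": "user", "content": "Follow-up question"}],  # no system message repeated
    previous_response_id=first.id,
)

# Option B: use the instructions parameter instead. It is not carried over by
# previous_response_id, so it must be passed again on every request.
second_b = client.responses.create(
    model=MODEL,
    instructions="You are a terse assistant.",
    input=[{"role": "user", "content": "Follow-up question"}],
    previous_response_id=first.id,
)
```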
| Turn | Input tokens | Output tokens | Δ input vs. prev | Prev assistant + 4 | Newest prompt (wrapper + content) | Newest prompt content |
|-----:|-------------:|--------------:|-----------------:|-------------------:|----------------------------------:|----------------------:|
| 1 | 148 | 775 | – | – | unknown | unknown |
| 2 | 1,071 | 849 | 923 | 779 | 144 | 140 |
| 3 | 2,068 | 814 | 997 | 853 | 144 | 140 |
| 4 | 3,030 | 905 | 962 | 818 | 144 | 140 |
| 5 | 4,083 | 904 | 1,053 | 909 | 144 | 140 |
| 6 | 5,135 | 910 | 1,052 | 908 | 144 | 140 |
| 7 | 6,193 | 937 | 1,058 | 914 | 144 | 140 |
| 8 | 7,278 | 947 | 1,085 | 941 | 144 | 140 |
| 9 | 8,373 | 947 | 1,095 | 951 | 144 | 140 |
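The last three columns fall straight out of the first two; a quick check in Python with the numbers copied from the table above:

```python
# Sanity check: each turn's input growth equals (previous output + 4)
# plus a constant 144-token newest prompt.
input_tokens  = [148, 1071, 2068, 3030, 4083, 5135, 6193, 7278, 8373]
output_tokens = [775, 849, 814, 905, 904, 910, 937, 947, 947]

for i in range(1, len(input_tokens)):
    delta = input_tokens[i] - input_tokens[i - 1]
    prev_assistant = output_tokens[i - 1] + 4      # prior assistant message + wrapper
    print(i + 1, delta, prev_assistant, delta - prev_assistant)  # last value is 144 every turn
```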
The conversation state continues to be fed back into the model. Here's an example with understandable, traceable figures (the input for a second turn):

- system: 96 tokens + 4 token overhead = 100
- user 1: 146 tokens + 4 token overhead = 150
- assistant 1: 46 tokens generated previously + 4 token overhead = add 50
- user 2: 46 tokens of new prompt text + 4 token overhead = add 50
- final prompt: 3 tokens

TOTAL: 353
Turn 3 and beyond: keep piling on the new messages, with no length management offered to you except your choice between an error or, eventually, dropping some messages at the model's maximum input (which can be a million tokens).
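If you want to budget this client-side rather than discover it on the invoice, a rough estimator along the lines of the accounting above might look like the sketch below. The o200k_base encoding and the exact 4-token and 3-token overheads are assumptions taken from the figures in this thread; verify against response.usage.input_tokens for your model.

```python
import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # assumed tokenizer for current models

def estimate_input_tokens(messages: list[dict]) -> int:
    """Mirror the accounting above: content + ~4 wrapper tokens per message,
    plus ~3 tokens for the final prompt the model writes after."""
    total = sum(len(enc.encode(m["content"])) + 4 for m in messages)
    return total + 3

history = [
    {"role": "system", "content": "…96 tokens of system prompt…"},   # placeholders
    {"role": "user", "content": "…146 tokens of question…"},
    {"role": "assistant", "content": "…46 tokens of answer…"},
    {"role": "user", "content": "…46 tokens of follow-up…"},
]
# With messages of 96/146/46/46 tokens this totals 100 + 150 + 50 + 50 + 3 = 353.
print(estimate_input_tokens(history))
```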
Note: the only party encouraging you to use this server-side state is OpenAI. You are locking yourself into their platform, trusting them to maintain your data and not lock you out of, or ban, your organization account. And that is besides the fact that it places no limit on the length of the recurring chat input.