Token usage on Responses API with previous_response_id

Hello All

I use chat (text only) through the API.

I like the Responses API, though it is not really production-grade yet. I understand that, when previous_response_id is given, all the previous inputs are charged. Fair enough.

Now, what I don’t understand is how the input tokens are calculated. I need to know this so that I can budget users’ conversations accordingly.

In the following example, from the 2nd row onwards, the exact same system prompt was used, with every other parameter being EXACTLY the same. The output varies a bit, and that's fine. I am just trying to figure out the proportional increase in input tokens: it doesn't make mathematical sense to me, even if I apply a percentage for every message sent!

For the 1st row below, the input is very low as it was the 1st message and there was no “previous response”.

| inputTokens | outputTokens |
| --- | --- |
| 148 | 775 |
| 1071 | 849 |
| 2068 | 814 |
| 3030 | 905 |
| 4083 | 904 |
| 5135 | 910 |
| 6193 | 937 |
| 7278 | 947 |
| 8373 | 947 |
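
For reference, a call sequence that produces numbers like these looks roughly like the sketch below (assuming the official openai Python SDK; the model name and prompt texts are placeholders):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = "..."                      # the same system prompt every turn (placeholder)
user_messages = ["first question", "..."]  # one entry per user turn (placeholders)

previous_id = None
for turn, text in enumerate(user_messages, start=1):
    response = client.responses.create(
        model="gpt-4.1-mini",  # placeholder model name
        input=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": text},
        ],
        previous_response_id=previous_id,  # None on the first turn
    )
    previous_id = response.id
    print(turn, response.usage.input_tokens, response.usage.output_tokens)
```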

Can someone help demystify this for me please?

Thanks!

Are you passing the “system” prompt through a role message, or using the instructions parameter?

You usually don't need more than one system (developer) prompt, sent as a role message on the very first turn; it will be preserved on the following turns.

Note, though, that the instructions parameter is volatile (it applies only to the request it is sent with) and will not be carried over, even when using previous_response_id.

What it looks like is that, in addition to the previous conversation, you may be adding a new system prompt of roughly 100–200 tokens on every turn (through a role message or instructions).
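
A minimal sketch of that difference, assuming the openai Python SDK (model name and prompts are illustrative):

```python
from openai import OpenAI

client = OpenAI()

# Turn 1: `instructions` applies only to this request; it is not stored server-side.
first = client.responses.create(
    model="gpt-4.1-mini",                      # illustrative model name
    instructions="Answer in formal English.",  # volatile system-message prefix
    input="Summarise the refund policy.",
)

# Turn 2: previous_response_id restores the stored conversation,
# but the instructions above are NOT carried over; repeat them if still needed.
second = client.responses.create(
    model="gpt-4.1-mini",
    instructions="Answer in formal English.",
    input="Now make it shorter.",
    previous_response_id=first.id,
)
```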


You are sending the same system message and/or user messages each time: the actual result works out to 137 tokens of text if it is a single message, or 134 tokens if there are two messages (the message containers take 4 tokens each, and the final prompt for the AI to write takes 3 tokens).

Since the initial input is 148 and each following input adds about that much again, I'd conclude there is only one prompt message per turn, with no constant system message in “instructions” (or one present only as the first chat turn of input).

Since you say you are using the “exact same system prompt”, I conclude you are using the input messages incorrectly. You should not keep sending a system/developer message when using the server-side state, unless you want a chat history full of duplicated system messages, one for every user input!

Either send the “system” message in the input message list only once, when starting a new session, or use the “instructions” API parameter to insert a constant system-message prefix before the chat history on every request.
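
In code, the recommended pattern looks roughly like this sketch (assuming the openai Python SDK; model name and prompts are illustrative):

```python
from openai import OpenAI

client = OpenAI()

# New session: the developer/system message goes into the input list once.
first = client.responses.create(
    model="gpt-4.1-mini",  # illustrative model name
    input=[
        {"role": "developer", "content": "You are a concise support agent."},
        {"role": "user", "content": "Where is my order?"},
    ],
)

# Later turns: only the new user message plus previous_response_id.
# The stored history (including the single developer message) is reused server-side.
follow_up = client.responses.create(
    model="gpt-4.1-mini",
    input=[{"role": "user", "content": "It was placed last Tuesday."}],
    previous_response_id=first.id,
)
print(follow_up.usage.input_tokens, follow_up.usage.output_tokens)
```

Here is your data broken down on the assumption that a fixed prompt block of about 144 tokens is being added every turn: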


| Turn | inputTokens | outputTokens | Δ input vs prev | prev assistant + 4 | newest prompt (wrapper + content) | newest prompt content |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 148 | 775 | unknown | unknown | | |
| 2 | 1071 | 849 | 923 | 779 | 144 | 140 |
| 3 | 2068 | 814 | 997 | 853 | 144 | 140 |
| 4 | 3030 | 905 | 962 | 818 | 144 | 140 |
| 5 | 4083 | 904 | 1053 | 909 | 144 | 140 |
| 6 | 5135 | 910 | 1052 | 908 | 144 | 140 |
| 7 | 6193 | 937 | 1058 | 914 | 144 | 140 |
| 8 | 7278 | 947 | 1085 | 941 | 144 | 140 |
| 9 | 8373 | 947 | 1095 | 951 | 144 | 140 |
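
The derived columns are plain arithmetic over the usage numbers you reported, for example:

```python
# Recomputing the derived columns from the reported usage numbers.
input_tokens  = [148, 1071, 2068, 3030, 4083, 5135, 6193, 7278, 8373]
output_tokens = [775,  849,  814,  905,  904,  910,  937,  947,  947]

for turn in range(1, len(input_tokens)):
    delta = input_tokens[turn] - input_tokens[turn - 1]    # Δ input vs prev
    prev_assistant = output_tokens[turn - 1] + 4            # previous reply + 4-token wrapper
    newest_prompt = delta - prev_assistant                  # newest prompt, wrapper + content
    print(turn + 1, delta, prev_assistant, newest_prompt, newest_prompt - 4)
```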

The conversation state continues to be fed back into the model on every turn. Here's an example with understandable, traceable figures:

Turn 1 input:

system: 96 tokens + 4 token overhead = 100
user 1: 146 tokens + 4 token overhead = 150
prompt: 3 tokens
TOTAL: 253

Turn 1 output:

assistant: 46 tokens output

Turn 2:

system: 96 tokens + 4 token overhead = 100
user 1: 146 tokens + 4 token overhead = 150
assistant 1: the 46 tokens generated before + 4 token overhead = add 50
user 2: 46 tokens of new prompt text + 4 token overhead = add 50
prompt: 3 tokens
TOTAL: 353

Turn 3:

Keep piling on new messages, with no length management offered to you: your only choices are getting an error, or eventually having some messages dropped once you hit the model's maximum input (which can be a million tokens).
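
Under this accounting (a 4-token wrapper per message plus a 3-token final prompt; the exact overhead can vary by model), a rough next-turn input estimator looks like this sketch:

```python
# Rough estimate of the next request's input tokens from the running history:
# every message costs its content plus a 4-token wrapper, and each request
# ends with a 3-token prompt for the assistant to answer.
def estimate_input_tokens(message_token_counts: list[int]) -> int:
    return sum(count + 4 for count in message_token_counts) + 3

print(estimate_input_tokens([96, 146]))          # Turn 1: system + user 1 -> 253
print(estimate_input_tokens([96, 146, 46, 46]))  # Turn 2: + assistant 1 + user 2 -> 353
```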


Note: the only party encouraging you to use this server-side state is OpenAI. You are locking yourself into their platform, trusting them to maintain your data and not lock you out of or ban your organization account. That is besides the fact that it places no limit on the length of the recurring chat input.