I'm encountering an issue where identical Chat Completions requests (model gpt-4o-2024-08-06) result in different input token counts on the OpenAI dashboard. Specifically, I'm sending the exact same input, with no changes to the prompt, yet the total number of input tokens differs between executions.
For example:
On the first run, the total input token count was 399,341, broken down into 110,061 uncached tokens and 289,280 cached tokens.
On the second run, with the exact same input, the total input token count was 393,172, broken down into 101,332 uncached tokens and 291,840 cached tokens.
While I understand that the number of cached vs. uncached tokens may vary depending on caching mechanisms, I would expect the total number of tokens to remain the same, as the input is identical. Can someone explain why this discrepancy in the total token count is happening, even though the input hasn’t changed?
The context length of gpt-4o-2024-08-06 is 128k tokens, and some of that has to be reserved for receiving a response, so it is far less than the usage you state.
So the only way you would obtain a count like 390k input tokens is through recursive operations done by function calls or loops you perform yourself, or with an agent framework like LangChain or OpenAI's Assistants. The o1 model also performs multiple internal calls, but that token count seems excessive.
If any of that iteration relies on past output, there you have your source of varying input. The AI won’t call the tools the same way, and might not call them at all, because the model output is not deterministic, even with constrained sampling parameters.
If you are using OpenAI embeddings to do a vector database search, that also does not return deterministic results: different vectors give different rankings, which give different input to the AI model, and that is another source of varying input.
Thanks for your reply.
A little more context on what I'm doing exactly.
I have 128 CVEs in a JSON file, along with a schema file to format the output from the LLM, and a separate file containing my prompt.
My script processes all CVEs by iterating through them, sending a single CVE with my prompt and schema per request (due to the limited context window). Each request amounts to approximately 3,100 input tokens. However, the dashboard cannot display tokens per request; it only shows the total number of tokens used within a 15-minute span. As a result, I can only see the total sum of input tokens for all requests that fall in the same 15-minute window.
The difference in cached tokens between runs is 2,560. That fits the (1,024 + n × 128)-token increments that prompt caching works in (2,560 = 1,024 + 12 × 128). The difference may originate from the very first API call of a run not being cached, while later calls hit the cache for the common context. Looking at the other figures, there is no solution other than the common instructions plus schema input being about 2,560 tokens (up to 2,687); the quick check after the list below runs the arithmetic.
On the first run: 113 cached calls
On the second run: 114 cached calls
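Those call counts fall straight out of the figures above. Here is the arithmetic as a small Python check, under the assumption that the cached portion of every hit is exactly that ~2,560-token common prefix (an inference from the numbers, not something the dashboard reports):

```python
# Assumed size of the shared prompt + schema prefix, inferred from the cached-token delta
common_prefix = 2_560

run1_cached = 289_280
run2_cached = 291_840

print(run1_cached // common_prefix)   # 113 -> requests whose prefix hit the cache on run 1
print(run2_cached // common_prefix)   # 114 -> requests whose prefix hit the cache on run 2
print(run2_cached - run1_cached)      # 2560 -> exactly one extra cached prefix on run 2
```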
So of the 128 requests, over 10% are either not being reported as cached, or are not hitting the cache mechanism despite the commonality.
The larger discrepancy in total input can come from the varying size of the non-common data elements, and from the cache return not kicking in for overlapping initial requests.
The underlying issue in the first post, which had to be inferred from the title to know whose dashboard you are talking about, is that you are trusting the platform's usage page to report your usage correctly, and trusting the 15-minute splits to be as discrete as you hope. In recent days it has had faults as major as no billing showing up at all for multiple users over consecutive days. Even when operating normally, usage can still trickle in. That's the explanation you're asking for.
You should log the usage statistics returned by the API call itself, e.g. usage_list.append({index: response.usage}), which can be further parsed, and which also lets you see whether all API calls are succeeding or silently failing (something your saved task output should also show). Then you should see identical token counts for the same input, and can expect that to be your bill.
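A minimal sketch of that logging, assuming the official openai Python SDK (v1.x); `cves`, `system_prompt` and `schema_format` are stand-ins for your own CVE data, prompt file and structured-output schema:

```python
import json
from openai import OpenAI

client = OpenAI()
usage_list = []

for index, cve in enumerate(cves):  # cves / system_prompt / schema_format: loaded from your own files
    response = client.chat.completions.create(
        model="gpt-4o-2024-08-06",
        messages=[
            {"role": "system", "content": system_prompt},  # common instructions, identical every call
            {"role": "user", "content": json.dumps(cve)},  # one CVE per request
        ],
        response_format=schema_format,                     # your JSON schema response format
    )
    usage_list.append({index: response.usage})

# Sum what the API itself reported instead of trusting the dashboard
prompt_total = sum(u.prompt_tokens for d in usage_list for u in d.values())
cached_total = sum(u.prompt_tokens_details.cached_tokens for d in usage_list for u in d.values())
print(f"input tokens: {prompt_total}, of which cached: {cached_total}")
```

If the per-request prompt_tokens are identical across runs, any remaining difference in the dashboard totals is just reporting and bucketing, not billing.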
Thank you for the detailed response. I’ll definitely start logging everything myself in the future… I guess I was a bit naïve to rely 100% on the dashboard.
One more thing to consider: if you are doing your own batching over chat completions and want the maximum input discount for qualifying commonality while still getting maximum throughput, don't asyncio.gather, asyncio.Queue, or spin off QThreads without a first independent API call on item 1 that seeds the common prefix (prompt and schema) in OpenAI's cache, and maybe still wait a bit after it.
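A sketch of that ordering with the async SDK; `build_messages`, `schema_format` and `cves` are hypothetical stand-ins for your own code, and the short sleep is only a grace period I'm assuming helps, not documented behaviour:

```python
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def classify(cve):
    # build_messages(): your own helper assembling the common prompt + schema + one CVE
    return await client.chat.completions.create(
        model="gpt-4o-2024-08-06",
        messages=build_messages(cve),
        response_format=schema_format,
    )

async def run_all(cves):
    first = await classify(cves[0])   # one independent call first, so the common prefix gets cached
    await asyncio.sleep(2)            # brief grace period before fanning out (assumption)
    rest = await asyncio.gather(*(classify(cve) for cve in cves[1:]))
    return [first, *rest]

results = asyncio.run(run_all(cves))
```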