I'm encountering an issue where identical Chat Completions requests (model gpt-4o-2024-08-06) result in different input token counts on the OpenAI dashboard. Specifically, I'm sending the exact same input, with no changes to the prompt, yet the total number of input tokens differs between executions.
For example:
On the first run, the total input token count was 399,341, broken down into 110,061 uncached tokens and 289,280 cached tokens.
On the second run, with the exact same input, the total input token count was 393,172, broken down into 101,332 uncached tokens and 291,840 cached tokens.
While I understand that the number of cached vs. uncached tokens may vary depending on caching mechanisms, I would expect the total number of tokens to remain the same, as the input is identical. Can someone explain why this discrepancy in the total token count is happening, even though the input hasn’t changed?
The context length of gpt-4o-2024-08-06 is 128k tokens, and some of that has to be reserved for receiving a response, so it is far less than the usage you state.
So the only way you would obtain a count like 390k input tokens is through recursive operations done by function calls or loops you perform yourself, or with an agent framework like LangChain or OpenAI's Assistants. The o1 model also performs multiple internal calls, but that token count seems excessive.
If any of that iteration relies on past output, there you have your source of varying input. The AI won’t call the tools the same way, and might not call them at all, because the model output is not deterministic, even with constrained sampling parameters.
If you are using OpenAI embeddings to do a vector database search, that also does not return deterministic results: different vectors give different rankings, which give different input to the AI model, and that is another source of varying input.
Thanks for your reply.
A little more context on what I'm doing exactly.
I have 128 CVEs in a JSON file, along with a schema file to format the output from the LLM, and a separate file containing my prompt.
My script processes all CVEs by iterating through them, sending a single CVE with my prompt and schema per request (due to the limited context window). Each request amounts to approximately 3,100 input tokens. However, the dashboard cannot display tokens per request; it only shows the total number of tokens used within a 15-minute span. As a result, I can only see the total sum of input tokens for all requests that fall in the same 15-minute window.
The difference in cached tokens between runs is 2,560. That fits the (1,024 + n × 128)-token increments that prompt caching works in (2,560 = 1,024 + 12 × 128). The difference may originate from the very first API call of a run not being cached, while later calls hit the cache for the common context. Looking at the other figures, there is no solution other than the common instructions plus schema input being about 2,560 tokens (up to 2,687); the quick check after the list below runs the arithmetic.
On the first run: 113 cached calls
On the second run: 114 cached calls
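Those call counts fall straight out of the figures above. Here is the arithmetic as a small Python check, under the assumption that the cached portion of every hit is exactly that ~2,560-token common prefix (an inference from the numbers, not something the dashboard reports):

```python
# Assumed size of the shared prompt + schema prefix, inferred from the cached-token delta
common_prefix = 2_560

run1_cached = 289_280
run2_cached = 291_840

print(run1_cached // common_prefix)   # 113 -> requests whose prefix hit the cache on run 1
print(run2_cached // common_prefix)   # 114 -> requests whose prefix hit the cache on run 2
print(run2_cached - run1_cached)      # 2560 -> exactly one extra cached prefix on run 2
```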
So of the 128 requests, over 10% are either not being reported as cached, or are not hitting the cache mechanism despite the commonality.
The larger discrepancy in total input can come from the varying size of the non-common data elements, and from the cache return not kicking in for overlapping initial requests.
The underlying issue in the first post, which had to be inferred from the title to know whose dashboard you are talking about, is that you are trusting the platform's usage page to report your usage correctly, and trusting the 15-minute splits to be as discrete as you hope. In recent days it has had faults as major as no billing showing up at all for multiple users over consecutive days. Even when operating normally, usage can still trickle in. That's the explanation you're asking for.
You should log the usage statistics returned by the API call itself, e.g. usage_list.append({index: response.usage}), which can be further parsed, and which also lets you see whether all API calls are succeeding or silently failing (something your saved task output should also show). Then you should see identical token counts for the same input, and can expect that to be your bill.
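A minimal sketch of that logging, assuming the official openai Python SDK (v1.x); `cves`, `system_prompt` and `schema_format` are stand-ins for your own CVE data, prompt file and structured-output schema:

```python
import json
from openai import OpenAI

client = OpenAI()
usage_list = []

for index, cve in enumerate(cves):  # cves / system_prompt / schema_format: loaded from your own files
    response = client.chat.completions.create(
        model="gpt-4o-2024-08-06",
        messages=[
            {"role": "system", "content": system_prompt},  # common instructions, identical every call
            {"role": "user", "content": json.dumps(cve)},  # one CVE per request
        ],
        response_format=schema_format,                     # your JSON schema response format
    )
    usage_list.append({index: response.usage})

# Sum what the API itself reported instead of trusting the dashboard
prompt_total = sum(u.prompt_tokens for d in usage_list for u in d.values())
cached_total = sum(u.prompt_tokens_details.cached_tokens for d in usage_list for u in d.values())
print(f"input tokens: {prompt_total}, of which cached: {cached_total}")
```

If the per-request prompt_tokens are identical across runs, any remaining difference in the dashboard totals is just reporting and bucketing, not billing.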
Thank you for the detailed response. I’ll definitely start logging everything myself in the future… I guess I was a bit naïve to rely 100% on the dashboard.
One more thing to consider: if you are doing your own batching over chat completions and want the maximum input discount for qualifying commonality while still getting maximum throughput, don't asyncio.gather, asyncio.Queue, or spin off QThreads without a first independent API call on item 1 that seeds the common prefix (prompt and schema) in OpenAI's cache, and maybe still wait a bit after it.
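A sketch of that ordering with the async SDK; `build_messages`, `schema_format` and `cves` are hypothetical stand-ins for your own code, and the short sleep is only a grace period I'm assuming helps, not documented behaviour:

```python
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def classify(cve):
    # build_messages(): your own helper assembling the common prompt + schema + one CVE
    return await client.chat.completions.create(
        model="gpt-4o-2024-08-06",
        messages=build_messages(cve),
        response_format=schema_format,
    )

async def run_all(cves):
    first = await classify(cves[0])   # one independent call first, so the common prefix gets cached
    await asyncio.sleep(2)            # brief grace period before fanning out (assumption)
    rest = await asyncio.gather(*(classify(cve) for cve in cves[1:]))
    return [first, *rest]

results = asyncio.run(run_all(cves))
```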