Any experiences yet on how to estimate the cost of o1-series API calls?
The pricing page gives the price per token, but the total number of tokens produced for a user message is unknown, because reasoning tokens from each internal step are billed as output as well. I am looking for guidance or experience on the range of tokens produced over the whole reasoning sequence, either in general or, even better, for specific use cases.
The costs could add up quickly if each stage can use up to 128k tokens and there can be 1…N stages. I am worried that the total cost could reach $30-$120 per query in some cases, exceeding the cost of human experts. Is that a reasonable worry?
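For context, the arithmetic itself is simple; the only unknown is the reasoning-token count, which is billed at the output rate. A minimal sketch of what I am trying to estimate (the default prices are the o1-preview rates listed at the time of writing and may change):

```python
def estimate_o1_cost(prompt_tokens, reasoning_tokens, visible_output_tokens,
                     input_price_per_1m=15.0, output_price_per_1m=60.0):
    """Estimated USD cost of one o1 request.

    reasoning_tokens is the unknown this thread is about: the hidden chain
    of thought is billed at the output rate even though you never see it.
    The default prices are the o1-preview rates at the time of writing;
    check the pricing page before relying on them.
    """
    input_cost = prompt_tokens * input_price_per_1m / 1_000_000
    output_cost = (reasoning_tokens + visible_output_tokens) * output_price_per_1m / 1_000_000
    return input_cost + output_cost
```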
Perhaps I am misunderstanding the point/concern but for a given API request, the context window is limited to 128k tokens, shared between input, reasoning and output tokens.
Of course you can have a multi-turn conversation but that would not change this upper limit of 128k tokens.
I agree that, depending on how many output tokens you aim for, it could still get fairly expensive, but if I am not mistaken just under $5 should be the maximum for a single o1-preview request.
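For what it's worth, here is the rough arithmetic behind that kind of ceiling, with the completion cap and per-token prices stated as assumptions rather than facts:

```python
# Rough per-request ceiling for o1-preview, assuming a 128k-token context
# window, a 32,768-token cap on the completion (reasoning + visible output),
# and $15 / $60 per 1M tokens for input / output.  All three figures are
# assumptions to check against the current docs and pricing page.
CONTEXT_WINDOW = 128_000
MAX_COMPLETION = 32_768
INPUT_PER_1M, OUTPUT_PER_1M = 15.0, 60.0

max_input = CONTEXT_WINDOW - MAX_COMPLETION
ceiling = max_input * INPUT_PER_1M / 1e6 + MAX_COMPLETION * OUTPUT_PER_1M / 1e6
print(f"${ceiling:.2f}")  # about $3.39 under these assumptions
```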
This may be of help and you may know this already.
With earlier models, only the tokens you can actually see (the prompt and the visible completion) are counted and billed. With the o1 models, the thought process dynamically creates additional prompts along the way, and the tokens from those prompts are counted and billed as well. Since there is no way to measure how many thought prompts will be created and run, there is no way to accurately gauge what a user prompt will ultimately cost once the thought prompts are included.
So the OP is wondering if there is at least a way to estimate the final cost given the initial prompt, without knowing how many thought prompts, and thus tokens, will be billed in the end.
It totally depends on the type of input: a normal AI model might write “hi” or might produce an analysis of each chapter of War and Peace. With o1, that variability is amplified significantly by the complexity of the question you provide and by how many rounds of reasoning the internal mashings-about decide to employ.
That is, each internal inference on a mini-task carries the original context of the input you provided, plus everything that has been added so far, for every iterative turn and every new little answer the AI has to produce and append. Like the diagram, but 30x wider.
Short input: sending “hi” as input might only trigger one or two reasoning rounds digging into the deeper nature of what “hi” could mean and whether asking it violates OpenAI policies (and yes, you pay for those policy rounds). On the other hand, “without omission, in the field of orbital astrodynamics, provide the formulas used in each maneuver in launching and landing an Apollo moon mission, and the rationale and prerequisites of understanding behind each” is also a short input, but one that can build a huge internal context (if the model doesn’t go lazy).
Long input: You are paying for that document in tokens over and over and over.
We can play the “guess how many tokens” game for a particular question in this topic, and then make the call and pay. A progress report is shown in ChatGPT, but not to API users, the ones actually paying who might want an audit.
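One partial consolation: the usage object returned by the API does break out the reasoning tokens after the fact, so you can at least audit each request once it has been paid for. A sketch, with field names as exposed by recent versions of the official Python SDK (worth verifying against your SDK version):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="o1-preview",
    messages=[{"role": "user", "content": "hi"}],
)

# completion_tokens includes the hidden reasoning tokens;
# completion_tokens_details breaks them out separately if available.
usage = response.usage
details = getattr(usage, "completion_tokens_details", None)
reasoning = getattr(details, "reasoning_tokens", None) if details else None
print("prompt:", usage.prompt_tokens,
      "completion:", usage.completion_tokens,
      "of which reasoning:", reasoning)
```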
If it could take up to 30x the work, say 90 steps of 10,000 tokens each, or 10 steps with 128,000 tokens each, then the price would likely exceed the cost of an expert anywhere in the world. An expert in a low-cost country would be much more attractive, especially if said expert uses a $20 LLM subscription.
“without omission, in the field of orbital astrodynamics, provide the formulas used in each maneuver in launching and landing an Apollo moon mission, and the rationale and prerequisites of understanding behind each”
Of course I am not sure, but I think an MSc or PhD student in physics could write this within an hour at the same accuracy as o1. Without physics studies it would take longer, but it would still be doable within a few hours. It would not be 100% accurate, but neither is o1.
So that makes me wonder whether this is an experiment in human-level AGI pricing, or whether the intent is to discourage use of o1 while keeping it available for the sake of appearances. Using ChatGPT seems a better deal at 120 messages per month, about 17 cents each, if one only uses o1.
Alternatively, is it possible that the average o1 query still comes to around 17 cents, because each step only consumes a few tokens?
A final thought: I would imagine many of the mentioned use cases (genetics, physics) would benefit from automating lots of queries instead of just a few runs, if the model can actually solve problems. Something like “I need to find a material that fits these criteria, please solve for these 1,000 materials”, but then it should be cheaper than having 100 research assistants from India solve it manually. The research assistants would probably do a better job and might also make some additional observations along the way.
That’s a very valid point… and I attempted to reason about it by leveraging o1.
This is the result:
Input Tokens = Prompt + System
Output Tokens = Reasoning + Final Output
Why Are Reasoning Tokens Counted as Output?
Generation Phase: The model’s reasoning happens during the generation of the response. Since it’s part of the output process, it falls under completion tokens.
Resource Utilization: The computational resources used to generate hidden reasoning steps are part of the model’s workload in producing a response.
Hidden Costs: Hidden reasoning tokens increase the completion token count, which can affect your overall costs.
Example Scenario
User Query:
“Explain the impact of the Industrial Revolution on modern society with maximum 350 output tokens.”
Token Consumption Estimation:
• Prompt Tokens: Let’s say the prompt consumes 15 tokens.
• Model’s Hidden Reasoning (Completion Tokens): The model internally uses 150 tokens to process and reason about the question.
• Visible Output (Completion Tokens): The model generates a 350-token response.
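Plugging those numbers into the pricing (again assuming o1-preview rates of $15 / $60 per 1M tokens for input / output) gives a feel for how the hidden reasoning inflates the bill:

```python
# Cost of the example above, assuming o1-preview rates of
# $15 / $60 per 1M tokens for input / output.
prompt_tokens = 15
reasoning_tokens = 150   # hidden, but billed as completion tokens
visible_tokens = 350

input_cost = prompt_tokens * 15 / 1_000_000
output_cost = (reasoning_tokens + visible_tokens) * 60 / 1_000_000
print(f"${input_cost + output_cost:.4f}")  # about $0.0302, roughly 30% of it hidden reasoning
```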
This conversation got me thinking over the last few days.
I’m obviously not the first person to express this, but looking at my output token consumption, I definitely would want greater control over the reasoning tokens.
The max_completion_tokens parameter is inherently constrained as it does not allow you to differentiate between reasoning tokens and the visible output tokens.
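To make that concrete: the cap applies to the combined budget, so a low value can be consumed entirely by hidden reasoning and return an empty message. A sketch based on the documented behaviour at the time of writing:

```python
from openai import OpenAI

client = OpenAI()

# max_completion_tokens caps reasoning + visible output together; there is
# no separate knob for the reasoning tokens alone.
response = client.chat.completions.create(
    model="o1-preview",
    messages=[{"role": "user",
               "content": "Explain the impact of the Industrial Revolution on modern society."}],
    max_completion_tokens=2_000,
)

choice = response.choices[0]
if choice.finish_reason == "length":
    print("Token budget exhausted, possibly entirely on hidden reasoning.")
else:
    print(choice.message.content)
```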
I wonder whether a tiered system for reasoning tokens could be a solution whereby you for example have three different tiers you can choose from depending on the complexity of a problem. Each tier would have an associated upper boundary / ceiling for the max reasoning tokens.
There are of course situations where it makes complete sense to exhaust the full reasoning tokens, especially when it comes to more open-ended problems where the model would be used to come up with new solutions.
However, there are also cases where a reasoning path may be fairly pre-defined and where the focus is more on using the model’s capabilities to apply that pre-defined reasoning path to new data / input. One may argue that this is not the core purpose of o1, but I’d think that even with clearly defined reasoning paths o1 still frequently produces superior outputs compared to gpt-4o or gpt-4-turbo.
Either way, more steerability of the reasoning tokens and reasoning process in an API environment would really be desirable.
I agree. There could also be other measures that would help in the meantime while those are built, such as a maximum cost cap per request (“You will not be billed over $5 per request sent…”).
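Until something like that exists server-side, the closest client-side stand-in I can think of is translating a dollar cap into a max_completion_tokens value before sending the request. A rough sketch (prices are assumptions as before, and this can only cap the completion side, not the prompt cost):

```python
def completion_cap_for_budget(budget_usd, prompt_tokens,
                              input_per_1m=15.0, output_per_1m=60.0):
    """Translate a per-request dollar budget into a max_completion_tokens value.

    A client-side stand-in for a "you will not be billed over $X" flag:
    subtract what the prompt will cost, then cap reasoning + visible output
    so the total stays under budget.  Prices are assumptions; check the
    pricing page.
    """
    input_cost = prompt_tokens * input_per_1m / 1_000_000
    remaining = max(budget_usd - input_cost, 0.0)
    return int(remaining * 1_000_000 / output_per_1m)

print(completion_cap_for_budget(1.00, 2_000))  # ~16,166 completion tokens for a $1 cap
```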
For now, before any further development needs to be done, the most helpful thing would be to get some ranges of what to expect. If anyone from OpenAI reads this, perhaps it would be possible to reveal the token distribution from ChatGPT users’ tasks? Something like: 80% of requests fall between X and Y tokens, and the top 1% between Z and R tokens.
Otherwise it’s hard to budget whether we can use o1 or not, and getting a budget just to test the cost distribution with reasonable certainty is even harder…