I would like to ask a question, hopefully somebody can help me answer it.
Is there a way to calculate the cost of a streaming LLM request whose response we never actually consume?
I ask because in some applications I launch the request, but the generated response is only needed if certain conditions are satisfied (e.g., `skip_this_request=False`).
Thanks a lot in advance for any ideas!
```python
# LLM request: this must cost something, but how much?
response = await aclient.responses.create(model="gpt-4o", input=messages, stream=True)

# Generation happens here; token usage is only available at the end of the iteration
async for x in response:
    print(f"Chunk: {x}")
```
Do you mean calculating the cost of the resultant output if it were produced?
No, not really. Cost is calculated from token counts, and the number of output tokens always varies from run to run. With reasoning models, that variance is even higher.
Sending a request to the model will always trigger a response.
The real question is: how to decide whether a response is actually needed?
If the answer can be determined programmatically, the best option is to avoid sending the request at all.
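As a minimal sketch of the programmatic route (the flag name `skip_this_request` comes from the question above; the helper and its logic are illustrative assumptions):

```python
# Hypothetical gate: decide locally whether a model response is needed,
# so no request is sent (and no tokens are billed) when it isn't.
def should_send_request(skip_this_request: bool, messages: list) -> bool:
    """Return True only when a model response is actually required."""
    return not skip_this_request and len(messages) > 0

# Usage: only call the API when the gate passes.
messages = [{"role": "user", "content": "Hello"}]
if should_send_request(skip_this_request=False, messages=messages):
    pass  # response = await aclient.responses.create(...)  # billed only here
```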
If the model itself has to decide whether to reply, there are mechanisms to handle that:
With the Completions API, you can use a stop token and structure the prompt so the model emits it right away, effectively interrupting generation.
With the Realtime API, you can update the session to stop further responses.
This approach essentially acts like a classifier.
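A sketch of the stop-token idea for the (legacy) Completions API. The sentinel string, prompt wording, and model name here are illustrative assumptions, not a prescribed recipe:

```python
# Sketch: structure the prompt so the model can emit a sentinel immediately,
# and pass that sentinel as a stop sequence so generation halts right away.
SENTINEL = "NO_REPLY"

def build_completion_kwargs(user_text: str) -> dict:
    """Build Completions API arguments where a "no reply needed" verdict
    costs almost nothing, because the stop sequence cuts generation short."""
    prompt = (
        f"If the following message needs no reply, output {SENTINEL} "
        f"and nothing else. Otherwise, reply normally.\n\n"
        f"Message: {user_text}\nReply:"
    )
    return {
        "model": "gpt-3.5-turbo-instruct",
        "prompt": prompt,
        "stop": [SENTINEL],  # generation stops as soon as the sentinel appears
    }

# Usage: client.completions.create(**build_completion_kwargs("thanks, bye!"))
```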
If the API in use doesn’t offer an option to terminate the response early, one option is to build a lightweight classifier with a small model. Alternatively, you can leverage prompt caching to keep latency and cost under control.
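A sketch of the classifier approach, assuming a small, cheap model (`gpt-4o-mini` here, purely as an illustration) is asked for a YES/NO verdict before the expensive request is sent:

```python
# Hypothetical classifier gate: a small model answers YES/NO cheaply;
# the expensive request is only sent when the verdict is YES.
def parse_classifier_verdict(raw: str) -> bool:
    """Map the small model's one-word answer to a boolean."""
    return raw.strip().upper().startswith("YES")

async def needs_full_response(aclient, user_text: str) -> bool:
    # Model name and prompt are illustrative assumptions.
    resp = await aclient.responses.create(
        model="gpt-4o-mini",
        input=(
            "Does this message require a substantive reply? "
            f"Answer YES or NO only.\n\n{user_text}"
        ),
    )
    return parse_classifier_verdict(resp.output_text)
```

The verdict parser is deliberately forgiving (case, whitespace, trailing words), since small models don't always follow a one-word instruction exactly.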
It’s usually more efficient to handle the decision programmatically whenever possible.