Realtime API cost anomaly: disproportionate charges on audio input

Hello,

We are using the Realtime API (gpt-4o-realtime-preview-2024-12-17). When reviewing the usage dashboard for a 15-minute session, I noticed that the cost for audio input was $5.28, while the cost for audio output was $0.65.

This seems inconsistent with expected behavior. During the session, I used very short input sentences, while the model responded with longer outputs. According to the Realtime pricing (per 1M tokens), audio input is billed at $40 and audio output at $80. With fewer input tokens and a lower input price, the output cost should be higher than the input cost.

By this logic, the cost for audio input in this session should be lower than $0.65, not $5.28.

The figures below are taken from the OpenAI usage dashboard and cover the whole 15-minute session, with the associated costs.

realtime api | gpt-4o-realtime-preview-2024-12-17 audio, input
Cost: $5.28

realtime api | gpt-4o-realtime-preview-2024-12-17 audio, cached input
Cost: $0.57

realtime api | gpt-4o-realtime-preview-2024-12-17 audio, output
Cost: $0.65

realtime api | gpt-4o-realtime-preview-2024-12-17 text, input
Cost: $0.43

realtime api | gpt-4o-realtime-preview-2024-12-17 text, cached input
Cost: $0.43

realtime api | gpt-4o-realtime-preview-2024-12-17 text, output
Cost: $0.05

gpt-4o-transcribe audio, input
Cost: <$0.01

gpt-4o-transcribe text, input
Cost: <$0.01

gpt-4o-transcribe text, output
Cost: <$0.01

text-embedding-3-small
Cost: <$0.01

For reference, we are in a quiet environment and manually activate the microphone by holding down a button when speaking.

Has anyone experienced something similar and figured out what was going on?

Thanks

[Update]
@OpenAI_Support @gokulraya @jeffsharris

We changed the Realtime version from ‘gpt-4o-realtime-preview-2024-12-17’ to ‘gpt-4o-realtime-preview-2025-06-03’.

For a 15-minute session, the cost for audio input is now $0.33, the cost for audio cached input is $0.80, and the cost for audio output is $0.80.

It still doesn’t match what is announced in the Realtime pricing.

First, I don’t know why the usage of audio cached input is higher than the audio input.

Then, regarding OpenAI’s announcement:
Audio input is priced at $100 per 1M tokens […] This equates to approximately $0.06 per minute of audio input.
https://openai.com/index/introducing-the-realtime-api/

Currently the audio input pricing is $40 per 1M tokens, approximately $0.024 per minute of audio input.

For audio cached input, pricing is currently $2.50 per 1M tokens, approximately $0.0015 per minute of audio cached input.

The audio input cost is $0.33, which corresponds to approximately 13.8 minutes of speech. However, this is not realistic, as I did not speak for 13.8 minutes during the 15-minute session.

The audio cached input cost is $0.80, which translates to approximately 533 minutes of audio. This is clearly not possible given the session duration.
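For reference, here is a minimal sketch of the arithmetic behind these figures. It assumes roughly 600 audio tokens per minute, which is what the announcement's numbers imply ($100 per 1M tokens is about $0.06 per minute). That rate and the function names are my own assumptions, not official values.

```python
# Assumption: ~600 audio tokens per minute, implied by the announcement
# ($100 per 1M tokens ~= $0.06 per minute). Not an official constant.
TOKENS_PER_MINUTE = 600

def cost_per_minute(price_per_million: float) -> float:
    """Dollar cost of one minute of audio at the given price per 1M tokens."""
    return price_per_million / 1_000_000 * TOKENS_PER_MINUTE

def minutes_for_cost(cost: float, price_per_million: float) -> float:
    """Minutes of audio that a given dollar cost would correspond to."""
    return cost / cost_per_minute(price_per_million)

# Audio input at $40 per 1M tokens, cached audio input at $2.50 per 1M tokens.
print(f"input:  ${cost_per_minute(40.0):.4f}/min")
print(f"cached: ${cost_per_minute(2.50):.4f}/min")

# Dashboard figures for the 15-minute session.
print(f"$0.33 of audio input  ~ {minutes_for_cost(0.33, 40.0):.1f} min")
print(f"$0.80 of cached input ~ {minutes_for_cost(0.80, 2.50):.0f} min")
```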

The audio output cost of $0.80 for the 15-minute session appears consistent.

Could someone help with this?

Thanks.

Hello,

In addition, we are observing unexpectedly high costs related to text input and text cached input in the Realtime API.

Here is a breakdown of our usage:

  • The instruction prompt for Realtime: 262 tokens
  • Function calling definition: 996 tokens
  • Text input exchanged during the 15-minute session: up to 5,000 tokens
  • Spoken input: up to 250 tokens

Maximum estimated usage: ~6,500 tokens

However, the usage reported on the dashboard is significantly higher:

Text input: $0.05 → approximately 10,000 tokens

Text cached input: $0.54 → approximately 216,000 tokens

This results in a total of approximately 226,000 tokens, which is far beyond our expected maximum of about 6,500.
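The comparison can be reproduced with the short sketch below. The per-1M-token prices ($5.00 for text input, $2.50 for cached text input) are the ones implied by the conversions above, so please treat them as assumptions. The variable and function names are only illustrative.

```python
# Assumed prices per 1M tokens, inferred from the conversions in this post
# ($0.05 -> ~10,000 tokens and $0.54 -> ~216,000 tokens).
TEXT_INPUT_PRICE = 5.00
TEXT_CACHED_INPUT_PRICE = 2.50

def tokens_for_cost(cost: float, price_per_million: float) -> int:
    """Approximate token count that a dollar cost corresponds to."""
    return round(cost / price_per_million * 1_000_000)

# Our own upper-bound estimate for the session.
expected_tokens = 262 + 996 + 5_000 + 250  # prompt + tools + text + speech

# What the dashboard costs imply.
billed_tokens = (
    tokens_for_cost(0.05, TEXT_INPUT_PRICE)
    + tokens_for_cost(0.54, TEXT_CACHED_INPUT_PRICE)
)

print("expected:", expected_tokens)
print("billed:  ", billed_tokens)
print(f"ratio:    {billed_tokens / expected_tokens:.1f}x")
```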

Could you please help us understand where this additional usage might be coming from, and whether this could be an error in token accounting?
@OpenAI_Support

Thank you in advance.

Would it be possible to get support from OpenAI on this?

Thanks!

Hello,

Anyone from OpenAI to help?

This is a key topic, as it relates to usage and billing.

Thanks

I think this is just a misunderstanding of how the technology works.

Despite being “realtime”, the generation of a response to you is turn-based.

A server-side message history is maintained and appended to with every new generation, whether triggered by sending an API event after sending to the buffer, or triggered by the end of voice activity detection. You are not given a cost-management mechanism; the history just grows and grows.

The cached figure means you are being billed again for input that was already seen in a previous response generation. To respond appropriately to the latest turn, the AI model has to be passed the entire conversation again.

Example input to the model being billed:

Turn 1:

user: “Please permanently talk like an Australian”

Turn 2:

user: “Please permanently talk like an Australian” (cached)
ai: “G’day mate, dinkum wallaby on the barbie. Let’s crack on!”
user: “Not a stereotype, an accurate accent.”
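To put rough numbers on that growth, here is a small simulation. It is only a sketch under simplifying assumptions: every response is billed for the whole history, anything that was part of an earlier response's input counts as cached, and the previous AI output plus the new user message count as fresh input. Real cache hits are not guaranteed to be that clean, and the function name and token counts are made up for illustration.

```python
def simulate_billed_input(turns, fixed_prefix=0):
    """Return (fresh, cached) input tokens billed across all turns.

    turns: list of (user_tokens, assistant_tokens), one pair per turn.
    fixed_prefix: instructions and tool definitions, part of every input.
    Assumption: tokens already sent as input to an earlier response are
    billed as cached; everything else in the context is billed as fresh.
    """
    fresh = cached = 0
    seen = 0                # tokens that were input to a previous response
    pending = fixed_prefix  # tokens in context not yet billed as input
    for user_tokens, assistant_tokens in turns:
        pending += user_tokens
        cached += seen       # the whole earlier input is billed again
        fresh += pending     # last AI output + the new user message
        seen += pending
        pending = assistant_tokens  # becomes input on the next turn
    return fresh, cached

# Hypothetical session: 20 turns, 10-token questions, 50-token answers.
fresh, cached = simulate_billed_input([(10, 50)] * 20)
print("fresh input tokens: ", fresh)
print("cached input tokens:", cached)
```

Even with short user turns, the cached total grows roughly with the square of the number of turns, because every new response is billed for everything said before it.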

Hello @_j,
Thank you for your response.

Even considering your explanation, the audio input cost of $0.33 seems high compared to the actual spoken content during the session. The same applies to the cached audio input cost of $0.80.

Additionally, the costs for text input and cached text input are also higher than expected.

We believe there may be an issue with the current calculation of Realtime API usage, especially when compared to the information published by OpenAI in their official announcement: https://openai.com/index/introducing-the-realtime-api