Estimate the cost of 1 minute of Realtime API usage

Can someone help me calculate the cost of using the Realtime API for 1 minute? A rough estimate is fine. Assume a normal conversation.

On average it is about 6 cents per minute of audio input and 24 cents per minute of audio output.
https://openai.com/api/pricing/


That does not account for the fact that every triggered response adds audio to the chat history, so the audio input grows each time the AI responds, which can happen several times within a minute. You pay again and again for audio that was already sent and spoken earlier.
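To make the compounding concrete, here is a minimal sketch, not an official calculator. It uses the per-minute figures quoted above (6 and 24 cents) and assumes that on every turn the whole audio history, including the AI's earlier speech, is billed again at the input rate. The function name and parameters are made up for illustration.

```python
def estimate_session_cost(user_secs_per_turn, ai_secs_per_turn, turns,
                          input_rate_per_min=0.06, output_rate_per_min=0.24):
    """Rough cost in dollars when each response re-bills the full audio history.

    Assumption: prior user AND assistant audio is re-sent as input on every turn.
    Rates default to the per-minute averages quoted in this thread.
    """
    history_secs = 0.0        # audio accumulated in the conversation so far
    billed_input_secs = 0.0   # total seconds billed at the input rate
    billed_output_secs = 0.0  # total seconds billed at the output rate
    for _ in range(turns):
        history_secs += user_secs_per_turn       # new user audio joins the history
        billed_input_secs += history_secs        # the whole history is billed as input
        billed_output_secs += ai_secs_per_turn   # new AI audio is billed as output
        history_secs += ai_secs_per_turn         # AI audio joins the history too
    return (billed_input_secs / 60) * input_rate_per_min \
        + (billed_output_secs / 60) * output_rate_per_min


# One minute of audio as a single exchange vs. six short exchanges.
single = estimate_session_cost(30, 30, turns=1)
chatty = estimate_session_cost(5, 5, turns=6)
print(f"1 turn: ${single:.2f}, 6 turns: ${chatty:.2f}")
```

Under these assumptions the same minute of audio costs about $0.15 as one exchange and about $0.30 as six short exchanges, before any caching discount.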


There are multiple factors to this.

  1. The price per minute always depends on the number of turns in the conversation. As the number of turns increases, the price increases too, because the conversation history gets bigger and the model consumes all of it again with each new turn.
  2. Prompt caching should be included in the equation as well, because all or most tokens from previous turns will hit the cache (both audio and text, whether they were originally input or output).
  3. The context window of the model is 128k tokens, but the actual output window is only 4096 tokens. The average number of input tokens, the average number of output tokens (which are priced differently), and the overall ratio between them are a major factor.
  4. Function calls should also be considered, because function definitions can be rather large depending on the use case. If the model calls them multiple times, or calls multiple functions per turn (which might not be supported at this time, don't quote me on that), token usage will increase significantly.
  5. Tokenization of text (and maybe audio, not known at this time) has different efficiency depending on the language of the input. For the tokenizers used in OpenAI models, English is the most efficient language (i.e. the fewest tokens per amount of text). This means that if text (and maybe audio) is tokenized very inefficiently, the number of tokens, and with it the price of a turn or session, will increase.
  6. Model behavior itself can produce extra turns (verifying user input, making sure the assistant heard the user correctly, etc.), which will increase usage.
  7. The model can "glitch", hallucinate, call functions erroneously, or produce an enormous amount of output, and users can attempt prompt injection, force the model out of bounds, etc., which is also a major cost factor.
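Several of these factors (history growth, caching, the input/output ratio, and large function definitions) can be combined in a rough token-based model. The sketch below is only an illustration: `Rates`, `session_cost`, and `cache_hit_ratio` are hypothetical names, the rate values in the usage example are placeholders and not published prices, and the assumption that everything from earlier turns can hit the cache follows point 2 above rather than any documented guarantee.

```python
from dataclasses import dataclass


@dataclass
class Rates:
    """Price per token for each billing class (placeholder values, not real prices)."""
    input_per_token: float
    cached_input_per_token: float
    output_per_token: float


def session_cost(turns, rates, fixed_prompt_tokens=0, cache_hit_ratio=1.0):
    """Estimate session cost from a list of (new_input_tokens, output_tokens) turns.

    fixed_prompt_tokens: instructions plus function definitions sent on every turn.
    cache_hit_ratio: assumed share of previously seen tokens billed at the cached rate.
    """
    history = fixed_prompt_tokens  # tokens that will be re-sent on the next turn
    previously_seen = 0            # tokens that appeared in an earlier turn
    total = 0.0
    for new_input, output in turns:
        prompt = history + new_input                # whole history plus the new input
        cached = previously_seen * cache_hit_ratio  # portion billed at the cached rate
        fresh = prompt - cached                     # portion billed at the full input rate
        total += (cached * rates.cached_input_per_token
                  + fresh * rates.input_per_token
                  + output * rates.output_per_token)
        history = prompt + output                   # the response joins the history
        previously_seen = history
    return total


# Placeholder rates chosen only to keep the arithmetic easy to follow.
rates = Rates(input_per_token=1.0, cached_input_per_token=0.5, output_per_token=2.0)
turns = [(10, 5), (10, 5)]
print(session_cost(turns, rates))                        # with full cache hits
print(session_cost(turns, rates, cache_hit_ratio=0.0))   # with no cache hits
```

Plugging in real per-token prices and measured token counts from your own sessions should give a more useful estimate than any per-minute average.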

My experience after hours of conversation so far is about $1 per minute.
