New Realtime API voices and cache pricing

Are you using a third-party TypeScript library to interact with the Realtime API? It may hard-code the values that were allowed when it was written. OpenAI's own library has the same fault: by validating inputs on the client, it goes out of date the moment the API accepts a new value.

You cannot change the voice once a session has been initiated. The voice setting is likely equivalent to an initial prompt injection that makes the AI follow a response pattern.

You can easily circumvent this by coercing the type.

type asdf = "abc" | "def";
const a = "abcd" as asdf;

No fault here. The union type makes it easy to see that asdf is meant to accept only a fixed subset of strings, instead of crossing your fingers. It is just a very simplified enum.

This is clearly a TypeScript error, which has everything to do with the client you chose to use, and has nothing to do with the actual API itself.

@jeffsharris, can you provide the calculations for this example? Do we have to calculate tokens net of cache?

{
    "total_tokens": 3309,
    "input_tokens": 3059,
    "output_tokens": 250,
    "input_token_details": {
        "text_tokens": 1977,
        "audio_tokens": 1082,
        "cached_tokens": 2880,
        "cached_tokens_details": {
            "text_tokens": 1856,
            "audio_tokens": 1024
        }
    },
    "output_token_details": {
        "text_tokens": 53,
        "audio_tokens": 197
    }
}

input_text_price = (text_input - text_cached_input) * 5e-6 + text_cached_input * 2.5e-6

input_audio_price = (audio_input - audio_cached_input) * 100e-6 + audio_cached_input * 20e-6

output_text_price = output_text * 20e-6

output_audio_price = output_audio * 200e-6

Xe-6 is X/1,000,000, i.e. the price per million tokens converted to a price per token.

I genuinely don’t know what to explain here

Edit: ChatGPT 4o has worked it out, and it looks correct to me.

# Given token details
text_input = 1977
text_cached_input = 1856
audio_input = 1082
audio_cached_input = 1024
output_text = 53
output_audio = 197

# Pricing rates
text_input_rate = 5e-6
text_cached_input_rate = 2.5e-6
audio_input_rate = 100e-6
audio_cached_input_rate = 20e-6
output_text_rate = 20e-6
output_audio_rate = 200e-6

# Calculate individual costs
input_text_price = ((text_input - text_cached_input) * text_input_rate) + (text_cached_input * text_cached_input_rate)
input_audio_price = ((audio_input - audio_cached_input) * audio_input_rate) + (audio_cached_input * audio_cached_input_rate)
output_text_price = output_text * output_text_rate
output_audio_price = output_audio * output_audio_rate

# Calculate total price
total_price = input_text_price + input_audio_price + output_text_price + output_audio_price

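# Report each component and the total, rounded to 6 decimal places
for label, price in [
    ("Input Text Price", input_text_price),
    ("Input Audio Price", input_audio_price),
    ("Output Text Price", output_text_price),
    ("Output Audio Price", output_audio_price),
    ("Total Price", total_price),
]:
    print(f"{label}: ${price:.6f}")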
  • Input Text Price: $0.005245
  • Input Audio Price: $0.02628
  • Output Text Price: $0.00106
  • Output Audio Price: $0.0394

Total Price: $0.071985
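
If you would rather price a turn straight from the usage object, here is a minimal sketch. It assumes the per-token rates quoted in this thread and the field names from the JSON in the question; `RATES` and `realtime_turn_cost` are names I made up for illustration, not part of any SDK. Note that in the example the input `text_tokens` and `audio_tokens` add up to `input_tokens` (1977 + 1082 = 3059), so they already include the cached portion, which is why the cached tokens are subtracted before applying the full rate.

```python
# USD per token, taken from the formulas in this thread (assumed, check current pricing).
RATES = {
    "text_in": 5e-6,
    "text_in_cached": 2.5e-6,
    "audio_in": 100e-6,
    "audio_in_cached": 20e-6,
    "text_out": 20e-6,
    "audio_out": 200e-6,
}


def realtime_turn_cost(usage: dict) -> dict:
    """Price a single response from its usage object (hypothetical helper)."""
    inp = usage["input_token_details"]
    cached = inp.get("cached_tokens_details", {})
    out = usage["output_token_details"]

    cached_text = cached.get("text_tokens", 0)
    cached_audio = cached.get("audio_tokens", 0)

    # Input counts include cached tokens, so bill the uncached remainder
    # at the full rate and the cached part at the discounted rate.
    cost = {
        "input_text": (inp["text_tokens"] - cached_text) * RATES["text_in"]
        + cached_text * RATES["text_in_cached"],
        "input_audio": (inp["audio_tokens"] - cached_audio) * RATES["audio_in"]
        + cached_audio * RATES["audio_in_cached"],
        "output_text": out["text_tokens"] * RATES["text_out"],
        "output_audio": out["audio_tokens"] * RATES["audio_out"],
    }
    cost["total"] = sum(cost.values())
    return cost


# The usage object from the question above.
usage = {
    "total_tokens": 3309,
    "input_tokens": 3059,
    "output_tokens": 250,
    "input_token_details": {
        "text_tokens": 1977,
        "audio_tokens": 1082,
        "cached_tokens": 2880,
        "cached_tokens_details": {"text_tokens": 1856, "audio_tokens": 1024},
    },
    "output_token_details": {"text_tokens": 53, "audio_tokens": 197},
}

cost = realtime_turn_cost(usage)
for name, value in cost.items():
    print(f"{name}: ${value:.6f}")
```

Summing `cost["total"]` over every response in a call gives the per-call figure, which is the number that matters when comparing against another pipeline.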

That's what I thought it would be; I just wanted Jeff to confirm. Unfortunately, that's just one turn of the conversation, so the API is still very expensive. An entire call usually costs that much using Deepgram/GPT-4o. I would estimate the cost to be about 10x that of our current solution.

I have shared my observations/thoughts on how to assess pricing for this API in this post