Audio Model Pricing is Unclear

According to the documentation, Whisper is $0.006/minute while TTS is $15.00 per 1M characters. It would be nice if there were an estimated cost per minute, like the new 4o-mini audio models have, so current Whisper users have a better idea of the potential cost savings.
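For reference, the two flat rates quoted above can be turned into a quick estimate. This is only a rough sketch: the function names are made up for illustration, and it assumes Whisper cost scales pro rata with audio duration and that TTS is charged per input character with no rounding.

```python
# Rates as quoted in this post; check the pricing page for current values.
WHISPER_USD_PER_MINUTE = 0.006
TTS_USD_PER_MILLION_CHARS = 15.00


def whisper_cost(audio_seconds: float) -> float:
    """Estimated Whisper cost, assuming billing is pro rata by duration."""
    return audio_seconds / 60 * WHISPER_USD_PER_MINUTE


def tts_cost(text: str) -> float:
    """Estimated TTS cost, assuming billing is per input character."""
    return len(text) / 1_000_000 * TTS_USD_PER_MILLION_CHARS


if __name__ == "__main__":
    # Ten minutes of audio and a 1,000-character script.
    print(f"Whisper, 10 min: ${whisper_cost(600):.4f}")
    print(f"TTS, 1,000 chars: ${tts_cost('a' * 1000):.4f}")
```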

Also, aside from the prompt, what else can contribute toward the input token cost of 4o-transcribe and 4o-mini-transcribe?

I am a bit confused too. It seems to use a concept of audio tokens, which does not convert directly into “minutes”. I couldn’t find any further information, but the API output will tell you how many tokens were consumed.

What I can say is that, all summed up, the cost is very low. You can check it on your usage dashboard.

Basically, for TTS you have the prompt for instructions, which follows the usual text token measure, plus the audio tokens for the generated audio.
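To make that breakdown concrete, here is a small sketch of how the two parts would add up, and how to back out an effective per-minute figure from it. The function names and the per-million-token rates are placeholders I made up, not real prices; plug in the token counts from your own API output and the rates from the pricing page.

```python
def token_cost(text_tokens: int, audio_tokens: int,
               usd_per_million_text: float,
               usd_per_million_audio: float) -> float:
    """Text tokens (the instructions prompt) plus audio tokens, in USD."""
    total = (text_tokens * usd_per_million_text
             + audio_tokens * usd_per_million_audio)
    return total / 1_000_000


def cost_per_minute(total_usd: float, audio_seconds: float) -> float:
    """Effective USD per minute of audio for one request."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return total_usd / (audio_seconds / 60)


if __name__ == "__main__":
    # Placeholder numbers only, NOT actual pricing or real usage.
    usd = token_cost(text_tokens=1000, audio_tokens=2000,
                     usd_per_million_text=1.0, usd_per_million_audio=10.0)
    print(f"request cost: ${usd:.4f}")
    print(f"per minute:   ${cost_per_minute(usd, audio_seconds=120):.4f}")
```

Comparing that per-minute figure against the flat Whisper rate is probably the easiest way to see the savings on your own audio.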

The graphic seems pretty complete.

Whisper-1 is billed strictly per minute of audio. Whether you send fast-talking Micro Machines or droll Prairie Home Companion, you get the same cost.

The last column IS the estimated cost of operating the modality-transformation GPT models under discussion.

You could certainly tack on extra expense with maximum prompting for voice tone: that text is billed as input but is never spoken.