I'm not 100% sure how the pricing works, but for gpt-4o-mini-tts:
It's $0.60 per million input tokens, and that covers both the text and the instructions.
It's $12 per million tokens for the audio out, which I assume is counted from the audio tokens generated.
It works out to about 1.5 cents per minute of audio, although that varies widely for me. On short texts I get charged roughly double that, but on my minute-long tests it was just about spot on.
Big gotcha: sometimes the audio generation can go off the rails. The same roughly 1-minute text block I gave it once ran for over 3 minutes, and the last 2.5 minutes were silence. That was a big screwup on the API side, and of course it still charged me for 3 minutes of audio.
The input text tokens are quite inexpensive: for the input alone to reach $0.015 (the cost of just 1 minute of output), the request would have to contain around 25k instruction tokens ($0.015 / $0.60 per million).
So basically, most of what you pay is for the output.
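If it helps, here's a back-of-the-envelope calculator based on the numbers above. Treat the rates and the audio-tokens-per-minute figure as my own estimates (roughly $0.015/min divided by $12 per million tokens), not official constants:

```python
# Rough cost estimate for gpt-4o-mini-tts, using the rates quoted above.
TEXT_IN_PER_M = 0.60         # $ per 1M input tokens (text + instructions)
AUDIO_OUT_PER_M = 12.00      # $ per 1M audio output tokens
AUDIO_TOKENS_PER_MIN = 1250  # assumption: $0.015/min divided by $12 per 1M tokens

def estimate_cost(input_tokens: int, audio_minutes: float) -> float:
    """Estimate the dollar cost of one TTS request."""
    input_cost = input_tokens / 1_000_000 * TEXT_IN_PER_M
    output_cost = audio_minutes * AUDIO_TOKENS_PER_MIN / 1_000_000 * AUDIO_OUT_PER_M
    return input_cost + output_cost

# ~1,000 characters of text (~250 tokens) producing ~1 minute of audio:
print(estimate_cost(250, 1.0))  # roughly $0.015, almost all of it from the output
```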
I'm not affiliated with OpenAI; I'm just speaking from my experience as a user.
If you want harder proof, there's no need to take my word for it.
You can easily check it yourself: use the Playground to convert a text of about 1 minute (roughly 1,000 characters) and then look at the costs dashboard.
It will tell you how many input text tokens and how many output audio tokens were used; compare that against the audio file generated and you can reach your own conclusions.
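If you prefer to run the same test from code instead of the Playground, here's a minimal sketch using the official openai Python SDK. The voice, instructions, and file name are placeholders, and I'm using the `instructions` parameter the way the TTS docs describe it, so double-check against the current API reference:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

text = "Put roughly 1,000 characters of your own text here..."

# Generate ~1 minute of speech, then check the costs dashboard to see
# how many input text tokens and output audio tokens this request used.
with client.audio.speech.with_streaming_response.create(
    model="gpt-4o-mini-tts",
    voice="alloy",                                  # placeholder voice
    input=text,
    instructions="Speak in a calm, neutral tone.",  # instructions count as input tokens too
) as response:
    response.stream_to_file("test_minute.mp3")
```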
ps: The costs dashboard has a CSV export that gives you the costs down to the exact fraction of a cent.
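If you want to total those up programmatically, something like this works on the export; the column names here are hypothetical, so adjust them to whatever header your CSV actually has:

```python
import csv
from collections import defaultdict

# Sum the exported costs per line item. "line_item" and "amount" are
# placeholder column names -- check the header row of your own export.
totals = defaultdict(float)
with open("costs_export.csv", newline="") as f:
    for row in csv.DictReader(f):
        totals[row["line_item"]] += float(row["amount"])

for line_item, amount in sorted(totals.items()):
    print(f"{line_item}: ${amount:.6f}")
```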
I did, but didn't notice much of a difference. The whisper-1 model is already pretty good, but it's helpful to have different models for when a transcription goes wrong; sometimes one of the other models works better.
Short audios in particular tend to come out badly if the recording quality isn't great, so having alternatives is always nice.
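For what it's worth, here's a rough sketch of that "fall back to another model" idea. The model line-up and the sanity check are my own assumptions; swap in whichever transcription models you actually have access to:

```python
from openai import OpenAI

client = OpenAI()

# Models to try in order; adjust to the transcription models available to you.
MODELS = ["gpt-4o-transcribe", "gpt-4o-mini-transcribe", "whisper-1"]

def transcribe(path: str) -> str:
    """Try each model in turn and return the first non-empty transcription."""
    last_error = None
    for model in MODELS:
        try:
            with open(path, "rb") as audio:
                result = client.audio.transcriptions.create(model=model, file=audio)
            if result.text.strip():   # crude sanity check for empty output
                return result.text
        except Exception as exc:      # keep the last failure for the error message
            last_error = exc
    raise RuntimeError(f"All transcription models failed: {last_error}")

print(transcribe("short_clip.mp3"))
```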
ps: I forgot to add that, according to the docs, transcription of less popular languages (Hindi, for example) has improved a lot more, but since I don't speak any of them I can't say for sure.