Whisper - opaque charges?

I am using Whisper, and from my calculations, I’m being overcharged quite a bit (about 25% more than what I am sending). I noticed this and then I had an idea - I sped up the files using ffmpeg before I sent them to the API. Not sure if this is explicitly allowed, but I scoured the ToS and could find nothing prohibiting it. Anyway, it was only a test with a lowish volume of audio. The transcription accuracy is almost the same with a 2x speedup of the input file but, astonishingly, I am being charged the same as if I didn’t speed them up.

We get no indication of pricing back from the API, only tokens returned. Does the Whisper API actually charge on a per-token basis instead of a per-minute basis? Is there any visibility on this?

Whisper is $0.006 / minute (rounded to the nearest second)

So, that’s $0.0001 / second.
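The per-second arithmetic above can be sketched as a small helper. This is only a sketch, assuming the rate and the nearest-second rounding quoted in this thread; `estimateWhisperCost` is a hypothetical name, not part of any OpenAI SDK, and doing the math in integer micro-dollars is just a choice to avoid floating-point drift.

```javascript
// Rate quoted in this thread: $0.006 per minute, i.e. 6000 micro-dollars.
const MICRO_DOLLARS_PER_MINUTE = 6000;

// Hypothetical helper: estimate the charge for one audio file.
// Assumes the duration is rounded to the nearest whole second before billing.
function estimateWhisperCost(durationSeconds) {
  const billedSeconds = Math.round(durationSeconds);
  // 6000 micro-dollars per minute works out to 100 per billed second.
  const microDollars = (billedSeconds * MICRO_DOLLARS_PER_MINUTE) / 60;
  return { billedSeconds, dollars: microDollars / 1e6 };
}

console.log(estimateWhisperCost(3600)); // one hour of audio
console.log(estimateWhisperCost(30.4)); // rounds down to 30 billed seconds
```

Comparing this estimate against the daily total on the usage page is one way to spot a mismatch, keeping in mind that the page groups requests by UTC date.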

If you get the same accuracy at 2x speed, I suppose that’s a clever way to cut your costs in half.

Hell, cut all your audio down to ~10s chunks, ensuring the length is such that it always rounds down. Don’t forget to skim through and trim the silence between words.

Could probably get your costs down to about 1/3 of what they would normally be.

That’s my point though. I’m sending in half the audio and I’m still being charged more than if the files were at 1x speed.

I’d need to see more evidence to verify that.

How can I even prove it if I can’t find what I’m being charged for a request? There is no itemized bill available.

It is pretty apparent from the audio quality drop that a request with a different speed does not change the way audio is generated by AI, that the audio is just passed through a time-slicing pitch/tempo changer to get to a new speed without pitch alteration.

Pitch isn’t altered in the files I send in. Think speeding up like a voice note in whatsapp. So, you’re saying that OpenAI charges essentially by token then and not by second as stated in its docs?

Sorry, I was thinking of the text-to-speech output, which uses overlapping time-slicing to lengthen or shorten the output audio you receive, with artifacts.

Speeding audio up seems a decent way to save some money, but I would compromise by allowing some increase in pitch so that there is less choppy time-slicing going on.

I investigated the price before. And just did again weeks ago. Send exactly an hour, get billed for an hour. ($0.006 / minute x 60 minutes = $0.36)


The request powering the bar graph for cost, in cents, with precision to fractions of a cent:


Note that these are combined with other requests for a UTC date cutoff, and may have a delay in showing up.

How are you speeding up the audio with ffmpeg?

Can you share a sped up sample?

Sure, I am speeding the files up with the ffmpeg library in node:

    .on("end", () => callback(null, outputFilePath))
    .on("error", (err) => callback(err))

Which I believe is the same as this command:

    ffmpeg -i inputFilePath -filter:a "atempo=2" outputFilePath
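For anyone who wants to reproduce this without the wrapper library, here is a rough sketch that runs the same command through Node's built-in `child_process`. It is not the code used in this thread: `buildAtempoArgs` and `speedUp` are hypothetical names, the 0.5 to 2.0 limit is a conservative assumption about the `atempo` filter, and `-y` is an addition to the command so that ffmpeg overwrites the output without prompting.

```javascript
const { execFile } = require("node:child_process");

// Hypothetical helper: build the ffmpeg arguments for a tempo change.
// atempo changes speed without shifting pitch. The 0.5 to 2.0 range is a
// conservative assumption made here, so anything outside it is rejected.
function buildAtempoArgs(inputFilePath, outputFilePath, factor = 2) {
  if (!(factor >= 0.5 && factor <= 2)) {
    throw new RangeError(`atempo factor out of range: ${factor}`);
  }
  // "-y" (overwrite without asking) is not in the original command.
  return ["-y", "-i", inputFilePath, "-filter:a", `atempo=${factor}`, outputFilePath];
}

// Run ffmpeg and report the output path through a Node-style callback.
function speedUp(inputFilePath, outputFilePath, factor, callback) {
  const args = buildAtempoArgs(inputFilePath, outputFilePath, factor);
  execFile("ffmpeg", args, (err) => callback(err, err ? undefined : outputFilePath));
}
```

A call such as `speedUp("before.m4a", "after.m4a", 2, cb)` (file names made up for illustration) requires `ffmpeg` to be on the PATH.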

Here’s the before file:

And the sped-up file, which gets sent to Whisper:


This is a test file I’m going to record to test Whisper’s charges, um, I don’t know what to say, but I’ll, uh, just say what I see. Actually, you know what? I’m gonna get a bottle of water, and I’m gonna open the water, I’m gonna sit back down, um, yeah, should we go for a minute? Yeah, I’ll go for exactly a minute, see how that works out. It’ll be an M4A file, I believe. Um, I am looking at the clock, oh, it’s one minute past eight, so that means I’m late, but it’s okay. Um, yeah, we have eight seconds left, uh, hopefully I stopped exactly on time, and I’m gonna finish now.

I tested the 2x sped-up file you provided for transcription, and I can confirm that the API bills only for the duration of the sped-up file (rounded to the nearest second).

Here are the details of the original file:

  • File Type: MP3
  • Duration: 60.10 seconds
  • Sample Rate: 48000 Hz
  • Channels: 1 (Mono)
  • Bit Rate: 128.069 kbps

The sped-up file has the following:

  • File Type: MP3
  • Duration: 30.05 seconds
  • Sample Rate: 48000 Hz
  • Channels: 1 (Mono)
  • Bit Rate: 64.088 kbps

The 2x sped-up file provided by you was the only one I transcribed today and here’s the screenshot of usage:

Hence if you upload the 2x sped-up file, you’ll only be billed for half the duration.
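Since the API response carries no price, one way to audit what is actually being sent is to measure each file's duration right before upload and log it. The sketch below assumes `ffprobe` (shipped with ffmpeg) is on the PATH; `parseDurationSeconds` and `probeDuration` are hypothetical helper names.

```javascript
const { execFile } = require("node:child_process");

// Parse the bare number ffprobe prints for format=duration, e.g. "30.048000\n".
function parseDurationSeconds(stdout) {
  const seconds = Number.parseFloat(stdout.trim());
  if (!Number.isFinite(seconds)) {
    throw new Error(`unexpected ffprobe output: ${JSON.stringify(stdout)}`);
  }
  return seconds;
}

// Ask ffprobe for the container duration of the file about to be uploaded.
function probeDuration(filePath, callback) {
  const args = ["-v", "error", "-show_entries", "format=duration",
                "-of", "default=noprint_wrappers=1:nokey=1", filePath];
  execFile("ffprobe", args, (err, stdout) => {
    if (err) return callback(err);
    let seconds;
    try {
      seconds = parseDurationSeconds(stdout);
    } catch (parseErr) {
      return callback(parseErr);
    }
    callback(null, seconds);
  });
}
```

If the logged durations come out close to the original lengths rather than half of them, the unmodified file is probably the one being uploaded.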


Thank you very much for testing that for me. I have looked extensively at my code and tested what is being sent to Whisper, but I must be making a mistake somewhere. Thanks again.
