API model whisper - Real cost

Hello everybody.
We spent several days testing the whisper model to transcribe MP3 files to SRT.
We also collected some stats:

Total files: 734
Total time: 2,333,349 seconds (648:09:09)
Estimated cost: 233.34 $

So far, we have spent 397.08 $
So the cost is not 0.006 $ / minute; the real cost works out to about 0.010 $ per minute.
A big difference.
Has anyone else seen the same results?
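
For anyone who wants to re-check these figures, here is a minimal sketch in Python. The function and variable names are illustrative only and do not come from any OpenAI tool. It recomputes the estimate from the total seconds and derives the per-minute rate implied by the amount actually spent.

```python
# Figures quoted in the post above.
TOTAL_SECONDS = 2_333_349
LIST_PRICE_PER_MINUTE = 0.006   # USD, from the pricelist
AMOUNT_SPENT = 397.08           # USD, as reported


def estimated_cost(seconds: float, price_per_minute: float) -> float:
    """Cost if billing is strictly proportional to audio duration."""
    return seconds / 60 * price_per_minute


def effective_rate(spent: float, seconds: float) -> float:
    """Price per minute implied by what was actually charged."""
    return spent / (seconds / 60)


if __name__ == "__main__":
    est = estimated_cost(TOTAL_SECONDS, LIST_PRICE_PER_MINUTE)
    rate = effective_rate(AMOUNT_SPENT, TOTAL_SECONDS)
    print(f"estimated: {est:.4f} USD")
    print(f"effective: {rate:.5f} USD per minute")
```

The effective rate comes out at roughly 1.7 times the list price, which is the gap being asked about.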

P.S. for website developers:
It’s not a good idea to use $ as a special character to format text :rofl:
I recommend using other methods already widely used for text formatting, so users do not have to escape it.

Input: 397.08 $

Correct to US Dollar Amount: $397.08

Using “formatting” correctly (TeX in mathjax). Or you can just write $333.33 normally.


How long are the chunks you are sending, and do you have any overlap between the sections produced by silence detection? While they could round up to the second when billing, you could also do an aggressive “round down” by not sending silence.

Hi _j, thanks for explaining the $ behaviour.

The uploaded files are 40 to 70 minutes long.
We just use the whisper model with the openai php class.

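    try {
        // Upload one audio file and ask for the detailed JSON response.
        $response = $this->client->audio()->transcribe([
            'model' => 'whisper-1',
            'file' => fopen($to_upload, 'r'),
            'response_format' => 'verbose_json',
        ]);
    } catch (\Exception $e) {
        error_log($e->getMessage()); // error handling was not shown in the original snippet
    }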

Reading the API docs, we didn’t see any custom params to skip silences, music or other parts, so the cost will be 0.006 for each minute, including silence, music or bad parts.
Is that correct?
If yes, I don’t understand why the cost is almost doubled. :face_with_raised_eyebrow:

The OpenAI model is inherently a 30 second max model, and OpenAI does magic techniques to blend these inferences together themselves. If you send 600 seconds exactly and get billed for more, it might be that they are billing for actual model cost including their own overlapping techniques.

You can use silence detection algorithms to both split chunks so you do your own management of keeping pieces below 30 seconds, and also to strip out where you are otherwise billed for audio with no speech to transcribe.
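
A rough sketch of that idea in Python, working on raw sample values only. It assumes the audio has already been decoded to a list of amplitudes between -1.0 and 1.0 (decoding is not shown). `find_speech_spans` is a hypothetical helper name, and the threshold and minimum-silence values are placeholders to tune, not recommended settings.

```python
from typing import List, Tuple


def find_speech_spans(samples: List[float], sample_rate: int,
                      threshold: float = 0.02,
                      min_silence_s: float = 0.5) -> List[Tuple[float, float]]:
    """Return (start_s, end_s) spans whose amplitude reaches `threshold`.

    Quiet gaps lasting at least `min_silence_s` are treated as silence
    and split the audio into separate spans.
    """
    min_gap = int(min_silence_s * sample_rate)
    spans = []
    start = None      # index where the current span began
    last_loud = None  # index of the most recent loud sample
    for i, s in enumerate(samples):
        if abs(s) >= threshold:
            if start is None:
                start = i
            elif i - last_loud > min_gap:
                # The quiet gap was long enough: close the previous span.
                spans.append((start / sample_rate, (last_loud + 1) / sample_rate))
                start = i
            last_loud = i
    if start is not None:
        spans.append((start / sample_rate, (last_loud + 1) / sample_rate))
    return spans
```

Only the returned spans would be cut out and uploaded, so the silent stretches between them are never sent.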

So the cost increases if the length of the mp3 file is over 30 seconds?
In that case we need to split every file into multiple chunks and then join the subtitles…
A lot of work.
Thanks _j, very useful.

That is just a theory for you to test.

However, chunk the audio, and you can batch it to dozens of API calls at the same time.
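
A minimal sketch of batching chunks in parallel, using Python threads. The `transcribe_chunk` argument stands in for whatever function performs the real API call, and `fake_transcribe` is a placeholder so the example runs without network access. The worker count is an arbitrary assumption and would need to respect your rate limits.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def transcribe_all(chunk_paths: List[str],
                   transcribe_chunk: Callable[[str], str],
                   max_workers: int = 8) -> List[str]:
    """Run `transcribe_chunk` on every path using a pool of threads.

    Results come back in the same order as `chunk_paths`, so the
    pieces can be joined afterwards.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(transcribe_chunk, chunk_paths))


def fake_transcribe(path: str) -> str:
    # Placeholder for the real API call; it only echoes the file name.
    return f"text of {path}"


if __name__ == "__main__":
    print(transcribe_all(["part1.mp3", "part2.mp3"], fake_transcribe))
```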


I thought I’d do a test to see if I can get overbilled. The price of whisper is $0.006 per minute, or $0.0001 per second (rounded to seconds per pricelist). 1000 seconds (16:40) would be $0.10.

So I edited an mp3 at the frame level to try to provoke even an incorrect rounding up: the file is 16:39.48 by the frames, and 16:39.262 when opened in an audio editor. If they round the second up, I should be charged exactly $0.1000. If they overbill, it will be more.
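
The two outcomes being tested can be written down as a small Python sketch. The function names are mine, and the rate is just $0.006 per minute expressed per second; dividing whole seconds by 10,000 avoids accumulating floating-point error.

```python
import math

SECONDS_PER_DOLLAR = 10_000  # $0.0001 per second, i.e. $0.006 per minute


def cost_rounded_up(seconds: float) -> float:
    """Bill every started second in full."""
    return math.ceil(seconds) / SECONDS_PER_DOLLAR


def cost_rounded_nearest(seconds: float) -> float:
    """Bill to the nearest whole second."""
    return round(seconds) / SECONDS_PER_DOLLAR


if __name__ == "__main__":
    for duration in (999.262, 999.48):
        print(duration, cost_rounded_up(duration), cost_rounded_nearest(duration))
```

Both measured durations give 1000 billed seconds when rounded up and 999 when rounded to the nearest second, so the test file can distinguish the two rules only if the usage page shows more than two decimal places.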

Transcription received:

The Feynman Lectures on Physics. This lecture was presented by Dr. Richard Feynman on October 20th, 1961 at the California Institute of Technology. Volume 1, Chapter 8, Motion. Section 8.1, Description of Motion. Well, this is Lecture 8, according to that, and the lecture this time is on motion. As time goes on, things change, and if we’re going to ever be able to find the laws of these changes, the least that we’re going to have to be able to do is to describe changes, to have some way to record them and register changes. Now, the simplest change to register is the apparent change in position of an object, a solid object, or something that has a mark on it. And that’s what we’re going to talk about here today. We’ll suppose that you can make some kind of a mark on this, a permanent mark on the object, and we want to discuss the motion of this little mark, or which I’ll call a point. To be less elegant and general, take an example of an automobile, and you take the center of it or the radiator cap, and you try to figure out or describe the fact that it moves and how it moves. Sounds like nothing. There are some subtleties, however. Take another example, you have a falling ball, and you talk about how, say, one point on the ball, say, the center of it, falls. Now, there are some changes which present more difficulty of description than the motion of an object or a point on an object. For example, the speed of drift of a cloud, which is drifting very slowly but rapidly forming or evaporating, is almost impossible to define if you think about it a while, if the thing is disappearing or forming. While you’re trying to measure how fast it’s moving, it’s not so easy to define. Another example of a kind of change that we find some difficulty in describing is, for example, the change of a woman’s mind. We have no simple way of analyzing that. 
But it is hoped, perhaps, that the cloud can be represented, described by many molecules, and perhaps then we can describe the motion of the cloud in principle by describing the motion of all the individual molecules. And likewise, perhaps even the changes in the mind have a parallel in change to the atoms inside the brain, but we don’t know that yet. At any rate, that’s the reason why we will begin with the motion of points. Perhaps you could think of them as atoms. But better to be more rough. It’s better to be more rough at the beginning and to say just some kind of small things. A small compared to the distance moved. For instance, if a car is going a hundred miles, and we want to describe it, we don’t have to worry about whether we’re talking about the front of the car or the back of the car. It may make a difference if the car’s turned around at the other end. There’ll be slight differences. But for rough purposes, we say the car. And in the same way, we’ll not worry about the fact that our points are not absolute points, etc. I’m not going to be extremely precise. Also, for a first look at this thing, we’re going to forget about the three dimensions of the world. We’ll just concentrate on moving in one direction, like on one road. And we’ll come right back to three dimensions after we figure out how to describe motion in one dimension. Now you say, this is all some kind of trivia. And you’ll see it is. How can we describe such a motion? Let’s say of a car. Nothing could be simpler. We do something well. There’s lots of ways. One way would be the following. You say, where’s the car? You measure the distance of the car at different times. So you make a chart. So you have a time here and the distance. I call the distance s in feet, say. And we’ll take the time in minutes. Now at no time, we’ll say zero time, the car hasn’t started yet. But after one minute, it’s started and it’s gone, say, 120 feet. Then in two minutes, it goes a little further. 
You’ll notice that it picked up more distance in the second minute because it’s accelerating, say. Third minute, we make it 900, say. The fourth minute, 950. Something happened between three and four. And it’s worse even at five. It stopped at a light. And now it speeds up again and goes 1,300 feet by the end of six. And 1,800 feet at the end of seven. And 2,350. And then at nine, you’ll find it’s only gone up to about 2,400 because in here it was stopped by a cop. Now, that’s one way to describe the motion, and you can go on with this. Another way is with a graph. If you plot the time this way and the distance this way, then you see that this thing will break a curve, something like this. As the time increases, the distance increases first slowly and then more rapidly. Then it slows up for a little while in here, and then it rises again and starts to peer out up there, something like that. So you can make a graph. The motion of a car is artificial and complicated. I mean, rather it’s complicated, not artificial. Obviously, to get a complete description, I would have to tell you where it is in the half-minute marks, too. But we suppose that that means something, that it’s got some position at all the intermediate times. Now, if we take another example of something that moves in a simpler manner, one which is more simple laws, is that of a falling body. And for a falling body, if you made such a chart, the time in seconds and the distance in feet, this is seconds now. And you had no seconds, and you started out with no feet. Then at the end of one second, it falls 16 feet. At the end of two seconds, it falls 64 feet. At the end of three seconds, 144 feet. I better look, yeah, 256. And at the end of five, 400, and so on. And if you plot the curve of this thing, then you’ll get a nice curve. It looks nice like this. This is a parabola, actually. This is the distance, and this is the time. 
As a matter of fact, I can give you the formula for this, so you can calculate this distance. At any time, it’s 16 feet times the square of the time. So if you put that in, you’ll get the right answer. You might say that there ought to be a formula for this one, too. Well, there might be. So mathematicians say, abstractly, f of t, meaning some formula depending on t, or what they call a function of t. But they don’t know what the function is. So there you are. I mean, there’s no nice way to write it in algebraic form. So there’s two examples of motion perfectly adequately described. Very simple idea. No subtleties. Well, there are subtleties. There are several subtleties in the first place. What do we mean by time and space? All these deep philosophical questions. It turns out that these questions have to be analyzed very carefully in physics, and it isn’t so simple. And the theory of relativity shows that our ideas of space and time are not as simple as you would think at first sight. However, for the present purposes, for the accuracy that we want at first, we don’t have to be very careful about defining the things very precisely. You say, that’s a terrible thing. I learned that in science you must define everything precisely. You cannot define anything precisely. Otherwise, you get into that paralysis of thought that comes in philosophers who sit opposite each other, and one says to the other, you don’t know what you’re talking about. The first says, what do you mean by no? What do you mean by speech? What do you mean by you? And so on. So, in order to be able to talk, we just have to agree that we’re talking roughly about the same thing. And I know that you know as much about time as I need you to know, because you got here on time, and you know what that means. And that’s as good an idea of time as we need. But there are these subtleties that have to be discussed, but we’ll discuss them later. 
Another subtlety involved was already mentioned, that it should be possible to imagine that the point which moves is always located somewhere. Of course, when you’re looking at it, there it is. But maybe if you look away, it isn’t there. Well, it turns out that in the motion of atoms, that that idea is also false, that you can’t find a marker on an atom and watch it move. So that doesn’t work either, and that’s another subtlety that we’ll have to get around in quantum mechanics. But as we are going to do, we’ll first learn to see what the problems are before the complications, and then we’ll be in a better position to correct it for the more recent knowledge on the subject. So we’ll take a simple point of view about time and space. You know what it means in a rough way. If you’ve driven a car, you know what a speed means and so on. Nevertheless, there are still some subtleties, rather deep ones, because if you come to think of it, the Greeks were never able to describe motion. Well, they could do this all right, but they couldn’t describe problems involving the velocity. The subtlety comes when you’re trying to figure out what you mean by the speed. The Greeks got very confused about this, and a new branch of mathematics had to be discovered beyond the geometry and algebra of the Greeks and Arabs, or Babylonians, or whoever you want that made the algebra. In order to describe this thing, if you don’t believe it, you go home and solve this problem by sheer algebra problem. A balloon is being blown up so that the volume of the balloon is increasing at 100 cc per second. At what speed is the radius increasing? Try that with algebra, we’ll see whether you can do it. The Greeks got somewhat confused. They were helped by, of course, some very confusing Greeks. We always think of them as being very great and subtle, and I think that some of the subtlety is due to, well, that’s a personal opinion. 
Anyway, to show that there were difficulties in the reasoning at the times, Zeno produced a large number of paradoxes, of which I’ll mention one, to show you that he doesn’t mean what the conclusion of the paradox, he just means that there are obvious difficulties in thinking about motion, is what he’s trying to say. Because listen, he says, to the following argument. Achilles runs 10 times as fast as the tortoise. Nevertheless, he can never catch the tortoise. For suppose that they start in a race, where the tortoise is 100 meters ahead of Achilles. Then, when Achilles has run 100 meters to the place where the tortoise is, the tortoise has proceeded forward of 10 meters, having run 1 tenth as fast. Now, Achilles has to run the 10 meters in order to catch up with the tortoise, but on arriving at that point, he finds that the tortoise is still 1 meter ahead of him. And so, running another meter, finds the tortoise 10 centimeters, and so on, ad infinitum. Therefore, at any moment, the tortoise is always ahead of Achilles, and Achilles can never catch up with the tortoise. What’s wrong with that? What’s wrong with that is, of course, that a finite amount of time can be divided into an infinite number of pieces, just like a line can be divided into an infinite number of pieces by dividing in half, half, half, half, half, and so on. And so, although there are an infinite number of steps in the argument to the point at which Achilles reaches the tortoise, it doesn’t mean there’s an infinite amount of time. Well, there are some subtleties in this. Now, in order to get to the subtleties in the clearest possible fashion, I remind you of a joke. I know you’ve heard this joke. At this point where the lady is caught by the cop, the cop comes up to her and says, Lady, you were going 60 miles an hour. And she says, That’s impossible, sir. I was only traveling for 7 minutes. Well, of course, it’s ridiculous. How can you go 60 miles an hour when I wasn’t going an hour? 
And, of course, the question is, How would you answer her if you were the cop? Well, if you were really the cop, then no subtleties are involved. It’s very simple. You say, Tell that to the judge. But now let’s suppose that we haven’t got that escape, but we take a more honest intellectual attack on the problem and try to explain to this lady what we mean by the idea that she’s going 60 miles an hour. Just what do we mean? So we start. What we mean, lady, is this, that if you kept on going the same way as you’re going now, in the next hour you’d go 60 miles. She’d say, Well, my foot was off the accelerator and the car was slowing down, so if I kept on going that way, it would not go 60 miles. Or take this ball, which is falling, and we want to know the speed at this time, 3. If the ball kept on going the way it’s going, meaning what? Kept on accelerating? Kept on going faster? No. It kept on going in a certain way the same, the velocity the same, but that’s what you’re trying to define. Because if this ball kept on going the way it’s going, it’ll just keep on going the way it’s going. So we need to define the velocity better. What has to be kept the same? Well, the lady could argue this way. If I kept going the way I’m going for one more hour, I’d run into that wall at the end of the street. So you see, it’s not so easy. Now, we have to say what we mean. Well, lots of physicists think that measurement is the only definition of anything. So obviously the thing to do is to use the instrument that measures the speed, the speedometer, and say, look, lady, your speedometer read 60. So she said, all right, next time I’ll break the speedometer. Or my speedometer’s broken and didn’t read at all. Does that mean the car’s standing still? We believe that there is something to measure. Before we build the speedometer, so we can say, for example, the speedometer isn’t working right, or the speedometer is broken. 
That would be a meaningless sentence if the velocity had no meaning independent of the speedometer. So we have in our minds, obviously, an idea which is independent of the speedometer, and the speedometer’s only meant to measure this idea. What’s this idea? So let’s see if we can get a better definition. Say, look, I know if you went an hour, you’d hit it. But if you went one second, you’d go 88 feet. Lady, you were going 88 feet per second. And then if you kept on going the next second, it would be 88 feet, and the wall down there is too far. So he says, yes, but there’s no law against going 88 feet per second. It’s only a law against going 60 miles an hour. But, says the judge or the cop, it’s the same thing. Well, if it’s the same thing, it shouldn’t be necessary to go into this circumlution about 88 feet per second. In fact, it’s for the falling body. You couldn’t even go one second, you see, because you’ll be changing your speed in a way, and you’ll have to define the speed somehow. Now, I think, though, we’re getting on the right track. Something like this, you see. If you kept on going for another thousandth of an hour, you’d go a thousandth of 60 miles. That’s the idea. In other words, you don’t have to keep on going for the whole hour. The point is that, for a moment, you’re going at that speed. Now, what that means is that if you went just a little bit more in time, the extra distance that you’d go would bear in a proportion the same as a car that goes at a steady speed of 60 miles an hour. In other words, perhaps the idea of the 88 feet per second is right. We see how far she went in the last second, and divide by 88 feet. 88 feet per second is 60 miles an hour. Divide by 88 feet and see if it comes out one, if it is at 60 miles an hour. In other words, in order to find the speed, we can find the speed this way. We ask, how far did you go in a very short time? Well, how far did you go in the last short time? 
And divide that distance by the time, and that gives the speed.

Now, your billed amount is not part of the API return, the way a language model’s “tokens” are. However, for me, this is the only audio API call today, so when I get a usage bar graph and check the page source to get even more digits, I can obtain the cost of that one call. Unfortunately, I discovered that the usage page API now sends amounts in cents instead of fractions of a cent.

999.262 seconds = $0.10

I suppose I could send another $0.001 of audio at a time over two hours and see when the penny flips, but that would be pedantic.

So I would suggest “check your math”.

_j I’m not sure that that is true, can you point me to where it says whisper is only for 30 second transcriptions?

My math is very simple.
With the command

/usr/bin/ffprobe -v error -show_entries stream=duration -of json <file>

I get [‘streams’][0][‘duration’], which is the duration in seconds.
I sum all the durations and get the total time in seconds.
The ffprobe values are correct.
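
The same procedure, sketched in Python rather than PHP so it can be run on its own. It uses the exact ffprobe arguments quoted above; the helper names are illustrative, and it assumes ffprobe is on the PATH.

```python
import json
import subprocess
from typing import Iterable


def parse_duration(ffprobe_json: str) -> float:
    """Extract streams[0].duration (seconds) from ffprobe's JSON output."""
    return float(json.loads(ffprobe_json)["streams"][0]["duration"])


def probe_duration(path: str) -> float:
    """Run the ffprobe command shown above on one file."""
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-show_entries", "stream=duration",
         "-of", "json", path],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_duration(out)


def total_seconds(paths: Iterable[str]) -> float:
    """Sum the durations of all files."""
    return sum(probe_duration(p) for p in paths)
```

Keeping the parsing separate from the subprocess call makes it easy to log the raw JSON per file and re-check the sum later.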

It’s not only 30 second transcriptions. It’s 30 second windows. Yes, it’s true. You can find it here:

We developed a strategy to perform buffered transcription of long audio by consecutively transcribing 30-second segments of audio and shifting the window according to the timestamps predicted by the model

Thank you for this information! Now I know this is indeed correct, Thank you!

So sending long form audio to the back end - is it better to chunk in 30 second intervals? Or does that not directly matter?

We’re already chunking in our app; I’m just wondering whether 30 seconds would be a better size for splitting the audio into chunks when sending to the back end.

I don’t think fragmentation is the cause of the cost problem.
For the next files I’ll save the returned JSON with all the information.
Unfortunately the API doesn’t return the cost, only the duration, and I didn’t find a way to get the current credit balance via the API.

It’s better to let whisper handle the chunking for you as it will be based on the timestamps. It will overlap them accordingly so context isn’t (hopefully) cut in half.

You may want to look at:

4.5. Strategies for Reliable Long-form Transcription

Found in the paper above

PS when the OpenAI pricelist says “$0.006 / minute (rounded to the nearest second)” - they mean “rounded up”.

0.262 seconds is not 1.0 seconds when rounded to the nearest second.

12:02 PM
Local time: Nov 4, 2023, 5:02 AM
whisper-1
1 request (1000 seconds total)

If the price for each file is 0.006 higher (rounded up), 734 * 0.006 = about $4.40.
In my case, the total cost is almost double.
Estimated $233.34 and real $397.08.

But… what if the rounding up is applied to each segment? :thinking:

In my case, 648 hours were transcribed.
If OpenAI splits each audio track into 30-second segments (3.8. Long-form Transcription):

648 hours = 2,333,349 seconds
2,333,349 / 30 = 77,779 segments
77,779 * 0.006 = $466.67

Too much.
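
The hypotheses above can be lined up against the real bill with a short Python sketch. The names are illustrative; the per-segment case follows the calculation in this post, charging $0.006 for every 30-second segment.

```python
import math

TOTAL_SECONDS = 2_333_349
FILES = 734
PRICE_PER_MINUTE = 0.006  # USD
ACTUAL_COST = 397.08      # USD, as reported


def billing_hypotheses() -> dict:
    """Return the total cost under each rounding hypothesis."""
    proportional = TOTAL_SECONDS / 60 * PRICE_PER_MINUTE
    segments = math.ceil(TOTAL_SECONDS / 30)
    return {
        "proportional": proportional,
        "extra 0.006 per file": proportional + FILES * PRICE_PER_MINUTE,
        "0.006 per 30 s segment": segments * PRICE_PER_MINUTE,
    }


if __name__ == "__main__":
    for name, cost in billing_hypotheses().items():
        print(f"{name}: {cost:.2f} USD (actual: {ACTUAL_COST:.2f})")
```

Neither hypothesis lands on $397.08: the per-file one is far too low and the per-segment one too high.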

If you transcribed 734 files by making 734 API requests, then the maximum overbilling that you should experience would be that around half the files would have an extra second of billing. A very small fraction.

I would look at your script and your files, and run it again not with a call to the API, but with a very verbose logging and an inclusion of an audio library that can decode the audio and produce accurate file statistics in seconds of length. Replicate the original calls to produce a much better API log than you might have had before.

Then actually look at the usage page and your per-5-minute whisper API calls. Look at the number of seconds that were logged as billed.

If you don’t have strong evidence with logging that you submitted a different number of seconds than you were billed for, and that your script ran as expected, then it is unlikely that you would receive any correction, as the API seems to work OK.

Instead, you must investigate the individual and actual API calls you made and how they resulted in the billing you received.
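
One possible shape for such a log, as a Python sketch. It assumes you record, for every request, the duration you measured locally and the duration reported in the API response. The column names and the `write_audit_log` helper are made up for illustration.

```python
import csv
import io
from typing import Iterable, Tuple

PRICE_PER_SECOND = 0.0001  # USD, from $0.006 per minute


def write_audit_log(rows: Iterable[Tuple[str, float, float]], out) -> float:
    """Write one CSV line per request and return the expected total cost.

    Each row is (file name, locally measured seconds, seconds reported
    in the API response).
    """
    writer = csv.writer(out)
    writer.writerow(["file", "local_s", "api_s", "diff_s", "expected_usd"])
    total = 0.0
    for name, local_s, api_s in rows:
        cost = api_s * PRICE_PER_SECOND
        total += cost
        writer.writerow([name, f"{local_s:.3f}", f"{api_s:.3f}",
                         f"{api_s - local_s:+.3f}", f"{cost:.4f}"])
    return total


if __name__ == "__main__":
    buf = io.StringIO()
    expected = write_audit_log([("a.mp3", 100.0, 100.0)], buf)
    print(buf.getvalue(), expected)
```

The expected total from the log can then be compared with the seconds shown on the usage page for the same time window.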

All my audio files are 44,100 Hz, mono, 8-bit, all converted with Audition.
If the file size is > 26,214,400 bytes (the OpenAI limit), my script resamples the file:

        $new = "/tmp/".ws_randstr(13).".mp3";
        $cmd = "ffmpeg -i \"".$file."\" -ac 1 -ar 22050 \"".$new."\" -hide_banner -loglevel error";

This changes the quality, not the duration in seconds.
If the API doesn’t return more information, especially the cost of the completed task, it’s almost impossible to debug the problem.

Now I’m saving the JSON.
For the next files I’ll check the duration returned in the JSON and the online credit balance, and I’ll give you more info.
Thanks all for the help.

An aside: OpenAI doesn’t charge for excessive quality or audio data. Only seconds.

But it also doesn’t help. The AI wasn’t trained on CD audio quality nor accepts it internally. But 8 bit audio is quite poor. It will either have excessive dither noise or quantization noise.

Use 16-bit audio. 44.1/16/2 => 22.05/16/1 would be good. MP3 is actually 24 bit audio in and out if done right. 8 bit is a Sound Blaster from 1992.

I actually found same or better transcription on audio re-encoded to mono Opus (in an .opus OGG container) when using the VOIP setting and speech encoding bandwidth below 24kbps.
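
A sketch of that re-encode as an ffmpeg call built from Python. It assumes an ffmpeg build with the libopus encoder; the 20k bitrate is only an example of a value below 24 kbps, not the exact setting I used, and `build_opus_cmd` is a hypothetical helper name.

```python
import subprocess
from typing import List


def build_opus_cmd(src: str, dst: str, bitrate: str = "20k") -> List[str]:
    """Build an ffmpeg command for mono, speech-tuned Opus output."""
    return [
        "ffmpeg", "-hide_banner", "-loglevel", "error",
        "-i", src,
        "-ac", "1",              # downmix to mono
        "-c:a", "libopus",       # Opus encoder
        "-application", "voip",  # encoder tuning for speech
        "-b:a", bitrate,         # target audio bitrate
        dst,
    ]


def reencode(src: str, dst: str) -> None:
    """Run the command; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(build_opus_cmd(src, dst), check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names.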

That’s optional along with your silence detection. There was a guy here on the forum wondering about hallucinations where there was no audio – and he was sending 15 minutes of silence that could have been free with pre-processing.

AI does a good enough job that there’s no motivation to use FFT noise fingerprint removal techniques or other audio enhancements beyond reducing the bandwidth of speech to that of a phone call - and in fact it is AI that is probably going to replace a lot of traditional algorithmic audio.

I ran some tests.
Here are the results:

Files: 50
Seconds (ffmpeg): 124,637.20
Seconds (API response): 124,636.25
Estimated cost: $12.46
Real cost: $11.83

In this case, some files were not charged.
File #5 was sent and transcribed, but “Credit balance” was not updated.
File #6 was sent, the API returned “Syntax error”, and “Credit balance” was not updated.
File #7 was sent and transcribed, but “Credit balance” was not updated.
From file #8 everything was normal until #16, then there were some problems on the server: “Failed to load credit balance, please try again later” and sometimes “The server had an error processing your request. Sorry about that! You can retry your request, or contact us through our help center at help.openai.com if you keep seeing this error. (Please include the request ID XXX in your email.)”.

I think there is some problem with the cost calculation: occasionally it is skipped, occasionally the amount is wrong, but in general it is correct.

Please don’t intentionally cut your audio into 30-second pieces to save cost; it will affect the accuracy of your transcription. If you do want to go ahead, make sure you pass in the transcription of the previous 30 seconds as the prompt to improve accuracy. But I don’t think it’s necessary: the model already cuts the audio into 30-second windows.
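
If you do chunk, the prompt chaining could look roughly like this in Python. The `transcribe` callable is a stand-in for your own wrapper around the API call that accepts a chunk and a prompt string; it is an assumption for illustration, not a library function.

```python
from typing import Callable, List


def transcribe_with_context(chunks: List[str],
                            transcribe: Callable[[str, str], str]) -> str:
    """Transcribe chunks in order, passing each one the previous text as prompt."""
    texts = []
    previous = ""  # the first chunk has no context
    for chunk in chunks:
        text = transcribe(chunk, previous)
        texts.append(text)
        previous = text
    return " ".join(texts)
```

Note that this has to run sequentially, since each call needs the previous result, so it cannot be combined with sending all chunks in parallel.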
