Understanding token limits: feeding Whisper to GPT-4

I have created an AI audio recorder that summarizes the conversation between two people. I had someone try it for a little longer than expected and they got no result back. I’ve used it for short conversations and got great results.

How could I make it possible to track say…an hour conversation? Would I want gpt-4-32k for that?

You have gpt-3.5-turbo-16k available, and you'd certainly want to use it.

Price for 16K tokens of input or 16K of output:

Model                Input price    Output price
gpt-4-32k            $0.96          $1.92
gpt-3.5-turbo-16k    $0.048         $0.064
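For reference, those totals follow directly from the per-1K-token rates published at the time. A quick sketch (the rates hard-coded here are assumptions that may have changed since):

```javascript
// Rough cost estimate for a full 16K-token request plus 16K-token response.
// Per-1K-token rates are the published prices at the time of writing.
const rates = {
  "gpt-4-32k":         { input: 0.06,  output: 0.12 },
  "gpt-3.5-turbo-16k": { input: 0.003, output: 0.004 },
};

function cost(model, inputTokens, outputTokens) {
  const r = rates[model];
  return {
    input:  (inputTokens / 1000) * r.input,
    output: (outputTokens / 1000) * r.output,
  };
}

console.log(cost("gpt-4-32k", 16000, 16000));         // { input: 0.96, output: 1.92 }
console.log(cost("gpt-3.5-turbo-16k", 16000, 16000)); // { input: 0.048, output: 0.064 }
```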

Whisper is going to need parts, segments of audio. Check the maximum audio length and file size limits, and be conservative with your audio splitter.

When you do get the final transcript back from reassembling the conversation chunks, you may well have more tokens than the AI can accept as input. That's another case where you can chunk (with overlap), ask the AI for a summary of each chunk, and then put the half-dozen summaries in for a final summary. (The word "summary" to the AI means a quite short passage, so you might instead want it to write "an article" based on the transcript summaries taken as a whole.)
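A minimal sketch of that chunk-then-combine flow in Node (the `askModel` callback stands in for whatever chat-completion call you use; the chunk and overlap sizes are arbitrary examples, not recommendations):

```javascript
// Split a transcript into overlapping chunks, summarize each one,
// then ask for a final "article" over the combined summaries.
function chunkWithOverlap(text, chunkSize = 8000, overlap = 500) {
  const chunks = [];
  for (let start = 0; start < text.length; start += chunkSize - overlap) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // last chunk reached the end
  }
  return chunks;
}

async function summarizeTranscript(transcript, askModel) {
  const partials = [];
  for (const chunk of chunkWithOverlap(transcript)) {
    partials.push(await askModel(`Summarize this part of a conversation:\n\n${chunk}`));
  }
  return askModel(
    "Write an article covering the whole conversation, based on these partial summaries:\n\n" +
    partials.join("\n---\n")
  );
}
```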

I’m using an Express server so… any idea what that would be? And how would I go about calculating the timing for when a new recording would have to start?


File uploads are currently limited to 25 MB and the following input file types are supported: mp3, mp4, mpeg, mpga, m4a, wav, and webm.


Long-Form Transcription

The Whisper model is intrinsically designed to work on audio samples of up to 30s in duration. However, by using a chunking algorithm, it can be used to transcribe audio samples of arbitrary length. This is possible through Transformers' pipeline method. Chunking is enabled by setting chunk_length_s=30 when instantiating the pipeline. With chunking enabled, the pipeline can be run with batched inference.

I can’t speak to Express specifically, so how you would split audio by silences into chunks will take your own toolchain supported on that platform.

According to Bing:

The average speaking rate in a conversation is between 120–150 words per minute. This means a conversation carries between 7,200–9,000 words per hour.

So at most you will be processing around 9,000 words. If we take the conservative estimate of 1.33 tokens per word, you get 9,000 * 1.33 = 11,970 tokens.
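That arithmetic is easy to parameterize. A minimal sketch in Node (the 150 words/minute and 1.33 tokens/word figures are just the estimates above, not measured values):

```javascript
// Back-of-the-envelope token estimate for a recorded conversation,
// using ~150 words/minute and ~1.33 tokens per English word.
function estimateTokens(minutes, wordsPerMinute = 150, tokensPerWord = 1.33) {
  return Math.round(minutes * wordsPerMinute * tokensPerWord);
}

console.log(estimateTokens(60)); // 11970 — fits a 16K context, not an 8K one
console.log(estimateTokens(30)); // 5985  — fine for gpt-4 8K
```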


So, wouldn't it therefore actually be 18,000 words an hour, because it's two people?

A back-end API written in Express and Node.js.

I can look up chunking in Node, not Python. Let me know if you know any sources related to Node.

So, wouldn't it therefore actually be 18,000 words an hour, because it's two people?

Maybe not, considering the conversation as a whole can only carry 120–150 words per minute; the speakers mostly take turns.

You can do time-slicing during audio recording in the front end: say every 5 minutes, you send the audio data to the backend, then only process it when the user asks for a summary. So let us assume it recorded 1 hour of data; in the backend you now have 12 files of 5 minutes of audio each. You send them one by one to the Whisper API, using part of the previous Whisper result as the prompt to make a seamless transcription. Then, after every chunk is transcribed, send the result to the chat API for a summary.
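As a rough sketch of that flow (the `transcribeChunk` callback and the 800-character prompt tail are placeholders and assumptions, not a fixed API; Whisper's prompt parameter only considers roughly its final 224 tokens):

```javascript
// Front end: MediaRecorder's timeslice parameter emits a blob every 5 minutes.
//   const rec = new MediaRecorder(stream);
//   rec.ondataavailable = (e) => uploadChunk(e.data); // POST to your Express API
//   rec.start(5 * 60 * 1000);

// Back end: transcribe chunks in order, feeding the tail of the transcript
// so far as the prompt so Whisper keeps names and spellings consistent.
function whisperPrompt(previousText, maxChars = 800) {
  return previousText.slice(-maxChars); // only the tail of the prompt is used
}

async function transcribeAll(chunkPaths, transcribeChunk) {
  let transcript = "";
  for (const path of chunkPaths) {
    transcript += await transcribeChunk(path, whisperPrompt(transcript));
  }
  return transcript; // then hand this to the chat API for the summary
}
```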

I am going to look into this tomorrow. Seems like a good solution if I can pull it off.

The average speaking rate wouldn't include pauses where people don't speak, I suppose. So unless both speak all the time, no, it won't be double.

Okay, but so, just because I'm a lazy bastard:

gpt-4 could easily handle a 30-minute conversation then?

If the query you pose to the model is fairly simple, then sure, GPT-4 8k could handle the text from a half-hour conversation.

Complex questions and prompts with many layers and instructions will spread the attention too thin for it to be effective, but a single (or maybe few-point) simple request? Sure.

I've got 3 API calls, each doing one simple thing with a one-sentence system prompt, so the prompts are as simple as can be.

Or 180,000 words per hour if it's a board meeting with 20 people? Not quite.

Consider TV shows: you can see the pace of dialogue there and find which one has words emanating at the rate of your talkers.

Scripts of season 25 of South Park come in at around 6,000 tokens per episode of 22 minutes.

The plodding science of Cosmos (1980, Carl Sagan) measures at 8,500 tokens for 60 minutes of clean transcript.

I'm not sure what your point is here.

What I'm creating is meant for dialogue between two people; they'd be talking at whatever pace two people talk at alone in a room.