I have created an AI audio recorder that summarizes the conversation between two people. I had someone try it on a conversation that ran a little longer than expected, and they got no result back. I've used it for short conversations and got great results.
How could I make it possible to handle, say, an hour-long conversation? Would I want gpt-4-32k for that?
You have gpt-3.5-turbo-16k available, and you'd certainly want to use it here.
Price for 16k tokens of input or 16k tokens of output:

| Model | Input price | Output price |
|---|---|---|
| gpt-4-32k | $0.96 | $1.92 |
| gpt-3.5-turbo-16k | $0.048 | $0.064 |
Whisper is going to need the audio in parts, in segments. Check the maximum audio length and file size limits (currently 25 MB per file for the Whisper API), and be conservative with your audio splitter.
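A minimal sketch of a conservative fixed-length splitter, assuming a Python backend with pydub (the file name and chunk length are placeholders):

```python
from pydub import AudioSegment

CHUNK_MS = 5 * 60 * 1000  # 5-minute chunks stay well under the file size limit

# Downmix to mono 16 kHz before splitting to keep file sizes small.
audio = AudioSegment.from_file("conversation.m4a")  # placeholder file name
audio = audio.set_channels(1).set_frame_rate(16000)

for i, start in enumerate(range(0, len(audio), CHUNK_MS)):
    chunk = audio[start:start + CHUNK_MS]
    chunk.export(f"chunk_{i:03d}.mp3", format="mp3")  # compressed export
```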
When you do get the final transcript back from reassembling the conversation chunks, you may well have more tokens than the AI can accept as input. That's another case where you can chunk (with overlap), ask for an AI summary of each chunk, and then put the half-dozen summaries in for a final pass. (The word "summary" means a quite short passage to the AI, so you might instead want it to write "an article" based on the transcript summaries taken as a whole.)
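Here's a rough sketch of that chunk-then-combine approach, assuming the Python openai v1 client; the chunk sizes are character-based stand-ins for a proper token count, and the prompt wording is just an example:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo-16k",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def summarize_transcript(transcript: str, chunk_chars=20_000, overlap=1_000) -> str:
    # Map step: summarize overlapping chunks so nothing is lost at boundaries.
    summaries = []
    for start in range(0, len(transcript), chunk_chars - overlap):
        chunk = transcript[start:start + chunk_chars]
        summaries.append(ask(
            "Summarize this portion of a two-person conversation, keeping "
            "who said what and any decisions made:\n\n" + chunk))
    # Reduce step: ask for "an article" rather than a "summary" so the
    # final result isn't overly terse.
    return ask("Write an article based on these partial summaries of one "
               "conversation, in order:\n\n" + "\n\n".join(summaries))
```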
The Whisper model is intrinsically designed to work on audio samples of up to 30 s in duration. However, by using a chunking algorithm, it can be used to transcribe audio samples of arbitrary length. This is possible through the Transformers pipeline method: chunking is enabled by setting chunk_length_s=30 when instantiating the pipeline. With chunking enabled, the pipeline can be run with batched inference.
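For example, if you're running Whisper locally via Hugging Face Transformers (a sketch; the model size is an assumption, pick one that fits your hardware):

```python
from transformers import pipeline

# chunk_length_s=30 turns on the chunking algorithm described above;
# batch_size runs the resulting chunks through the model as batched inference.
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",  # assumed model size
    chunk_length_s=30,
)
result = asr("conversation.wav", batch_size=8)
print(result["text"])
```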
I can't answer what an Express server even offers here, so how you would split audio by silences into chunks will come down to your own toolchain on your platform.
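If Python is in your toolchain, pydub's silence splitter is one option (a sketch; the thresholds would need tuning against your actual recordings):

```python
from pydub import AudioSegment
from pydub.silence import split_on_silence

audio = AudioSegment.from_file("conversation.wav")  # placeholder file name
chunks = split_on_silence(
    audio,
    min_silence_len=700,  # ms of quiet that counts as a break between speech
    silence_thresh=-40,   # dBFS level treated as silence; tune per recording
    keep_silence=200,     # ms of padding kept so words aren't clipped
)
for i, chunk in enumerate(chunks):
    chunk.export(f"segment_{i:03d}.mp3", format="mp3")
```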
The average speaking rate in a conversation is between 120-150 words per minute. This means the average conversation runs between 7,200-9,000 words per hour.
So at the upper end you will be processing around 9,000 words. If we take the conservative estimate of 1.33 tokens per word, you'll get 9000 * 1.33 = 11,970 tokens.
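If you'd rather measure than estimate, tiktoken counts the actual tokens (assuming Python; the transcript file is hypothetical):

```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
transcript = open("transcript.txt").read()  # hypothetical transcript file
print(len(enc.encode(transcript)), "tokens")
```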
So, wouldn't it therefore actually be 18,000 words an hour, because it's two people?
Maybe not, considering the two speakers mostly take turns: the conversation as a whole can still only push 120-150 words per minute.
You can do time slicing during audio recording on the front end: say, every 5 minutes, you send the audio data to the backend, then only process it when the user asks for a summary. So assume it recorded 1 hour of data; the backend now has 12 files of 5 minutes of audio each. You send them one by one to the Whisper API, using part of the previous Whisper result as the prompt to make a seamless transcription. Then, after every file is transcribed, send the full transcript to the chat API for a summary. A sketch of that backend flow is below.
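Here's that flow with the Python openai v1 client (a sketch; the directory layout, file names, and prompt wording are assumptions):

```python
from pathlib import Path
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

transcript = ""
for path in sorted(Path("slices").glob("*.mp3")):  # the 12 five-minute files
    with open(path, "rb") as f:
        result = client.audio.transcriptions.create(
            model="whisper-1",
            file=f,
            # Tail of the running transcript as the prompt; Whisper only
            # considers the final 224 tokens, so a short tail is enough.
            prompt=transcript[-600:],
        )
    transcript += " " + result.text

summary = client.chat.completions.create(
    model="gpt-3.5-turbo-16k",
    messages=[{"role": "user",
               "content": "Summarize this conversation:\n\n" + transcript}],
).choices[0].message.content
print(summary)
```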
If the query you pose to the model is fairly simple, then sure, GPT-4 8k could handle the text from a half-hour conversation.
Complex questions and prompts with many layers of instructions will spread the attention too thin for it to be effective, but a single (or maybe a few-point) simple request? Sure.