I am trying to develop a video explanation service as a personal project. I want to give users fast responses, so I chunk the audio file and transcribe it. Rather than waiting for all chunks to be transcribed, I am thinking of sending each transcribed chunk, along with a prompt, to the GPT API.
What I want the GPT API to do is remember the chunks, and when it detects a context change, produce a text explanation of that context.
But I found out that each GPT API request is stateless.
So what I want to know is:
Is there any way to send chunks sequentially to the GPT API and get a response built from multiple earlier chunks?
If so, is it possible to keep sending chunks while the GPT API detects context changes and creates explanation text?
In the context of a video explanation service that uses audio, here is some advice I can offer:
It might be best to handle the audio data you segment for transcription separately from the transcribed text data.
Using a transcription API like Whisper, you could store the transcription results as text and then send the accumulated transcription data from the beginning up to the current point to the chat completion endpoint, which can then detect context changes.
Additionally, with this approach, you could continue sending data to the transcription endpoint, such as Whisper, while also sending the transcribed text results to the chat completion endpoint.
By separating the transcription process and the context change detection process, it may be possible to realize the type of service you’re aiming for.
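As a rough sketch of that second step, assuming the official openai Python package (v1 client) and a placeholder model name, the detection call could look something like this:

```python
# Sketch only: assumes the openai v1 Python client and an OPENAI_API_KEY
# in the environment; the model name is a placeholder.
from openai import OpenAI

client = OpenAI()

def detect_context_change(accumulated_transcript: str) -> str:
    """Send the transcript so far and ask the model to flag topic changes."""
    response = client.chat.completions.create(
        model="gpt-4-turbo",  # placeholder; any chat model works here
        messages=[
            {
                "role": "system",
                "content": (
                    "You are given the transcript of a video so far. "
                    "If the topic has just changed, reply with a short "
                    "explanation of the topic that ended; otherwise reply "
                    "NO_CHANGE."
                ),
            },
            {"role": "user", "content": accumulated_transcript},
        ],
    )
    return response.choices[0].message.content
```

The important part is that the accumulated transcript travels inside every request; the endpoint itself keeps no memory between calls.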
Thank you for your reply.
What I have done so far: I fine-tuned the Whisper model on my local computer and evaluated its accuracy on IT terminology.
So when there is a request,
I download the audio from YouTube as WAV,
segment it based on the silent parts of the audio,
transcribe each chunk asynchronously,
and produce the results to Kafka (rough sketch below).
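A simplified sketch of that producer side, assuming pydub for the silence split, the openai-whisper package for the local model, and kafka-python for the producer (loading a fine-tuned checkpoint and the asynchronous transcription are simplified here):

```python
# Producer side sketch: silence-based segmentation + local Whisper + Kafka.
# Assumes pydub, openai-whisper, and kafka-python; loading a fine-tuned
# checkpoint and the asynchronous transcription are simplified here.
import json

import whisper
from kafka import KafkaProducer
from pydub import AudioSegment
from pydub.silence import split_on_silence

model = whisper.load_model("small")  # or the path to your fine-tuned checkpoint

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

audio = AudioSegment.from_wav("downloaded.wav")  # the WAV pulled from YouTube
chunks = split_on_silence(
    audio,
    min_silence_len=700,             # >700 ms of silence ends a chunk
    silence_thresh=audio.dBFS - 16,  # threshold relative to average loudness
    keep_silence=200,
)

for i, chunk in enumerate(chunks):
    path = f"chunk_{i}.wav"
    chunk.export(path, format="wav")
    text = model.transcribe(path)["text"]
    producer.send("transcripts", {"index": i, "text": text})

producer.flush()
```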
What I need to do now is consume these results, send them to the GPT API, and get an explanation based on them.
Do I need to collect multiple results and send them together, or can I send each result separately?
I ask because I'm concerned about spending too many credits if I send the whole text to the API. Plus, I want to produce results quickly, in a stream-like way.
Even if you send each result individually to the chat completion endpoint, as you mentioned previously, the API operates in a stateless manner, meaning that each API call functions independently. In other words, the language model does not retain the outcomes of previous API calls.
With the increased context length of language models, there is now more room to understand longer contexts, but this is only achievable when the context is included in the API call itself. While you may not need to send all transcribed content each time, you must send enough past text data up to the current point to detect changes in context.
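For instance, rather than resending the whole transcript, you could keep only the most recent tokens, measured with something like tiktoken (the 3,000-token budget below is an arbitrary assumption):

```python
# Keep only the tail of the transcript, bounded by a token budget.
# Assumes the tiktoken package; the 3000-token budget is arbitrary.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def tail_by_tokens(transcript: str, budget: int = 3000) -> str:
    """Return the last `budget` tokens of the transcript as text."""
    tokens = enc.encode(transcript)
    return enc.decode(tokens[-budget:])
```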
This principle also applies if you use the chat completion endpoint in the same way as ChatGPT. If you utilize the entire 128K context, each API call will incur the cost of 128K tokens, although the exact cost varies based on how those 128K tokens are divided between input and output. For example, at a rate of, say, $0.01 per 1K input tokens, a prompt that fills 128K tokens would cost about $1.28 on every call.
This is likely the reason why, outside of the ChatGPT Enterprise plan, the context length is capped at 32K tokens.
I don’t know exactly how you’re detecting context changes, but you need to include sufficient context data—though perhaps not the entire context—in the payload of each API call to detect these changes effectively.
Since you’re using a fine-tuned Whisper model on your local computer, transcription does not incur API usage costs, so you’ll likely need to account only for the costs associated with sending data to the language model.