Hi. We are building an app that helps users practice speaking a foreign language. Due to certain limitations it currently works as follows: the AI starts the conversation with a question, the user taps a button to speak, the voice is recorded, the user taps to send, and the message is transcribed. Sometimes a user taps to send without having said anything, so an essentially empty message/audio file is sent to the AI. The problem we are facing is that OpenAI hallucinates both a response for the user and its own reply to the empty audio file. Do you have any ideas how we could resolve this? Would be very grateful.
Yes, this is a known problem. Your best bet is to preprocess the audio before sending it for transcription by removing the silent parts. You can use ffmpeg for this. The disadvantages are that the trimming can throw off any timestamps you rely on, and that ffmpeg has to be installed on the server.
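Something along these lines could work as a starting point (a rough sketch that calls ffmpeg's `silenceremove` filter from Python via `subprocess`; the -40 dB threshold and the file paths are placeholders you would tune for your own recordings):

```python
import subprocess

def strip_silence(src: str, dst: str, threshold_db: int = -40) -> None:
    """Trim silence from a recording using ffmpeg's silenceremove filter.

    start_periods=1 trims leading silence; stop_periods=-1 also removes
    silent stretches later in the clip. The threshold is a guess; adjust
    it to your users' mic levels and background noise.
    """
    subprocess.run(
        [
            "ffmpeg", "-y", "-i", src,
            "-af",
            f"silenceremove=start_periods=1:start_threshold={threshold_db}dB:"
            f"stop_periods=-1:stop_threshold={threshold_db}dB",
            dst,
        ],
        check=True,
    )
```

If the trimmed output comes out with essentially zero duration, that is also a cheap signal that the user never spoke and you can skip the API call entirely.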
This is a known quirk of the model when it attempts to process files that contain no speech.
You could always just check the peak volume of the audio file and skip processing it if that falls below a certain threshold.
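For example, something like this (a minimal sketch assuming the recorder produces 16-bit PCM WAV; the 0.02 threshold is a made-up starting point you would tune against real recordings):

```python
import array
import wave

def peak_amplitude(path: str) -> float:
    """Return the peak amplitude of a 16-bit PCM WAV as a fraction of full scale."""
    with wave.open(path, "rb") as wf:
        if wf.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM audio")
        samples = array.array("h", wf.readframes(wf.getnframes()))
    if not samples:
        return 0.0
    return max(abs(s) for s in samples) / 32768.0

SILENCE_THRESHOLD = 0.02  # hypothetical cutoff; tune it empirically

def should_send(path: str) -> bool:
    """Only forward the recording to transcription if it isn't effectively silent."""
    return peak_amplitude(path) >= SILENCE_THRESHOLD
```

That way the empty recording never reaches the API, so there is nothing for the model to hallucinate about.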