I have a Discord bot that transcribes conversations for users. I just switched over to gpt-4o-mini-transcribe, and the results are mixed. When it transcribes an utterance correctly, it's better than nova and I love it. However, there have been two issues:
- Sometimes the model behaves as if no audio was passed in, even though I have a check that drops captures that are too short to send to the transcription endpoint. In that case the completion is a random guess (e.g. “The following is a transcribed audio chunk”).
- It randomly returns completions in a language other than the one passed in the language option. I’ve gotten Japanese, Korean, Arabic, etc. Bottom line: it’s not respecting the language option. I’ve double-checked that the language value logs successfully and is a valid ISO-639-1 code (e.g. “en”).
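For the first issue, one thing I’ve been experimenting with is gating on signal energy in addition to my length check, since near-silent chunks seem to be what triggers the guessed completions. A minimal sketch, assuming the buffer is 16-bit little-endian mono PCM (the same format my duration math assumes) and using an untuned threshold I picked arbitrarily:

```javascript
// Returns true if a 16-bit LE mono PCM buffer is effectively silence.
// `threshold` is RMS amplitude normalized to [0, 1]; 0.01 is an assumed
// starting point, not a tuned value.
const isSilence = (buffer, threshold = 0.01) => {
  const sampleCount = Math.floor(buffer.length / 2);
  if (sampleCount === 0) return true; // no samples at all
  let sumSquares = 0;
  for (let i = 0; i < sampleCount; i++) {
    const sample = buffer.readInt16LE(i * 2) / 32768; // normalize to [-1, 1)
    sumSquares += sample * sample;
  }
  const rms = Math.sqrt(sumSquares / sampleCount);
  return rms < threshold;
};
```

Then I skip the API call entirely when `isSilence(buffer)` is true, instead of relying on the model to handle empty audio.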
Here is the function I use for transcription:
import fs from "node:fs";
import { setTimeout as sleep } from "node:timers/promises"; // promise-based setTimeout, so it can actually be awaited
import OpenAI from "openai";

const openai = new OpenAI({
  apiKey: OPENAI_API_KEY
});

const transcribe = async (buffer, language, prompt) => {
  const maxRetries = 5;
  const baseDelay = 1000; // starting delay of 1 second
  const maxDelay = 16000; // maximum delay of 16 seconds
  let attempts = 0;

  while (attempts < maxRetries) {
    // Write the audio buffer to a temporary file for upload
    const tempFilePath = `./data/temp_audio_${Date.now()}.wav`;
    try {
      fs.writeFileSync(tempFilePath, buffer);
      // Create a file object compatible with the OpenAI API
      const file = fs.createReadStream(tempFilePath);

      // Call OpenAI's transcription API
      const response = await openai.audio.transcriptions.create({
        file: file,
        model: "gpt-4o-mini-transcribe",
        language: language,
        // prompt: prompt, // disabled for now
        response_format: "json",
      });
      console.log(response);

      if (!response.text) {
        console.log("No transcription received from OpenAI");
        return null;
      }

      // Estimate duration from the audio buffer (16 kHz, 16-bit mono PCM)
      const sampleRate = 16000;    // 16,000 samples per second
      const bytesPerSample = 2;    // 16-bit samples = 2 bytes per sample
      const durationMs = (buffer.length / (sampleRate * bytesPerSample)) * 1000;
      console.log(`OpenAI transcription: ${response.text}`);
      console.log(`Estimated audio duration: ${durationMs}ms`);

      // Return both the transcription and duration
      return {
        text: response.text,
        duration: durationMs
      };
    } catch (e) {
      console.log(`Transcription attempt ${attempts + 1} failed: ${e}`);
      attempts++;
      if (attempts < maxRetries) {
        // Exponential backoff (2^attempt * baseDelay, capped at maxDelay)
        // plus small random jitter to prevent synchronized retries
        const exponentialDelay = Math.min(
          maxDelay,
          Math.pow(2, attempts) * baseDelay + Math.random() * 100
        );
        console.log(`Retrying in ${exponentialDelay}ms...`);
        await sleep(exponentialDelay);
      } else {
        console.log(`Max retries (${maxRetries}) reached. Giving up.`);
        return null;
      }
    } finally {
      // Clean up the temporary file on every path, including failures
      if (fs.existsSync(tempFilePath)) fs.unlinkSync(tempFilePath);
    }
  }
};
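One more detail that might matter: the buffer comes straight out of my capture pipeline as raw PCM (that’s what the 16 kHz / 16-bit duration math assumes), so the temp file has a .wav extension but no RIFF header. I’m not sure whether that could be confusing the model into guessing. A quick sanity check I added (my own helper, nothing from the SDK):

```javascript
// Returns true if the buffer starts with a RIFF/WAVE header, i.e. it is
// an actual WAV file rather than headerless raw PCM bytes.
const hasWavHeader = (buffer) =>
  buffer.length >= 12 &&
  buffer.toString("ascii", 0, 4) === "RIFF" &&
  buffer.toString("ascii", 8, 12) === "WAVE";
```

If this returns false for my buffers, I’d need to prepend a proper WAV header before writing the file.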
Any insight would be greatly appreciated!!