Gpt-4o-transcribe outputs content from prompt instruction for small/silent audio samples

We are using the prompt field in the gpt-4o-transcribe API to give the model a list of entities that we want it to transcribe correctly.

Example:

prompt: “Important words (glossary) for correct transcription: Sarah, Martinus, Kasper, Camilla”
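For context, a minimal sketch of how such a call might look with the OpenAI Python SDK (the helper name and file path are illustrative, not from our actual code; the SDK import is deferred so the sketch can be read without it installed):

```python
# Sketch: passing a glossary via the `prompt` field of the
# transcription endpoint. Helper name and file path are illustrative.

GLOSSARY_PROMPT = (
    "Important words (glossary) for correct transcription: "
    "Sarah, Martinus, Kasper, Camilla"
)

def build_transcription_kwargs(prompt: str) -> dict:
    """Keyword arguments for client.audio.transcriptions.create()."""
    return {"model": "gpt-4o-transcribe", "prompt": prompt}

def transcribe(path: str) -> str:
    # Imported lazily so this sketch loads even without the SDK.
    from openai import OpenAI
    client = OpenAI()
    with open(path, "rb") as f:
        result = client.audio.transcriptions.create(
            file=f, **build_transcription_kwargs(GLOSSARY_PROMPT)
        )
    return result.text
```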

Often, when we send audio segments that contain very little actual speech (which happens frequently when combining gpt-realtime with gpt-4o-transcribe), the transcription comes back containing the glossary words, even though they are not mentioned in the audio at all.

Example: “Sarah”, “Martinus and Camilla”, “Sarah, Martinus, Kasper, Camilla” etc.
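One heuristic mitigation we could imagine (purely a sketch, not an official fix; the helper name and filler-word list are our own assumptions): since a real utterance almost always contains words beyond the glossary, drop any transcript that consists of nothing but glossary entries.

```python
import re

# Glossary terms from the prompt (lowercased for matching).
GLOSSARY = {"sarah", "martinus", "kasper", "camilla"}

def looks_like_glossary_hallucination(transcript: str) -> bool:
    """Flag transcripts made up entirely of glossary words.

    Connector words like "and" are ignored so outputs such as
    "Martinus and Camilla" are still caught. Empty transcripts
    are not flagged.
    """
    words = re.findall(r"[\w']+", transcript.lower())
    content = [w for w in words if w not in {"and"}]
    return bool(content) and all(w in GLOSSARY for w in content)
```

This would obviously misfire on a genuine utterance that only names glossary people, so it is a trade-off rather than a clean solution.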

Is there any way around that? In gpt-4o-realtime we rely on server VAD, so it is up to the server to decide which audio segments get transcribed. It therefore seems out of our hands.
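That said, the server-VAD settings are configurable per session, so tightening them might reduce how many near-silent segments reach the transcriber in the first place. A sketch of a `session.update` payload (the numeric values are illustrative assumptions, not recommendations):

```python
# Sketch of a Realtime API session.update event. Stricter server-VAD
# settings (higher threshold, longer silence window) may cut down on
# near-silent segments being sent to gpt-4o-transcribe. Values here
# are illustrative only.
session_update = {
    "type": "session.update",
    "session": {
        "input_audio_transcription": {
            "model": "gpt-4o-transcribe",
            "prompt": (
                "Important words (glossary) for correct transcription: "
                "Sarah, Martinus, Kasper, Camilla"
            ),
        },
        "turn_detection": {
            "type": "server_vad",
            "threshold": 0.6,            # higher = less sensitive to noise
            "silence_duration_ms": 700,  # wait longer before closing a turn
            "prefix_padding_ms": 300,
        },
    },
}
```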

Are there any workarounds, or is there a better way to achieve what we want?

Thanks!