I’ve noticed a frustrating issue with the OpenAI Realtime Preview. For example, if I ask a question like “What does grift mean?”, the system correctly picks up the term “grift” but often provides an explanation that’s completely unrelated. This happens frequently, and it’s quite annoying to deal with.
It feels like the model struggles to stay focused on the specific question being asked, which makes it harder to trust the responses. I hope this can be improved so the system better aligns with user queries in the future.
If I understand correctly, this issue comes from the imbalance between audio and text training data: the model doesn't handle short audio inputs the way it handles text (text and audio share one embedding space, but audio has different features than the transcribed text form). I'm not sure if this is exactly what you're complaining about, but one tip: when asking a question, be more descriptive and add background and context. Instead of "What does grift mean?", try "Dear Chatbot, today I am wondering what the term 'grift' means, and when to use it and when not to?"
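If you want to make a habit of this, here's a tiny Python sketch of the idea. The `pad_query` helper, its 8-word threshold, and the wrapper template are all made up for illustration, not anything from OpenAI:

```python
def pad_query(question: str, min_words: int = 8) -> str:
    """Wrap very short questions in extra context so the model gets
    a longer, more distinctive input.

    Hypothetical helper: the threshold and the wrapper template are
    arbitrary choices for illustration.
    """
    if len(question.split()) >= min_words:
        return question
    return (
        "I have a vocabulary question. "
        f"{question} Please explain the meaning, give an example "
        "sentence, and note when the term should not be used."
    )


print(pad_query("What does grift mean?"))
```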
Okay, I am using this system instruction, and it's proving helpful if I state the topic of conversation upfront in voice mode.
"""
You shall provide comprehensive and detailed responses regardless of input length or formality. Short queries like “What is X?” should receive the same depth of analysis as longer, more formal versions. Never penalize concise questions with shallow answers. Treat all queries as equally deserving of thorough responses. Compare each new question against conversation history, identifying novel elements or different angles to ensure responses build upon rather than repeat past insights. Always respond in paragraph format and limit responses to 30 words unless explicitly requested otherwise.
"""
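If you're using the Realtime API directly rather than the app, you can apply an instruction like this with a `session.update` event. Here's a minimal sketch using the `websockets` Python package; the model name is an example, and depending on your `websockets` version the header keyword may be `extra_headers` or `additional_headers`:

```python
import asyncio
import json
import os

import websockets  # pip install websockets

INSTRUCTIONS = "You shall provide comprehensive and detailed responses..."  # the prompt quoted above


async def main():
    url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
    headers = {
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta": "realtime=v1",
    }
    async with websockets.connect(url, extra_headers=headers) as ws:
        # session.update applies the system instruction to the live session
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {"instructions": INSTRUCTIONS},
        }))
        # ... send audio/text events and read responses here ...


asyncio.run(main())
```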
I noticed some issues with voice recognition in Japanese conversations. This might not be exactly what you were asking about, but from what I understand:
The voice recognition seems to have trouble accurately capturing short Japanese phrases. In my experience, adding filler words like "ano-" (a Japanese filler similar to "um") or making the sentence intentionally longer improves recognition accuracy. When checking the conversation history, I've noticed that short phrases are sometimes recorded differently from what was actually said.
This seems to be an issue specific to Japanese language processing, though I'm not sure how it performs with English. Hope this helps clarify the situation, but please correct me if I've misunderstood anything!
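If you want to check this systematically, the Realtime API can transcribe your input audio and emit the result as an event, so you can compare what you said against what the model heard. A rough sketch, reusing the connection and imports from the session example above (the `whisper-1` transcription model is what the beta docs mention; treat the details as assumptions):

```python
async def watch_transcripts(ws):
    """Print what the API actually transcribed from the input audio.

    Assumes `ws` is the open websocket from the earlier sketch, and
    that input transcription has been enabled on the session, e.g.:
      {"type": "session.update",
       "session": {"input_audio_transcription": {"model": "whisper-1"}}}
    """
    async for message in ws:
        event = json.loads(message)
        if event.get("type") == "conversation.item.input_audio_transcription.completed":
            # Compare this against what you actually said; short Japanese
            # phrases are where the two seem to diverge most.
            print("heard:", event.get("transcript"))
```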