Hi all,
Firstly, credit to the developers for the GA version of Realtime; it’s brilliant and a huge step forward. Also, thanks for Hello Realtime, which is super helpful.
I have noticed an issue I am putting a workaround in place for.
On Realtime Audio, one-word answers such as “yes” or “no” are not reliably spoken, even though the logs show them as spoken. I have found they are dropped around 30% of the time, and my sense is that the issue gets worse if the model has just been speaking full sentences in the preceding discussion. This is a problem for my use case, where I sometimes need the model to say just yes or no.
I tested this via the SIP route. It should be possible to reproduce the issue with this prompt:
const INSTRUCTIONS = `
Repeat exactly what the user says, back to them.
`;
User: The sky is blue and clouds are white and fluffy.
Bot: The sky is blue and clouds are white and fluffy.
User: My marine animal is magenta in color.
Bot: My marine animal is magenta in color.
User: Yes
Bot: (no response)
User: No
Bot: (no response)
The failure is intermittent, so you may get a response, but I find it inconsistent. Note that the reply does appear in the transcript even when it is not spoken.
I have come up with a workaround, which is to adapt the prompt so that if the user says yes, the model says “The user said ‘yes’” instead. This seems to avoid the issue.
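For anyone who wants to try the workaround, here is a minimal sketch of how the adapted prompt could be built. The helper name `buildInstructions`, the `SHORT_REPLIES` list, and the exact wording of the extra rules are my own illustration rather than anything from the Realtime docs, and I have only seen this style of prompt help in my own testing.

```javascript
// Hypothetical list of one-word replies that tend to get dropped from the audio.
const SHORT_REPLIES = ["yes", "no"];

// Build the session instructions: the original echo prompt plus one rule per
// short reply that pads the answer out into a full sentence.
function buildInstructions(shortReplies) {
  const rules = shortReplies
    .map((word) => `- If the user says "${word}", say: The user said '${word}'.`)
    .join("\n");
  return [
    "Repeat exactly what the user says, back to them.",
    "Exception for one-word answers:",
    rules,
  ].join("\n");
}

const INSTRUCTIONS = buildInstructions(SHORT_REPLIES);
console.log(INSTRUCTIONS);
```

The resulting string is then used in place of the original `INSTRUCTIONS` constant. Keeping the short replies in a list makes it easy to add other words (for example "ok") if they turn out to be dropped as well.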
Thank you