Hey! Just here to clarify a few things:
1-2 word answers are like giving an LLM a 1-2 letter prompt.
It won’t generate anything meaningful because it doesn’t have enough context to understand you.
We aren’t sending enough audio for it to build any context, so it starts to generate random or default audio.
That’s only one of the issues you mentioned, so here’s the explanation for the names/addresses.
Let’s take the number 97 as an example.
Since the Realtime API is multilingual, the number “97” could be pronounced “Ninety-Seven”. However, it could also be pronounced “Siebenundneunzig” (97 in German), “Quatre-vingt-dix-sept” (97 in French), and many more.
I think you see the issue now.
A fix for this?
Try spelling out the numbers instead of writing actual numbers.
(This is my telephone number: four, nine, six, one, two… etc.)
I hope this helps! 
EDIT: Of course, spelling out only works when the AI gets the data from a function call. You can’t just spell out every single letter in a call and expect it to understand everything (although maybe that does help, so it’s worth a shot). My example was more about a customer support AI that gets its info from a function call.
So normal function call output would be:
Tel: 4917283917
And your processed output, before sending it to the Realtime API, would be:
Telephone: four, nine, one, seven, two, eight, three, nine, one, seven
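Here’s a minimal Python sketch of that preprocessing step, assuming your function returns plain text like the example above. The helper names (`spell_out_digits`, `preprocess_output`, `function_call_output_event`) and the `call_123` ID are hypothetical, and the “Tel:” to “Telephone:” expansion is only the one shown in this example. The event shape follows the Realtime API’s `conversation.item.create` client event with a `function_call_output` item, but double-check it against the current API reference:

```python
import json
import re

# English words for each ASCII digit (assumes the assistant should speak English).
DIGIT_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}


def spell_out_digits(text: str) -> str:
    """Replace each run of digits with comma-separated English words."""
    # [0-9] instead of \d so non-ASCII digits never hit a missing dict key.
    return re.sub(
        r"[0-9]+",
        lambda m: ", ".join(DIGIT_WORDS[d] for d in m.group()),
        text,
    )


def preprocess_output(text: str) -> str:
    """Make a raw function result easier for the model to read aloud."""
    # Hypothetical abbreviation expansion; extend with whatever your data uses.
    text = re.sub(r"\bTel:", "Telephone:", text)
    return spell_out_digits(text)


def function_call_output_event(call_id: str, raw_output: str) -> dict:
    """Wrap the processed result in a conversation.item.create client event."""
    return {
        "type": "conversation.item.create",
        "item": {
            "type": "function_call_output",
            "call_id": call_id,  # the ID from the model's function call
            "output": preprocess_output(raw_output),
        },
    }


if __name__ == "__main__":
    event = function_call_output_event("call_123", "Tel: 4917283917")
    print(json.dumps(event, indent=2))
```

After sending that event over your WebSocket connection, you’d typically follow it with a `response.create` event so the model actually speaks the result.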