Hi wischnat,
Today I re-ran the same code, and I'm not sure why, but this time I did get a log:
```
{'type': 'conversation.item.input_audio_transcription.completed', 'event_id': 'event_...', 'item_id': 'item_...', 'content_index': 0, 'transcript': 'Thank you for watching\n'}
```
I spoke “1+1” in Mandarin but got the transcription ‘Thank you for watching\n’.
The way transcriptions currently work, the transcription is handled separately from the model's actual understanding of the audio, so the transcript may not correspond to what the AI is actually responding to.
I also get English or even French transcriptions when I speak to the AI in German.
This is perfectly normal for now and I think it will be improved upon in the future.
Don’t rely on transcriptions just yet.
You could turn off transcriptions and call the Whisper endpoint yourself, with the language parameter set to the language you expect the user to speak in.
This way you would get more accurate transcriptions.
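Something along these lines would do it (a minimal sketch, assuming you already save each user turn as a WAV file; the file name and language code are placeholders):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# "user_turn.wav" is a placeholder for however you save the user's audio in your pipeline.
with open("user_turn.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        language="zh",  # ISO-639-1 code of the language you expect (Mandarin here)
    )

print(transcript.text)
```

Pinning the language this way avoids the misdetection you are seeing.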
Good luck!
Thank you for your suggestion!
Before the GPT-4o Realtime API was released, I had already built a two-stage process: first converting speech to text, then sending the text to GPT to get a response. The responses were great, but the processing time was too long. The point of the GPT-4o Realtime API is to integrate this into a single stage, improving latency and the user experience.
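Roughly, the two-stage flow I mean looks like the sketch below (the file name and model choices are placeholders, not my actual code); each stage is its own network round trip, which is where the extra latency comes from:

```python
from openai import OpenAI

client = OpenAI()

# Stage 1: speech to text ("question.wav" stands in for the recorded question).
with open("question.wav", "rb") as audio_file:
    text = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        language="zh",
    ).text

# Stage 2: send the transcribed text to the chat model and read back the answer.
answer = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": text}],
)
print(answer.choices[0].message.content)
```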
I get that.
However, I suggested using BOTH solutions: the Realtime API in the “frontend” (that is, what the user actually talks to) and the Whisper endpoint in the “backend”, so transcriptions work properly.
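A minimal sketch of that split (the helper name, file name, and language code are assumptions, not a full implementation): the Realtime session keeps handling the conversation while the same captured audio is transcribed separately in the background.

```python
import threading
from openai import OpenAI

client = OpenAI()

def transcribe_in_background(wav_path: str, language: str) -> None:
    """Backend path: get an accurate transcript from the Whisper endpoint."""
    with open(wav_path, "rb") as f:
        result = client.audio.transcriptions.create(
            model="whisper-1", file=f, language=language
        )
    print("transcript:", result.text)

# Frontend path: your existing Realtime API session keeps talking to the user.
# Whenever a user turn ends and its audio is on disk ("user_turn.wav" is a
# placeholder), hand it to the backend path without blocking the conversation.
threading.Thread(
    target=transcribe_in_background, args=("user_turn.wav", "zh"), daemon=True
).start()
```

That way you keep the Realtime API's low latency for the actual conversation and only use Whisper for the transcripts you store or display.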
It sounds like you are not making any sense.
You replied to a post about an API parameter.
This topic is about the session.update event, which takes the field described here, max_response_output_tokens, now described correctly in the documentation.
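For reference, a minimal sketch of that event (the token value is arbitrary; check the current docs for the exact schema):

```python
import json

# session.update carries the parameter this topic is about.
session_update = {
    "type": "session.update",
    "session": {
        # Cap on output tokens per response; the docs also allow "inf" for no cap.
        "max_response_output_tokens": 1024,
    },
}

# This JSON would be sent over your already-open Realtime API websocket.
print(json.dumps(session_update, indent=2))
```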
What did I just read.
I don’t think this is even remotely on topic; please open a new topic for whatever you are talking about. No offense, of course.