gpt-realtime-1.5: text output mode broken when tools are enabled

I’ve been using gpt-realtime-1.5 for a couple of days now and ran into an interesting issue. With output_modalities=["audio"], the model works great. But when I switch to output_modalities=["text"] with tools enabled and rely on an external TTS, performance drops significantly compared to gpt-realtime.

Issues I’m seeing in text-only mode:

  • Model wraps normal conversational responses in curly braces {} as if it’s outputting JSON
  • Function call arguments leak into the text output channel (the TTS literally tries to speak the function call JSON)
  • Internal control tokens leak into the output, e.g.: <|aesthetics_3|><|has_watermark|>
  • Ignores language instructions that gpt-realtime followed perfectly

None of these issues exist with gpt-realtime in the same configuration, or with gpt-realtime-1.5 in audio output mode. Seems specific to text mode + tools.
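For anyone trying to reproduce, here is a minimal sketch of the configuration I mean (field names follow the session.update shape of the Realtime API as I understand it; the get_weather tool is just a placeholder, not the tool I actually use):

```python
import json

# Sketch of the session.update payload for the failing setup:
# text-only output + tools enabled, audio handled by an external TTS.
session_update = {
    "type": "session.update",
    "session": {
        "type": "realtime",
        "output_modalities": ["text"],  # audio mode works; text mode misbehaves
        "tools": [
            {
                "type": "function",
                "name": "get_weather",  # placeholder tool for illustration
                "description": "Look up the current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            }
        ],
    },
}

# This JSON string is what gets sent over the WebSocket connection.
payload = json.dumps(session_update)
```

Switching the one line to output_modalities=["audio"], with everything else identical, makes the problems disappear for me.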


I would like to second that there is something very, very wrong with output_modalities=["text"] on the new model. Almost every response it gives is somehow wrong, or is a tool call at the incorrect time. After an incorrect tool call or response, it follows up with an “oops, I messed that up, let’s try again” and tries to continue.


Yes, this is happening to me too, and for some reason the first turn/message is always stuck until a second message comes in.


Hi and welcome to the community!

I can also reproduce several of the behaviors you described:

  • In text-only mode, the model does return JSON-like content (for example, normal replies wrapped in { ... }) instead of a natural conversational answer.
  • I also see tool-related JSON leaking into the user-facing text output in this setup, which would cause an external TTS that reads the text stream to literally speak JSON.
  • In the same configuration, I see weaker adherence to instructions compared to audio output mode.

Will ping the team to take a look!

PS: I did not capture the “internal control tokens” leak (<|aesthetics_3|><|has_watermark|>) in my tests. If anyone can share request IDs, that would be helpful.


Thanks for reproducing and escalating!

Unfortunately I didn’t capture the specific request IDs for the control token leak at the time. I’ll start logging them and share as soon as I can reproduce it again.


I can also confirm all these bugs happen in almost any conversation, regardless of the instructions the model gets. For me, as of now, gpt-realtime is superior.

  • Model wraps normal conversational responses in curly braces {} as if it’s outputting JSON
  • Function call arguments leak into the text output channel (the TTS literally tries to speak the function call JSON)
  • Internal control tokens leak into the output, e.g.: <|aesthetics_3|><|has_watermark|>
  • Ignores language instructions that gpt-realtime followed perfectly


Hey everyone, can someone please provide a request ID so we can look at our backend and have this reviewed by our engineering team?


I just saw this on session sess_DDe8ALUUKkXHDI1OPQcmC. I don’t think I have a request ID, but that session is an example where it returns weird control tokens in the output, and not much else happens.


Hi Parashant, I’m seeing the same issue and don’t have a request ID either, but I do have a session ID: sess_DGDRIlsIi4vdwRDiDmu2c

Would be much obliged if you had any feedback from the engineering team. Very eager to switch to 1.5 in production, but this is a blocker.

Thanks!

I’m seeing the exact same thing with out-of-band transcriptions (i.e. text output).

You can reproduce it by following the official realtime_out_of_band_transcription cookbook example and upgrading the model from gpt-realtime to gpt-realtime-1.5.
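For context, the out-of-band request from that cookbook pattern looks roughly like this (a sketch, not the exact cookbook code; field names assumed from the Realtime API, and the instructions string is a placeholder):

```python
import json

# Sketch of an out-of-band response.create request: the response is generated
# outside the main conversation ("conversation": "none") with text-only output,
# which is exactly the mode where gpt-realtime-1.5 misbehaves.
oob_request = {
    "type": "response.create",
    "response": {
        "conversation": "none",         # out-of-band: don't write to the session
        "output_modalities": ["text"],  # text-only output channel
        "instructions": "Transcribe the user's last audio turn verbatim.",
    },
}

# Sent over the same WebSocket connection as the main session events.
payload = json.dumps(oob_request)
```

With gpt-realtime this request returns a clean transcription; with gpt-realtime-1.5 the same request shows the JSON-wrapping and token-leak behavior described above.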