[Realtime API] Potential content bleed and cost-generating bugs

I am working on a frontend prototype to test the Realtime API. I started with text-only messages, as described in the official docs. My sendMessage function is straightforward:

const sendMessage = () => {
  if (ws && isConnected.value) {
    const event = {
      type: 'conversation.item.create',
      item: {
        type: 'message',
        role: 'user',
        content: [
          {
            type: 'input_text',
            text: 'Say three random words!'
          }
        ]
      }
    };
    ws.send(JSON.stringify(event));
    ws.send(
      JSON.stringify({
        type: 'response.create',
        response: {
          modalities: ["text"],
          instructions: 'Please assist the user.'
        }
      })
    );
  } else {
    console.warn('[AI Function] WebSocket is not connected.');
  }
};

On the first sendMessage (right after the WebSocket connection is available), I often get strange answers with the following issues:


Issue 1: Forced JSON

The very first request in a session usually returns its answer in an arbitrary JSON format, even though the request does not ask for JSON.

Example:

User: Say three random words!

Assistant: {"random_words":["Sunflower","Journey","Harmony"]}

The next requests in the same session then usually work and return pure text as expected:

User: Say three random words!

Assistant: Serendipity, Cascade, Whisper.


Issue 2: Potential Content Bleeding

The first request sometimes returns JSON containing unrelated data that ignores the question entirely. The answers are sometimes very specific to a particular topic that has nothing to do with my request. This looks like potential “content bleeding.”

Example:

User: How are you?

Assistant: {"Account Balance": "$3552.60"}

Or:

User: Say “Hello!”

Assistant: {"Temperature": "34 Degrees"}


Issue 3: Potential Cost Risk (and Maybe “Content Bleeding”)

The first request sometimes generates a huge number of pseudo-random “image placeholder” tags, and the output keeps going until I close the WebSocket connection.

Example:

User: Say three random words!

Assistant: <|is_landscape_image|><|xlimage|><|image_border_1024|><|vq_image_2035|><|vq_image_5132|><|vq_image_5132|><|vq_image_5132|> … [repeats many times]
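To limit the cost exposure from a runaway response like this, a client-side guard could count the streamed text and cancel once a budget is exceeded. The following is only a sketch under assumptions: createOutputGuard and maxChars are hypothetical names, the 2000-character budget is arbitrary, and it assumes text arrives as response.text.delta events and that the server honors a response.cancel client event.

```javascript
// Hypothetical helper: caps the amount of text accepted per response.
// maxChars is an illustrative budget, not an API limit.
function createOutputGuard(ws, maxChars = 2000) {
  let received = 0;
  let cancelled = false;

  // Call the returned function with every parsed server event.
  return function onServerEvent(event) {
    if (event.type === 'response.created') {
      // A new response started: reset the budget.
      received = 0;
      cancelled = false;
      return;
    }
    if (event.type !== 'response.text.delta' || cancelled) return;

    received += event.delta.length;
    // Heuristic only: a tag split across two deltas will not match.
    const looksLikeSpecialToken = /<\|[a-z0-9_]+\|>/i.test(event.delta);
    if (received > maxChars || looksLikeSpecialToken) {
      cancelled = true;
      // Ask the server to stop generating the in-progress response.
      ws.send(JSON.stringify({ type: 'response.cancel' }));
    }
  };
}
```

It could be wired up with something like ws.onmessage = (msg) => guard(JSON.parse(msg.data)), where guard is the function returned by createOutputGuard(ws). A server-side cap on output tokens, if the API version in use offers one, would be a more direct limit than this client-side check.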


Issue 4: Incoming Events Stop After response.content_part.added

After the first response.create, incoming events sometimes stop once response.content_part.added has been received. Every subsequent response.create in the same session is then usually “broken” as well.
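To pin down where the event stream stalls, it may help to record every server event type and check whether the current response ever reached response.done. This is a minimal diagnostic sketch; createEventTracker is a hypothetical helper, and it assumes each response begins with response.created and ends with response.done.

```javascript
// Hypothetical diagnostic helper: records server event types and reports
// whether the most recent response reached 'response.done'.
function createEventTracker() {
  const log = [];
  return {
    // Call with every parsed server event.
    record(event) {
      log.push({ type: event.type, at: Date.now() });
    },
    // Summarizes the state of the most recent response.
    status() {
      const types = log.map((e) => e.type);
      const start = types.lastIndexOf('response.created');
      const current = start === -1 ? [] : types.slice(start);
      const last = log.length ? log[log.length - 1] : null;
      return {
        lastEvent: last ? last.type : null,
        completed: current.includes('response.done'),
        msSinceLastEvent: last ? Date.now() - last.at : null,
      };
    },
  };
}
```

Calling status() a few seconds after sending response.create would show a stalled response as completed: false with lastEvent set to response.content_part.added, which makes the failure easier to report and reproduce.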

I’ve found that setting modalities to ["text"] in a session.update call, instead of only in response.create, seems to work around the “Forced JSON” issue.
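A minimal sketch of that workaround, assuming the same ws object as in sendMessage above; configureTextOnlySession is a hypothetical helper name, and whether this also helps with the other issues is untested:

```javascript
// Sketch: set text-only output at the session level,
// before the first conversation.item.create / response.create.
function configureTextOnlySession(ws) {
  ws.send(
    JSON.stringify({
      type: 'session.update',
      session: {
        modalities: ['text']
      }
    })
  );
}
```

This would be called once after the socket opens, for example from ws.onopen, before the first sendMessage. Waiting for the server's session.updated event before sending the first message may be a safer ordering, though that is an assumption rather than something confirmed here.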
