How to make ChatGPT speak only part of the generated text

I’m developing a voice assistant app, and I’m using a real-time model to have real-time audio conversations with the user. I want ChatGPT to return two parts: one for the user to hear, and the other to express ChatGPT’s emotion. For example:

User: Tell me a joke.

Assistant: <TEXT>I have a funny joke</TEXT><MOOD>Happy</MOOD>

I can get ChatGPT to output text in the format above with a prompt, but the audio I receive also includes the MOOD part. How can I make ChatGPT generate audio for only the TEXT part?

Hey

I implemented something similar to this.
However, instead of writing the mood into the response, I created a tool-calling function, “set_emotion”, so that on every turn my assistant will:

  1. Analyze the conversation
  2. Call the set_emotion tool
  3. Decide on the emotion that it should apply, given some rules and pre-defined emotions
  4. Apply that emotion to its voice (this is what the tool returns; for example: “You are very happy; your voice should have a higher pitch, etc…”)
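
A minimal sketch of the steps above in Python, not the exact code from this setup. It only builds the JSON events a client would send over the Realtime connection. The emotion presets, their wording, and the helper names `build_session_update` and `handle_function_call` are hypothetical. The event and field names (`session.update`, `conversation.item.create`, `function_call_output`, `response.create`) follow my understanding of the Realtime API and should be checked against the current reference, since some API versions may require extra session fields. Here `session.update` is used once to register the tool, not to change the emotion on each turn.

```python
import json

# Hypothetical emotion presets; the names and wording are illustrative only.
EMOTION_INSTRUCTIONS = {
    "happy": "You are very happy; your voice should have a higher pitch and an upbeat pace.",
    "sad": "You are sad; speak more slowly and softly.",
    "neutral": "Speak in a calm, even tone.",
}


def build_session_update() -> dict:
    """Build the client event that registers the set_emotion tool on the session."""
    return {
        "type": "session.update",
        "session": {
            "tools": [
                {
                    "type": "function",
                    "name": "set_emotion",
                    "description": (
                        "Call on every turn, before speaking, to choose "
                        "the emotion for the reply."
                    ),
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "emotion": {
                                "type": "string",
                                "enum": sorted(EMOTION_INSTRUCTIONS),
                            },
                        },
                        "required": ["emotion"],
                    },
                }
            ],
            "tool_choice": "auto",
        },
    }


def handle_function_call(name: str, call_id: str, arguments: str) -> list:
    """Turn a completed set_emotion call into the client events to send back.

    `arguments` is the JSON string of arguments produced by the model.
    """
    if name != "set_emotion":
        return []
    emotion = json.loads(arguments).get("emotion", "neutral")
    # Unknown emotions fall back to the neutral preset.
    guidance = EMOTION_INSTRUCTIONS.get(emotion, EMOTION_INSTRUCTIONS["neutral"])
    return [
        {
            # The tool result is the voice guidance the model should apply.
            "type": "conversation.item.create",
            "item": {
                "type": "function_call_output",
                "call_id": call_id,
                "output": guidance,
            },
        },
        # Ask the model to continue and speak the reply.
        {"type": "response.create"},
    ]
```

In use, each returned dict would be serialized with `json.dumps` and sent over the open connection. Because the mood travels through the tool call rather than the spoken text, nothing like a MOOD tag ends up in the audio.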

This works much better than sending session.update events, for example, and with gpt-realtime it does a very good job of conveying emotion in the voice.