How to make ChatGPT speak only part of the generated text

I’m developing a voice assistant app that uses a realtime model to hold live audio conversations with the user. I want each ChatGPT response to contain two parts: one that is spoken to the user, and one that expresses ChatGPT’s emotion. For example:

User: Tell me a joke.

Assistant: <TEXT>I have a funny joke</TEXT><MOOD>Happy</MOOD>
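
For reference, here is a minimal sketch of how the two parts could be separated on the client once the reply is available as text. The names `parse_reply` and `ParsedReply` are hypothetical and only for illustration, and the sketch assumes the model follows the tag format above. It only splits the text; it does not change what audio the model generates.

```python
import re
from typing import NamedTuple, Optional


class ParsedReply(NamedTuple):
    text: str            # the part meant to be spoken to the user
    mood: Optional[str]  # the emotion label, or None if no MOOD tag was found


# Non-greedy matches so each pattern captures only its own tag pair.
_TEXT_RE = re.compile(r"<TEXT>(.*?)</TEXT>", re.DOTALL)
_MOOD_RE = re.compile(r"<MOOD>(.*?)</MOOD>", re.DOTALL)


def parse_reply(raw: str) -> ParsedReply:
    """Split a tagged reply into its spoken text and its mood label."""
    text_match = _TEXT_RE.search(raw)
    mood_match = _MOOD_RE.search(raw)
    # If the model ignored the format, fall back to the whole reply as text.
    text = text_match.group(1).strip() if text_match else raw.strip()
    mood = mood_match.group(1).strip() if mood_match else None
    return ParsedReply(text, mood)


if __name__ == "__main__":
    reply = parse_reply("<TEXT>I have a funny joke</TEXT><MOOD>Happy</MOOD>")
    print(reply.text)
    print(reply.mood)
```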

I can get ChatGPT to output text in the format above using a prompt, but the audio I receive also includes the MOOD part. How can I make ChatGPT generate audio for only the TEXT part?