I’m developing a voice assistant app, and I’m using a realtime model to hold real-time audio conversations with the user. I want ChatGPT to return two parts: one for the user to hear, and one that expresses ChatGPT’s emotion. For example:
User: Tell me a joke.
Assistant: <TEXT>I have a funny joke</TEXT><MOOD>Happy</MOOD>
I can prompt ChatGPT to output text in the format above, but the received audio includes the MOOD part. How can I make ChatGPT generate audio for only the TEXT part?
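For context, if I were synthesizing speech from text myself (text model plus separate TTS), I could strip the MOOD tag before synthesis. A minimal sketch of that parsing, assuming the tag format shown above (with the realtime model, the audio is generated directly, so this alone doesn't solve it):

```python
import re

def split_reply(reply: str) -> tuple[str, str]:
    """Split a tagged reply into (speech_text, mood).

    Hypothetical helper for the <TEXT>...</TEXT><MOOD>...</MOOD>
    format shown above; falls back to the raw reply if tags are missing.
    """
    text = re.search(r"<TEXT>(.*?)</TEXT>", reply, re.S)
    mood = re.search(r"<MOOD>(.*?)</MOOD>", reply, re.S)
    return (
        text.group(1) if text else reply,
        mood.group(1) if mood else "",
    )
```

Only the first element would be sent to TTS; the second would drive the app's emotion display.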
I implemented something similar to this.
However, instead of writing the mood into the response itself, I created a tool-calling function, “set_emotion”. On every turn, my assistant will:
Analyze the conversation
Call the set_emotion tool
Decide on the emotion that it should apply given some rules and pre-defined emotions
Apply that emotion to its voice (this is what the tool returns; for example: “You are very happy; your voice should have a higher pitch, etc.”)
This works much better than using session.update events, for example, and with GPT-realtime it does a really good job of conveying emotion in the voice.
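A minimal sketch of the pieces described above: registering a set_emotion tool via a session.update event, and a handler that returns the voice instructions for the chosen emotion as the tool output. The emotion names and instruction strings are made-up examples; the tool schema follows the Realtime API's function-tool shape, but check it against the current API reference:

```python
# Hypothetical emotion -> voice-instruction mapping (example values).
EMOTIONS = {
    "happy": "You are very happy; your voice should have a higher pitch and an upbeat pace.",
    "sad": "You are sad; speak slowly, with a lower pitch and softer volume.",
    "neutral": "Speak in a calm, even tone.",
}

def build_session_update() -> dict:
    """Event that registers the set_emotion tool with the realtime session."""
    return {
        "type": "session.update",
        "session": {
            "tools": [
                {
                    "type": "function",
                    "name": "set_emotion",
                    "description": "Pick the emotion to apply to your voice this turn.",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "emotion": {"type": "string", "enum": sorted(EMOTIONS)},
                        },
                        "required": ["emotion"],
                    },
                }
            ],
        },
    }

def handle_set_emotion(arguments: dict) -> str:
    """Tool output sent back to the model: instructions for the chosen emotion."""
    return EMOTIONS.get(arguments.get("emotion"), EMOTIONS["neutral"])
```

When the model calls set_emotion, you return the instruction string as the function-call output and trigger the next response; the model then speaks with that emotion applied.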