I’m developing a voice assistant app, and I’m using a realtime model to hold real-time audio conversations with the user. I want ChatGPT to return two parts: one for the user to hear, and one that expresses ChatGPT’s emotion. For example:
User: Tell me a joke.
Assistant: <TEXT>I have a funny joke</TEXT><MOOD>Happy</MOOD>
I can prompt ChatGPT to output text in the format above, but the received audio includes the MOOD part. How can I make ChatGPT generate audio for only the TEXT part?
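For context, if I were synthesizing speech from text myself (text model plus separate TTS), I could strip the MOOD tag before synthesis. A minimal sketch of that parsing, assuming the tag format shown above (with the realtime model, the audio is generated directly, so this alone doesn't solve it):

```python
import re

def split_reply(reply: str) -> tuple[str, str]:
    """Split a tagged reply into (speech_text, mood).

    Hypothetical helper for the <TEXT>...</TEXT><MOOD>...</MOOD>
    format shown above; falls back to the raw reply if tags are missing.
    """
    text = re.search(r"<TEXT>(.*?)</TEXT>", reply, re.S)
    mood = re.search(r"<MOOD>(.*?)</MOOD>", reply, re.S)
    return (
        text.group(1) if text else reply,
        mood.group(1) if mood else "",
    )
```

Only the first element would be sent to TTS; the second would drive the app's emotion display.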
I implemented something similar to this.
However, instead of writing the mood into the response itself, I created a tool-calling function, “set_emotion”. On every turn, my assistant will:
Analyze the conversation
Call the set_emotion tool
Decide on the emotion that it should apply given some rules and pre-defined emotions
Apply that emotion to its voice (this is what the tool returns; for example: “You are very happy; your voice should have a higher pitch, etc.”)
This works much better than using session.update events, for example, and with GPT-realtime it does a really good job of conveying emotion in the voice.
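A minimal sketch of the pieces described above: registering a set_emotion tool via a session.update event, and a handler that returns the voice instructions for the chosen emotion as the tool output. The emotion names and instruction strings are made-up examples; the tool schema follows the Realtime API's function-tool shape, but check it against the current API reference:

```python
# Hypothetical emotion -> voice-instruction mapping (example values).
EMOTIONS = {
    "happy": "You are very happy; your voice should have a higher pitch and an upbeat pace.",
    "sad": "You are sad; speak slowly, with a lower pitch and softer volume.",
    "neutral": "Speak in a calm, even tone.",
}

def build_session_update() -> dict:
    """Event that registers the set_emotion tool with the realtime session."""
    return {
        "type": "session.update",
        "session": {
            "tools": [
                {
                    "type": "function",
                    "name": "set_emotion",
                    "description": "Pick the emotion to apply to your voice this turn.",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "emotion": {"type": "string", "enum": sorted(EMOTIONS)},
                        },
                        "required": ["emotion"],
                    },
                }
            ],
        },
    }

def handle_set_emotion(arguments: dict) -> str:
    """Tool output sent back to the model: instructions for the chosen emotion."""
    return EMOTIONS.get(arguments.get("emotion"), EMOTIONS["neutral"])
```

When the model calls set_emotion, you return the instruction string as the function-call output and trigger the next response; the model then speaks with that emotion applied.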