We have been using Realtime API since preview and switched to V2. We see that V2 likes giving long responses. For example, it would say “Hi I am a Bot. What kind of car are you looking for? Any particular color? Would you like to schedule test drive”
Previous versions would be asking one or two question at a time and build on that.
Your only control surface to alter the model behavior is the system message you provide. You’ll need to communicate the type of responses that are expected to counter any seen symptoms.
A model ends its chat turn by producing a stop sequence. On chat models, this is a trained special token that is emitted. Different models have different qualities of predicting a stop token to end the message, versus predicting the next word, the next word of another sentence. Some recent models from OpenAI have been quite bad at this, where they simply can’t end, and even repeat the same response, starting over all by themselves or making a new message start without termination being output.
You can’t really communicate, " you produce a stop token which ends your turn", as the AI isn’t really self-aware that is what it is doing, and actually placing the special tokens by their string mapping is disallowed. What you can do is reinforce the style of conversation exchange. This might take the form, “a Bot assistant produces only one question, not a sequence of question sentences, and then the user replies to that single question in their own message.”
Responses and the realtime API don’t let you provide your own stop sequence that can be trained or instructed, unlike Chat Completions, which is for developers that understand AI instead of consumers that want an over-featured product to be used only one designed way.
This reinforces that when post-training the weights of a model on its message format, especially those models with different modalities, the right amount of reinforcement of predicting stop sequences is important because
I am not delivering any products with realtime, so I don’t have particular applications that need refinement where I experience differences in symptom cross model, except for experimentally using them, and seeing more the quality in voices and steerability.
What I notice as part of the product envelope, that I can bring to your attention, is the client event to create a conversation item, which can be a mid-session tune up for increased following (one that will be part of the conversation that can be truncated out), and can be textual and a system role message.
RealtimeConversationItemSystemMessage
A system message in a Realtime conversation can be used to provide additional context or instructions to the model. This is similar but distinct from the instruction prompt provided at the start of a conversation, as system messages can be added at any point in the conversation. For major changes to the conversation’s behavior, use instructions, but for smaller updates (e.g. “the user is now asking about a different topic”), use system messages.
So, you have:
tune up gpt-realtime-2 with messaging;
don’t migrate, wait for a model that can fulfill your needs.