When I connect to the Realtime API over WebSocket, the model sometimes produces strange audio, such as laughter or a sudden switch to another language.
Here is my workflow:
I use my own VAD (Voice Activity Detection) system. Once the audio has been converted to text, the text is sent to the Realtime API. When VAD detects the start of speech, I send:
type: 'response.cancel'
and clear the previous item_id using conversation.item.truncate.
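To make the interruption step concrete, here is a rough sketch of the two events involved. The helper names buildInterruptEvents and sendInterrupt, and the lastItemId and playedMs values, are hypothetical and only for illustration; content_index and audio_end_ms are the fields I understand conversation.item.truncate to expect, so treat them as an assumption to check against the API reference.

```javascript
// Hypothetical helper: builds the client events sent when VAD detects speech start.
// lastItemId: id of the assistant audio item being played (assumed to be tracked elsewhere).
// playedMs: milliseconds of that item actually played to the user (assumed to be tracked elsewhere).
function buildInterruptEvents(lastItemId, playedMs) {
  // Always cancel the in-progress response first.
  const events = [{ type: 'response.cancel' }];

  // Only truncate when there is an assistant item to truncate.
  if (lastItemId) {
    events.push({
      type: 'conversation.item.truncate',
      item_id: lastItemId,
      content_index: 0,
      audio_end_ms: Math.max(0, Math.floor(playedMs)),
    });
  }
  return events;
}

// Sends the events over any client exposing send(string), e.g. an open WebSocket.
function sendInterrupt(wsClient, lastItemId, playedMs) {
  for (const event of buildInterruptEvents(lastItemId, playedMs)) {
    wsClient.send(JSON.stringify(event));
  }
}
```

Skipping the truncate event when no item id is known avoids sending a request that refers to a missing item.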
After that, I send the text:
{
  type: 'conversation.item.create',
  item: {
    type: 'message',
    role: 'user',
    content: [{
      type: "input_text",
      text: message
    }],
  }
}
Tool Call Processing Workflow
When I receive response.function_call_arguments.done, I process the corresponding tool, and then send:
const payload = {
  type: "conversation.item.create",
  item: {
    type: "function_call_output",
    call_id: callId,
    output: JSON.stringify(output)
  }
}
this.wsClient.send(JSON.stringify(payload))
this.wsClient.send(JSON.stringify({ type: 'response.create' }))
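A slightly more defensive version of the tool-call step might look like the sketch below. The helper names parseToolArgs and buildToolReplyEvents are hypothetical, and I am assuming the done event carries the arguments as a JSON string; this is not my exact code.

```javascript
// Hypothetical helper: safely parses the JSON string of tool arguments.
// Returns { ok: true, args } on success, or { ok: false, error } on malformed JSON.
function parseToolArgs(rawArguments) {
  try {
    return { ok: true, args: JSON.parse(rawArguments || '{}') };
  } catch (err) {
    return { ok: false, error: String(err.message) };
  }
}

// Hypothetical helper: builds the two client events sent after the tool has run.
function buildToolReplyEvents(callId, output) {
  return [
    {
      type: 'conversation.item.create',
      item: {
        type: 'function_call_output',
        call_id: callId,
        // JSON.stringify(undefined) returns undefined, which would drop the field,
        // so fall back to null when the tool returned nothing.
        output: JSON.stringify(output === undefined ? null : output),
      },
    },
    // Ask the model to continue now that the tool result is in the conversation.
    { type: 'response.create' },
  ];
}
```

Each event returned by buildToolReplyEvents would then be passed through JSON.stringify and sent with this.wsClient.send, in order.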
Is there anything wrong with this workflow?