Strange Responses from Realtime API Audio

When I connect to the Realtime API over WebSocket, the model sometimes produces strange audio, such as laughter or a sudden switch to another language.

Here is my workflow:

I run my own VAD (Voice Activity Detection) system. The user's audio is transcribed to text, and that text is what I send to the Realtime API. When VAD detects the start of speech, I send:

{ type: 'response.cancel' }

I also clear the previous item, identified by its item_id, using conversation.item.truncate.
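
For reference, here is a rough sketch of how those two interruption events might be built. The helper name buildInterruptEvents and its playedMs argument (how many milliseconds of the assistant audio had actually been played) are hypothetical and only for illustration. It assumes the truncated item has a single audio content part at index 0; content_index and audio_end_ms are the fields that conversation.item.truncate takes alongside item_id.

```javascript
// Hypothetical helper: builds the events sent when VAD detects user speech.
// lastItemId: id of the assistant item being played, or null if there is none.
// playedMs: milliseconds of that item's audio already played to the user.
function buildInterruptEvents(lastItemId, playedMs) {
    // Always stop the in-progress response first.
    const events = [{ type: 'response.cancel' }];
    if (lastItemId) {
        events.push({
            type: 'conversation.item.truncate',
            item_id: lastItemId,
            content_index: 0, // assumption: one audio content part per item
            audio_end_ms: Math.max(0, Math.floor(playedMs)), // whole, non-negative ms
        });
    }
    return events;
}
```

Each returned event would then be sent in order, for example with wsClient.send(JSON.stringify(event)).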

After that, I send the text:

{
    type: 'conversation.item.create',
    item: {
        type: 'message',
        role: 'user',
        content: [{
            type: 'input_text',
            text: message
        }]
    }
}

Tool Call Processing Workflow

When I receive response.function_call_arguments.done, I run the corresponding tool and then send:

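// Return the tool result to the conversation, tied to the original call.
const payload = {
    type: 'conversation.item.create',
    item: {
        type: 'function_call_output',
        call_id: callId,
        output: JSON.stringify(output)
    }
};
this.wsClient.send(JSON.stringify(payload));
// Ask the model to continue now that the tool output is available.
this.wsClient.send(JSON.stringify({ type: 'response.create' }));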
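
The same step can be written as a small pure function, which makes the event order easier to check in isolation. This is only a sketch: buildToolResultEvents is a hypothetical name, and it differs from the snippet above in one way, namely that a result which is already a string is passed through as-is instead of being JSON-encoded a second time.

```javascript
// Hypothetical helper: builds the two events sent after a tool finishes.
// callId: the call_id received with the function call.
// output: the tool result (any JSON-serializable value, or a string).
function buildToolResultEvents(callId, output) {
    // The output field is a string, so serialize anything that is not one.
    const serialized = typeof output === 'string' ? output : JSON.stringify(output);
    return [
        {
            type: 'conversation.item.create',
            item: {
                type: 'function_call_output',
                call_id: callId,
                output: serialized,
            },
        },
        // Sent second, so the model responds only after the output item exists.
        { type: 'response.create' },
    ];
}
```

Each returned event would then be sent in order with this.wsClient.send(JSON.stringify(event)).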

Is there anything wrong with this workflow?