How to hardcode a response based on the words spoken by the user?

Hi, I’m trying to create functionality where the assistant responds with a specific phrase whenever my speech contains a certain word. I can detect the word, but I don’t know how to create an event that tells the assistant exactly what to respond with. Could someone help me?

client.on('conversation.updated', async ({ item, delta }: any) => {
  // Detect the word "blue" in any transcript attached to the item
  const hasBlue = item.content?.find(({ transcript }) =>
    transcript?.includes('blue')
  );

  // TODO: create message when hasBlue is true

  // Stream incoming audio chunks straight to the player
  if (delta?.audio) {
    wavStreamPlayer.add16BitPCM(delta.audio, item.id);
  }
  // Once the item completes, decode the full audio into a playable file
  if (item.status === 'completed' && item.formatted.audio?.length) {
    const wavFile = await WavRecorder.decode(
      item.formatted.audio,
      24000,
      24000
    );
    item.formatted.file = wavFile;
  }
});

If you are using the Realtime API with voice input and are relying on a transcript, then by the time you have the transcript, the AI has already spoken and sent its response.

You would thus need to hold back any playback until you receive a transcript, if you do not want that original audio to be heard.
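A minimal sketch of holding back playback, built on the client and wavStreamPlayer from your snippet (the buffering map and the word check are my assumptions; note this trades latency for control, since nothing plays until the item completes):

const pendingAudio = new Map<string, Int16Array[]>();

client.on('conversation.updated', async ({ item, delta }: any) => {
  if (delta?.audio) {
    // Buffer audio chunks instead of playing them immediately
    const chunks = pendingAudio.get(item.id) ?? [];
    chunks.push(delta.audio);
    pendingAudio.set(item.id, chunks);
  }
  if (item.status === 'completed') {
    const chunks = pendingAudio.get(item.id) ?? [];
    pendingAudio.delete(item.id);
    const transcript = item.formatted?.transcript ?? '';
    if (transcript.includes('blue')) {
      // Discard the held-back audio and play a substitute instead
    } else {
      // Safe to play the held-back audio now
      for (const chunk of chunks) {
        wavStreamPlayer.add16BitPCM(chunk, item.id);
      }
    }
  }
});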

If it is truly a specific phrase that must be spoken, like “I’m sorry, my response to that mentioned Sam Altman, which is not allowed”, it may be less work to obtain a recording of that phrase in the assistant’s voice and play it yourself as a substitute. Otherwise, you must send another message as the user, at additional cost, possibly as voice so as not to break a voice chat, producing a confusing chat history as a result.
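A sketch of playing such a pre-recorded substitute through the same wavtools player used in the question (the file path and the raw 24kHz 16-bit mono PCM format are assumptions):

// Fetch a pre-recorded clip and feed it to the same player used for model audio
async function playSubstitute() {
  const res = await fetch('/audio/substitute.pcm'); // hypothetical path
  const pcm = new Int16Array(await res.arrayBuffer());
  wavStreamPlayer.add16BitPCM(pcm, 'substitute-clip');
}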

If you have multilingual concerns, the replacement audio could be generated with a separate request to the Realtime API.

In that case, at which point should I stop the audio playback and trigger the appropriate phrase to be spoken? All within the ‘conversation.updated’ event?

You have to figure that out for yourself.
The easiest implementation is to stop any incoming audio from OpenAI and first validate the audio the user is sending to OpenAI. If your word occurs, don’t send the audio to OpenAI and instead play a pre-recorded clip.
Alternatively, send a response.create event with the modalities text AND audio, where you enter the instruction: “The user mentioned the word blue, tell him a fun fact about the color blue.” Or whatever you want the AI to talk about.
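A sketch of that second option, assuming the @openai/realtime-api-beta client (the response.create event shape with modalities and instructions is from the Realtime API reference; the client variable name matches the snippet above):

client.realtime.send('response.create', {
  response: {
    modalities: ['text', 'audio'],
    instructions:
      'The user mentioned the word blue. Tell them a fun fact about the color blue.',
  },
});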

Do not use the built-in transcription, as it can be very unreliable and gets sent AFTER the AI is already responding.
You have to use your own implementation of transcribing or detecting words, maybe a separate call to a standalone transcription model before you call the OpenAI Realtime API. Of course, this adds latency.
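A sketch of such a gate using the standalone transcription endpoint of the official openai Node SDK (whisper-1 and audio.transcriptions.create are real; the helper name and the assumption that you already have the user’s audio wrapped as a complete WAV buffer are mine):

import OpenAI, { toFile } from 'openai';

const openai = new OpenAI();

// Returns true if the buffered user audio (a complete WAV file) contains the word
async function containsWord(wavBuffer: Buffer, word: string): Promise<boolean> {
  const transcription = await openai.audio.transcriptions.create({
    file: await toFile(wavBuffer, 'input.wav'),
    model: 'whisper-1',
  });
  return transcription.text.toLowerCase().includes(word.toLowerCase());
}

// If containsWord(buffer, 'blue') resolves true, skip the Realtime API call
// and play the pre-recorded clip instead.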

Below is a diagram that explains this process.

Feel free to elaborate further.

Good luck :hugs:

sequenceDiagram
    participant User
    participant System
    participant StandaloneTranscription
    participant OpenAI
    participant AudioPlayer

    User->>System: Send audio
    alt Validate audio
        System->>StandaloneTranscription: Transcribe audio
        StandaloneTranscription-->>System: Transcription result
        System->>System: Check for specific word
        alt Word found
            System->>AudioPlayer: Play pre-recorded audio
        else Word not found
            System->>OpenAI: Forward audio
            OpenAI-->>System: Response
            System-->>User: Deliver response
        end
    else Do not validate audio
        System->>OpenAI: Forward audio
        OpenAI-->>System: Response
        System-->>User: Deliver response
    end
    note right of System: Account for the added transcription latency

Hey,

I know it’s late, but figured I’d jump in the conversation with a potential solution in case someone ever comes across the same issue.

Why wouldn’t you simply create a tool for that?
I’ve had similar issues with names. I noticed the Realtime API is unable to correctly understand names, even when asking the user to spell them out. The Whisper transcript of the user’s speech does, however, get them right.

So what I did, is add the following tool:

realtimeClient.addTool(
    {
        name: 'getSpelledName',
        description: 'Fetches the last saved user transcript containing a name spelled by the user.',
        parameters: {
            type: 'object',
            properties: {
                transcript: {
                    type: 'string',
                    description: 'The user transcript containing the spelled name.',
                },
            },
            required: ['transcript'],
        },
    },
    // Ignore the model-supplied argument and return the transcript we
    // saved ourselves from the last user message (see below)
    async () => {
        return { lastUserTranscript };
    }
);

This is how I save the user transcript for the tool:

realtimeClient.on('conversation.updated', ({ item, delta }) => {
    // Keep the latest user transcript around for the tool handler above
    if (item.type === 'message' && item.role === 'user' && item.formatted.transcript) {
        lastUserTranscript = item.formatted.transcript;
    }
});

And this is how I tell the assistant to call this tool whenever the customer spells his name:

# Instructions
- Whatever the user's question is, always start by asking for the person's full name and birthdate.
- Always ask the user to spell out their last name.
- Whenever the user spells their name, call the "getSpelledName" tool to retrieve it.
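
For completeness, these instructions can be applied to the session like this (a sketch, assuming the @openai/realtime-api-beta client’s updateSession method):

realtimeClient.updateSession({
  instructions: [
    '# Instructions',
    "- Whatever the user's question is, always start by asking for the person's full name and birthdate.",
    '- Always ask the user to spell out their last name.',
    '- Whenever the user spells their name, call the "getSpelledName" tool to retrieve it.',
  ].join('\n'),
});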

Hope this helps!
