How to hardcode a response based on the words spoken by the user?

Hi, I’m trying to create functionality where the assistant responds with a specific phrase whenever my speech contains a certain word. I can detect the word, but I don’t know how to create an event that tells the assistant exactly what to respond with. Could someone help me?

client.on('conversation.updated', async ({ item, delta }: any) => {
  // Detect the word "blue" in any transcript attached to the item
  const hasBlue = item.content?.find(({ transcript }) =>
    transcript?.includes('blue')
  );

  // TODO: create message when hasBlue is true

  // Stream incoming audio chunks straight to the player
  if (delta?.audio) {
    wavStreamPlayer.add16BitPCM(delta.audio, item.id);
  }
  // Once the item completes, decode the full audio into a playable file
  if (item.status === 'completed' && item.formatted.audio?.length) {
    const wavFile = await WavRecorder.decode(
      item.formatted.audio,
      24000,
      24000
    );
    item.formatted.file = wavFile;
  }
});

If you are using the Realtime API with voice input and are relying on a transcript, then by the time you have the transcript, the AI has already spoken and sent its response.

You would thus need to hold back any playback until you receive a transcript, if you do not want that original audio to be heard.
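A minimal sketch of holding back playback, built on the client and wavStreamPlayer from your snippet (the buffering map and the word check are my assumptions; note this trades latency for control, since nothing plays until the item completes):

const pendingAudio = new Map<string, Int16Array[]>();

client.on('conversation.updated', async ({ item, delta }: any) => {
  if (delta?.audio) {
    // Buffer audio chunks instead of playing them immediately
    const chunks = pendingAudio.get(item.id) ?? [];
    chunks.push(delta.audio);
    pendingAudio.set(item.id, chunks);
  }
  if (item.status === 'completed') {
    const chunks = pendingAudio.get(item.id) ?? [];
    pendingAudio.delete(item.id);
    const transcript = item.formatted?.transcript ?? '';
    if (transcript.includes('blue')) {
      // Discard the held-back audio and play a substitute instead
    } else {
      // Safe to play the held-back audio now
      for (const chunk of chunks) {
        wavStreamPlayer.add16BitPCM(chunk, item.id);
      }
    }
  }
});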

If it is truly a specific phrase that must be spoken, like “I’m sorry, my response to that mentioned Sam Altman, which is not allowed”, it may be less work to obtain a recording of that phrase in the assistant’s voice and play it yourself as a substitute. Otherwise, you must send another message as the user, at additional cost, possibly as voice so as not to break a voice chat, producing a confusing chat history as a result.
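A sketch of playing such a pre-recorded substitute through the same wavtools player used in the question (the file path and the raw 24kHz 16-bit mono PCM format are assumptions):

// Fetch a pre-recorded clip and feed it to the same player used for model audio
async function playSubstitute() {
  const res = await fetch('/audio/substitute.pcm'); // hypothetical path
  const pcm = new Int16Array(await res.arrayBuffer());
  wavStreamPlayer.add16BitPCM(pcm, 'substitute-clip');
}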

If you have multilingual concerns, the replacement audio could be generated with a separate request to the Realtime API.

In that case, at which point should I stop the audio playback and trigger the appropriate phrase to be spoken? All within the ‘conversation.updated’ event?

You have to figure that out for yourself.
The easiest implementation is to stop any incoming audio from OpenAI and first validate the audio the user is sending to OpenAI. If your word occurs, don’t send the audio to OpenAI and instead play a pre-recorded clip.
Alternatively, send a response.create event with the modalities text AND audio, where you enter the instruction: “The user mentioned the word blue, tell him a fun fact about the color blue.” Or whatever you want the AI to talk about.
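A sketch of that second option, assuming the @openai/realtime-api-beta client (the response.create event shape with modalities and instructions is from the Realtime API reference; the client variable name matches the snippet above):

client.realtime.send('response.create', {
  response: {
    modalities: ['text', 'audio'],
    instructions:
      'The user mentioned the word blue. Tell them a fun fact about the color blue.',
  },
});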

Do not use the built-in transcription, as it can be very unreliable and gets sent AFTER the AI is already responding.
You have to use your own implementation of transcribing or detecting words, maybe a separate call to a standalone transcription model before you call the OpenAI Realtime API. Of course, this adds latency.
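A sketch of such a gate using the standalone transcription endpoint of the official openai Node SDK (whisper-1 and audio.transcriptions.create are real; the helper name and the assumption that you already have the user’s audio wrapped as a complete WAV buffer are mine):

import OpenAI, { toFile } from 'openai';

const openai = new OpenAI();

// Returns true if the buffered user audio (a complete WAV file) contains the word
async function containsWord(wavBuffer: Buffer, word: string): Promise<boolean> {
  const transcription = await openai.audio.transcriptions.create({
    file: await toFile(wavBuffer, 'input.wav'),
    model: 'whisper-1',
  });
  return transcription.text.toLowerCase().includes(word.toLowerCase());
}

// If containsWord(buffer, 'blue') resolves true, skip the Realtime API call
// and play the pre-recorded clip instead.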

Below is a diagram that explains this process.

Feel free to elaborate further.

Good luck :hugs:

sequenceDiagram
    participant User
    participant System
    participant StandaloneTranscription
    participant OpenAI
    participant AudioPlayer

    User->>System: Send audio
    alt Validate audio
        System->>StandaloneTranscription: Transcribe audio
        StandaloneTranscription-->>System: Transcription result
        System->>System: Check for specific word
        alt Word found
            System->>AudioPlayer: Play pre-recorded audio
        else Word not found
            System->>OpenAI: Forward audio
            OpenAI-->>System: Response
            System-->>User: Deliver response
        end
    else Do not validate audio
        System->>OpenAI: Forward audio
        OpenAI-->>System: Response
        System-->>User: Deliver response
    end
    note right of System: Account for the added transcription latency

Hey,

I know it’s late, but figured I’d jump in the conversation with a potential solution in case someone ever comes across the same issue.

Why wouldn’t you simply create a tool for that?
I’ve had similar issues with names. I noticed the Realtime API is unable to correctly understand names, even when asking the user to spell them out. The Whisper transcript of the user’s speech does, however, get them right.

So what I did, is add the following tool:

realtimeClient.addTool(
    {
        name: 'getSpelledName',
        description: 'Fetches the last saved user transcript containing a name spelled by the user.',
        parameters: {
            type: 'object',
            properties: {
                transcript: {
                    type: 'string',
                    description: 'The user transcript containing the spelled name.',
                },
            },
            required: ['transcript'],
        },
    },
    // Ignore the model-supplied argument and return the transcript we
    // saved ourselves from the last user message (see below)
    async () => {
        return { lastUserTranscript };
    }
);

This is how I save the user transcript for the tool:

realtimeClient.on('conversation.updated', ({ item, delta }) => {
    // Keep the latest user transcript around for the tool handler above
    if (item.type === 'message' && item.role === 'user' && item.formatted.transcript) {
        lastUserTranscript = item.formatted.transcript;
    }
});

And this is how I tell the assistant to call this tool whenever the customer spells his name:

# Instructions
- Whatever the user's question is, always start by asking for the person's full name and birthdate.
- Always ask the user to spell out their last name.
- Whenever the user spells their name, call the "getSpelledName" tool to retrieve it.
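
For completeness, these instructions can be applied to the session like this (a sketch, assuming the @openai/realtime-api-beta client’s updateSession method):

realtimeClient.updateSession({
  instructions: [
    '# Instructions',
    "- Whatever the user's question is, always start by asking for the person's full name and birthdate.",
    '- Always ask the user to spell out their last name.',
    '- Whenever the user spells their name, call the "getSpelledName" tool to retrieve it.',
  ].join('\n'),
});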

Hope this helps!
