Model gpt-4o-realtime-preview does not identify the voice of a recording

The gpt-4o-realtime-preview model does not identify the voice of a recording and I don’t know what I’m doing wrong. The method “sendWsOpenAi_Text” works fine, but when I use the method “sendWsOpenAi_Audio” the AI answers me with: “I’m sorry but I can’t identify the voice in a recording”. What is happening?


function connectWsOpenAi(handleRealtimeText,handleRealtimeAudio)
{
    this.ws = new WebSocket(
      this.urlWs, 
      undefined, 
      {
        headers: {
          Authorization: 'Bearer ' + this.apiKey,
          "OpenAI-Beta": "realtime=v1",
        },
      }
    );
  
    this.ws.onopen = () => {
      
      this.ws.send(JSON.stringify({
        type: "session.update",
        session: {
          modalities: ["text", "audio"],
          instructions: "Hablas español mexicano , porfavor asiste al usuario.",
          voice: "alloy",
          input_audio_transcription: {
            model: "whisper-1"
          },
          turn_detection:null
        }
      }));
    };
  
    this.ws.onmessage = (message) => {

      // Parse the incoming message
      const result =  JSON.parse(message.data);

      switch (result.type) {
        case 'response.text.delta':
          // Append the AI's text delta to the accumulated message
          this.messagesText = this.messagesText + result.delta;

          // Hand the accumulated text back via the callback (state)
          handleRealtimeText(this.messagesText);

          break;
        case 'response.text.done':
          // Clear the local message buffer
          this.messagesText = "";

          break;
        case 'response.audio_transcript.delta':
          // Append the AI's audio-transcript delta to the accumulated message
          this.messagesText = this.messagesText + result.delta;

          // Hand the accumulated text back via the callback (state)
          handleRealtimeText(this.messagesText);
          break;
        case 'response.audio.delta':
          // Convert the response audio chunk to a Buffer if needed
          //const audioData = Buffer.from(result.delta, 'base64');

          // Hand the base64 audio chunk back via the callback (state)
          handleRealtimeAudio(result.delta);

          break;
        case 'response.audio.done':
          // Clear the local message buffer
          this.messagesText = "";
          break;
      }
    };
  
    this.ws.onerror = (e) => {
      console.log("ERROR", e);
    };
  
    this.ws.onclose = (e) => {
      console.log("CLOSE", e);
    };
  }

  function sendWsOpenAi_Text(prompt){
    const event = {
      type: 'conversation.item.create',
      item: {
        type: 'message',
        role: 'user',
        content: [
          {
            type: 'input_text',
            text: prompt
          }
        ]
      }
    };

    this.ws.send(JSON.stringify(event));
    this.ws.send(JSON.stringify({type: 'response.create'}));
  }

  async function sendWsOpenAi_Audio(uri){

    const base64AudioData  = await filePathToBase64(uri);
    
    const event = {
      type: 'conversation.item.create',
      item: {
        type: 'message',  
        role: 'user',
        content: [
          {
            type: 'input_audio',
            audio: base64AudioData
          }
        ]
      }
    };


    this.ws.send(JSON.stringify(event));
    this.ws.send(JSON.stringify({type: 'response.create'}));
  }

Hey, welcome to the community!

I’m fairly certain this isn’t a listed feature of the Realtime API.

Did you read somewhere that this is possible?

I don’t know; it’s as if it’s refusing to respond when I use the audio method, and I haven’t seen any flag or option related to this.

There is something interesting: about 1 out of every 15 times I ask it something, it listens and answers me coherently.

I want to know what is really happening; since it is a beta, I suppose these kinds of rough edges are to be expected.

It may just be a glitch. The model can “forget” that it does things like web, images, etc. If it is only happening sporadically, that suggests the logic is glitching in that instance. I don’t really know what yours is “forgetting”, but it could be something like that. With GPT it is all text run through creative logic, and that isn’t perfect, in my experience :rabbit::honeybee::heart:

Wait, are you on the old voice setting? There are a few reports in the forum about it “hearing” but not responding. I had that issue a week or so ago.

This is that rabbit hole :rabbit:
Voice chat listens but doesn’t respond - ChatGPT - OpenAI Developer Forum

Voice mode not working for custom GPTs - Bugs - OpenAI Developer Forum

GPT voice mode is not working - ChatGPT - OpenAI Developer Forum

Standard Voice not working properly on custom GPTs I have created - Bugs - OpenAI Developer Forum

Help with chat gpt not hearing my voice - API - OpenAI Developer Forum


I’ll check that rabbit hole. For now, the model I’m using is:

gpt-4o-realtime-preview-2024-10-01

Is this right? https://openai.com/index/introducing-the-realtime-api/

This is what you need to read. At first I thought you meant the new AVM for the Apple app etc., sorry :rabbit:

How to get gpt-4o-realtime-preview to be more emotive? - API - OpenAI Developer Forum

Realtime API (Advanced Voice Mode) Python Implementation - API - OpenAI Developer Forum

Can’t use all 6 voices in gpt-4o-realtime-preview - Bugs - OpenAI Developer Forum

Connecting to the Realtime API - API - OpenAI Developer Forum

Announcing GPT-4o in the API! - Announcements - OpenAI Developer Forum

Okay, it sounded like you were asking how to identify a voice in audio.

So you just want it to respond correctly?

I have the exact same issue, just in Java.

AI speaks just fine, I can hear it.

I speak to the AI → it picks up that I’m talking, just not WHAT I’m saying → the AI responds with an unrelated or only half-correct, unexpected answer. (I usually ask “What color is the sky?” and it responds with something like “Green is an awesome color. What do you need help with?” or something along those lines.)

Sometimes I also get the response where it says “I can’t identify speakers from a recording.”

(Probably a default response to an empty input)

I’m not sure if this is an error on the API endpoint or an error in my code.
I made sure I use the right codec in my app and that all of the audio conversion is correct (sample rate, codec, etc.).
My transcription always returns either an error or empty.
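
If it helps to debug, you could also log the input-audio transcription events and any error events the server sends back. The event names below are what I understand the Realtime API beta emits (worth double-checking against the current reference), so treat this as a sketch of extra handling you could call from the existing onmessage switch:

function logInputAudioDebugEvents(result) {
  switch (result.type) {
    case 'conversation.item.input_audio_transcription.completed':
      // Whisper's transcript of the audio YOU sent. If this comes back empty
      // or garbled, the problem is usually the audio format, not a refusal.
      console.log('Input transcription:', result.transcript);
      break;
    case 'conversation.item.input_audio_transcription.failed':
      console.log('Input transcription failed:', result.error);
      break;
    case 'error':
      // Malformed client events (e.g. bad base64 or wrong encoding) land here.
      console.log('Realtime API error:', result.error);
      break;
  }
}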

For your issue, maybe check whether you’re converting the audio correctly before sending it to the websocket.
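
By default I believe the session expects raw 16-bit PCM, 24 kHz, mono, little-endian, base64-encoded (input_audio_format 'pcm16'), so a compressed file (m4a, mp3) read straight to base64 by something like filePathToBase64 wouldn't be understood; it would need to be decoded to raw samples and resampled first. A rough sketch of the last step, where the 24 kHz assumption and the helper name are mine:

function floatTo16BitPcmBase64(float32Samples) {
  // float32Samples: mono Float32Array already resampled to 24 kHz (assumed
  // default for input_audio_format 'pcm16'; verify in the Realtime API docs).
  const buffer = new ArrayBuffer(float32Samples.length * 2);
  const view = new DataView(buffer);
  for (let i = 0; i < float32Samples.length; i++) {
    // Clamp to [-1, 1] and scale to the signed 16-bit range
    const s = Math.max(-1, Math.min(1, float32Samples[i]));
    view.setInt16(i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true); // little-endian
  }
  // Base64-encode the raw bytes (in Node you could use
  // Buffer.from(buffer).toString('base64') instead of btoa)
  let binary = '';
  const bytes = new Uint8Array(buffer);
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  return btoa(binary);
}

The result is what would go into the audio field of the input_audio content item (or into input_audio_buffer.append). If the base64 you send is just the raw bytes of a compressed file, that could explain both the empty transcription and the refusal-style answers.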

I thought I did this but had no luck either.