Model gpt-4o-realtime-preview does not identify the voice of a recording

The gpt-4o-realtime-preview model does not identify the voice of a recording and I don’t know what I’m doing wrong. The method “sendWsOpenAi_Text” works fine, but when I use the method “sendWsOpenAi_Audio” the AI answers me with: “I’m sorry but I can’t identify the voice in a recording”. What is happening?


function connectWsOpenAi(handleRealtimeText,handleRealtimeAudio)
{
    this.ws = new WebSocket(
      this.urlWs, 
      undefined, 
      {
        headers: {
          Authorization: 'Bearer ' + this.apiKey,
          "OpenAI-Beta": "realtime=v1",
        },
      }
    );
  
    this.ws.onopen = () => {
      
      this.ws.send(JSON.stringify({
        type: "session.update",
        session: {
          modalities: ["text", "audio"],
          instructions: "Hablas español mexicano , porfavor asiste al usuario.",
          voice: "alloy",
          input_audio_transcription: {
            model: "whisper-1"
          },
          turn_detection:null
        }
      }));
    };
  
    this.ws.onmessage = (message) => {

      // Parse the incoming message
      const result =  JSON.parse(message.data);

      switch (result.type) {
        case 'response.text.delta':
          // Append the AI's text delta to the accumulated message
          this.messagesText = this.messagesText + result.delta;

          // Hand the accumulated text back via the callback (state)
          handleRealtimeText(this.messagesText);

          break;
        case 'response.text.done':
          // Clear the local message buffer
          this.messagesText = "";

          break;
        case 'response.audio_transcript.delta':
          // Append the AI's audio-transcript delta to the accumulated message
          this.messagesText = this.messagesText + result.delta;

          // Hand the accumulated text back via the callback (state)
          handleRealtimeText(this.messagesText);
          break;
        case 'response.audio.delta':
          // Convert the response audio chunk to a Buffer if needed
          //const audioData = Buffer.from(result.delta, 'base64');

          // Hand the base64 audio chunk back via the callback (state)
          handleRealtimeAudio(result.delta);

          break;
        case 'response.audio.done':
          // Clear the local message buffer
          this.messagesText = "";
          break;
      }
    };
  
    this.ws.onerror = (e) => {
      console.log("ERROR", e);
    };
  
    this.ws.onclose = (e) => {
      console.log("CLOSE", e);
    };
  }

  function sendWsOpenAi_Text(prompt){
    const event = {
      type: 'conversation.item.create',
      item: {
        type: 'message',
        role: 'user',
        content: [
          {
            type: 'input_text',
            text: prompt
          }
        ]
      }
    };

    this.ws.send(JSON.stringify(event));
    this.ws.send(JSON.stringify({type: 'response.create'}));
  }

  async function sendWsOpenAi_Audio(uri){

    const base64AudioData  = await filePathToBase64(uri);
    
    const event = {
      type: 'conversation.item.create',
      item: {
        type: 'message',  
        role: 'user',
        content: [
          {
            type: 'input_audio',
            audio: base64AudioData
          }
        ]
      }
    };


    this.ws.send(JSON.stringify(event));
    this.ws.send(JSON.stringify({type: 'response.create'}));
  }

Hey, welcome to the community!

I’m fairly certain this isn’t a listed feature of the Realtime API.

Did you read somewhere that this is possible?

I don’t know; it’s as if it’s refusing to respond when I use the audio method, and I haven’t seen any flag or option related to this.

There is something interesting: about 1 out of every 15 times I ask it something, it listens and answers me coherently.

I want to know what is really happening; since it is a beta, I suppose these kinds of rough edges are to be expected.

It may just be a glitch. The model can “forget” that it does things like web, images, etc. If it is only happening sporadically, that suggests the logic is glitching in that instance. I don’t really know what yours is “forgetting”, but it could be something like that. With GPT it is all text run through creative logic, and that isn’t perfect, in my experience :rabbit::honeybee::heart:

Wait, are you on the old voice setting? There are a few reports in the forum about it “hearing” but not responding. I had that issue a week or so ago.

This is that rabbit hole :rabbit:
Voice chat listens but doesn’t respond - ChatGPT - OpenAI Developer Forum

Voice mode not working for custom GPTs - Bugs - OpenAI Developer Forum

GPT voice mode is not working - ChatGPT - OpenAI Developer Forum

Standard Voice not working properly on custom GPTs I have created - Bugs - OpenAI Developer Forum

Help with chat gpt not hearing my voice - API - OpenAI Developer Forum


I’ll check that rabbit hole. For now, the model I’m using is:

gpt-4o-realtime-preview-2024-10-01

Is this right? https://openai.com/index/introducing-the-realtime-api/

This is what you need to read. At first I thought you meant the new AVM for the Apple app etc., sorry :rabbit:

How to get gpt-4o-realtime-preview to be more emotive? - API - OpenAI Developer Forum

Realtime API (Advanced Voice Mode) Python Implementation - API - OpenAI Developer Forum

Can’t use all 6 voices in gpt-4o-realtime-preview - Bugs - OpenAI Developer Forum

Connecting to the Realtime API - API - OpenAI Developer Forum

Announcing GPT-4o in the API! - Announcements - OpenAI Developer Forum

Okay, it sounded like you were asking how to identify a voice in audio.

So you just want it to respond correctly?

I have the exact same issue, just in Java.

AI speaks just fine, I can hear it.

I speak to the AI → it picks up that I’m talking, just not WHAT I’m saying → the AI responds with an unrelated or only half-correct, unexpected answer. (I usually ask “What color is the sky?” and it responds with something like “Green is an awesome color. What do you need help with?” or something along those lines.)

Sometimes I also get the response where it says “I can’t identify speakers from a recording.”

(Probably a default response to an empty input)

I’m not sure if this is an error on the API endpoint or an error in my code.
I made sure I use the right codec in my app and that all of the audio conversion is correct (sample rate, codec, etc.).
My transcription always returns either an error or empty.
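
If it helps to debug, you could also log the input-audio transcription events and any error events the server sends back. The event names below are what I understand the Realtime API beta emits (worth double-checking against the current reference), so treat this as a sketch of extra handling you could call from the existing onmessage switch:

function logInputAudioDebugEvents(result) {
  switch (result.type) {
    case 'conversation.item.input_audio_transcription.completed':
      // Whisper's transcript of the audio YOU sent. If this comes back empty
      // or garbled, the problem is usually the audio format, not a refusal.
      console.log('Input transcription:', result.transcript);
      break;
    case 'conversation.item.input_audio_transcription.failed':
      console.log('Input transcription failed:', result.error);
      break;
    case 'error':
      // Malformed client events (e.g. bad base64 or wrong encoding) land here.
      console.log('Realtime API error:', result.error);
      break;
  }
}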

For your issue, maybe check whether you’re converting the audio correctly before sending it to the websocket.
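
By default I believe the session expects raw 16-bit PCM, 24 kHz, mono, little-endian, base64-encoded (input_audio_format 'pcm16'), so a compressed file (m4a, mp3) read straight to base64 by something like filePathToBase64 wouldn't be understood; it would need to be decoded to raw samples and resampled first. A rough sketch of the last step, where the 24 kHz assumption and the helper name are mine:

function floatTo16BitPcmBase64(float32Samples) {
  // float32Samples: mono Float32Array already resampled to 24 kHz (assumed
  // default for input_audio_format 'pcm16'; verify in the Realtime API docs).
  const buffer = new ArrayBuffer(float32Samples.length * 2);
  const view = new DataView(buffer);
  for (let i = 0; i < float32Samples.length; i++) {
    // Clamp to [-1, 1] and scale to the signed 16-bit range
    const s = Math.max(-1, Math.min(1, float32Samples[i]));
    view.setInt16(i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true); // little-endian
  }
  // Base64-encode the raw bytes (in Node you could use
  // Buffer.from(buffer).toString('base64') instead of btoa)
  let binary = '';
  const bytes = new Uint8Array(buffer);
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  return btoa(binary);
}

The result is what would go into the audio field of the input_audio content item (or into input_audio_buffer.append). If the base64 you send is just the raw bytes of a compressed file, that could explain both the empty transcription and the refusal-style answers.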

I thought I did this but had no luck either.