[Realtime API] AI Answering Gibberish

I can hear the AI just fine.

When I talk to the AI though, I just get gibberish or an unrelated answer.
About 1 out of 15 times it answers my question correctly. Sometimes the answer is only vaguely on topic.
Example:

User - “What color is the sky?”
AI - “My favourite color is green.”

User - “What color is the sky?”
AI - “I can’t recognize speakers from a recording.”

The issue might be in how I convert the user’s audio; I’m hoping to get some help.


This is a common question. At the bottom of each Discourse topic is a section of related topics; one or more of them may be of value for your need.

Hey, thanks for the heads-up. I already looked through all the similar topics, but none of them have an answer for me.

I’m still hoping to resolve my issue with some assistance. I can provide code or anything else necessary, as I’m completely stuck.

Thanks. I cannot tell you how many people do not do that, so kudos to you.

While I do not actively use the speech technology, have you looked at the OpenAI cookbook?

Note: a search there for “speech” did not return a hit, but “tts” did. So if one search does not work, try other terms, or even read through all of the entries; one can learn a lot from them.

Thanks for the blazing-fast reply! However, this issue is related to the new Realtime API, so I won’t be needing any TTS models. I looked through the documentation and pretty much know it by heart at this point. :smile:

I’ve been stuck on this issue for over a week now and can’t seem to find a fix.

The only thing I found that could be even remotely related is that my sample rate could be wrong. However, that would just mean the user’s voice is low-pitched and slow OR high-pitched and fast.

I tried all kinds of sample rates: 8000, 16000, 24000, and even nonsensical ones like 48000. None worked; I still get a gibberish response, presumably because the AI does not hear the user correctly (or only hears parts of the audio) and so defaults to hallucinating, as it does with short or garbled inputs.

I just need help with the audio sending part.

Thank you for your time!

Edit: I’m working in Java, so I can’t really use the JavaScript or Python libraries as a reference. I know almost everyone uses those, and that makes the most sense, but for my use case I have to use Java.


Good question.

The only point I can note is that maybe this will get seen by the OpenAI employees working on this and they will make updates. However, do not expect them to respond; that happens maybe once in a few thousand posts.

I am sure others would like to know what you tried, what failed and what gave partial or full success. :slightly_smiling_face:

This does indeed seem like an inability to understand the audio. You can see it spouting responses that likely come from the post-training used to enforce both voice and behavior.

First, I would not send the 8-bit compressed telephony audio formats until you have this resolved. Use pcm16.

Then ensure the audio you are sending is at the correct sample rate. You can see this in example code:

    import base64
    from pydub import AudioSegment

    # Load the source audio (any file pydub/ffmpeg can read),
    # then resample to 24 kHz mono, 16-bit (2 bytes per sample) PCM
    audio = AudioSegment.from_file("input.wav")
    pcm_audio = audio.set_frame_rate(24000).set_channels(1).set_sample_width(2).raw_data

    # Encode the raw PCM bytes to a base64 string
    pcm_base64 = base64.b64encode(pcm_audio).decode()

That is how you must create audio input content segments.
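Since this thread is Java, here is a minimal sketch of the same step using javax.sound.sampled, assuming the source is already little-endian, 16-bit, mono PCM (the class and method names here are just illustrative):

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.util.Base64;
    import javax.sound.sampled.AudioFormat;
    import javax.sound.sampled.AudioInputStream;
    import javax.sound.sampled.AudioSystem;

    public class Pcm16Encoder {
        // Resample little-endian PCM16 mono audio to 24 kHz and return it
        // base64-encoded, ready for an input_audio_buffer.append event.
        public static String toBase64Pcm24k(byte[] pcm, float sourceRate) throws Exception {
            AudioFormat src = new AudioFormat(sourceRate, 16, 1, true, false);
            AudioFormat dst = new AudioFormat(24000f, 16, 1, true, false);
            AudioInputStream in = new AudioInputStream(
                    new ByteArrayInputStream(pcm), src, pcm.length / src.getFrameSize());
            AudioInputStream resampled = AudioSystem.getAudioInputStream(dst, in);
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = resampled.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            return Base64.getEncoder().encodeToString(buf.toByteArray());
        }
    }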

You can also open the audio you are sending in an editor like Audacity and confirm that it is interpreted correctly at those same settings (and little-endian, as is typical for PCM audio). The volume should also be loud enough.

Your UI should have level controls so you can monitor that the user’s input device is loud enough under operating system control, and you can also boost a segment by a fixed amount if it does not meet minimums. Too loud can also be a problem: boosted background noise can make it past the voice activity detector (typical algorithms need 5-10 seconds of background noise to learn to reject environment levels).
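A rough sketch of that boosting step in Java; the RMS threshold and gain factor are illustrative assumptions, not recommended values:

    public class LevelBoost {
        // Apply a fixed gain only when a segment's RMS level is below a minimum.
        public static byte[] boostIfQuiet(byte[] pcm, double rmsThreshold, double gain) {
            int samples = pcm.length / 2;
            long sumSquares = 0;
            for (int i = 0; i < samples; i++) {
                int s = (short) ((pcm[2 * i + 1] << 8) | (pcm[2 * i] & 0xFF)); // little-endian PCM16
                sumSquares += (long) s * s;
            }
            double rms = Math.sqrt((double) sumSquares / Math.max(samples, 1));
            if (rms >= rmsThreshold) return pcm; // already loud enough

            byte[] out = new byte[pcm.length];
            for (int i = 0; i < samples; i++) {
                int s = (short) ((pcm[2 * i + 1] << 8) | (pcm[2 * i] & 0xFF));
                int boosted = (int) Math.round(s * gain);
                // Clamp to the 16-bit range to avoid wrap-around distortion
                boosted = Math.max(Short.MIN_VALUE, Math.min(Short.MAX_VALUE, boosted));
                out[2 * i] = (byte) (boosted & 0xFF);
                out[2 * i + 1] = (byte) ((boosted >> 8) & 0xFF);
            }
            return out;
        }
    }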

Then, when you create the audio session, you can also set temperature. It can be set lower than the default of 1.0 to improve reliability, such as 0.01 while diagnosing the source of this issue.
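A sketch of that in Java with org.json (the value 0.01 follows the suggestion above; check the current Realtime session documentation for the accepted range):

    // Illustrative session.update carrying a low temperature for diagnosis
    JSONObject sessionUpdate = new JSONObject()
            .put("type", "session.update")
            .put("session", new JSONObject()
                    .put("input_audio_format", "pcm16")
                    .put("output_audio_format", "pcm16")
                    .put("temperature", 0.01));
    webSocket.sendText(sessionUpdate.toString(), true);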

Success in this area takes a different kind of full-stack developer: one with experience in audio technology, devices, programming, and processing.


Okay, so as I’m developing for a German company, I had been speaking to the voice in German.

I have now tried the same code while speaking English, and it works flawlessly.

I imagine that my audio is still bad, but the model can better “fill in the blanks” in English.

I tried exporting the audio that I’m sending to the WebSocket, but I’m having trouble doing it right: either the AI’s audio is pitched up or mine is. That just means I’m using the wrong sample rate somewhere. I know I have to send 24000 Hz, mono, 16-bit, base64-encoded chunks of audio frames, but I have no clue what I’m doing wrong.
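One way to pin down the real capture rate is to write the same raw bytes under several WAV headers and see which file plays at normal pitch. A rough sketch (the dump file name is hypothetical):

    import java.io.ByteArrayInputStream;
    import java.io.File;
    import java.nio.file.Files;
    import javax.sound.sampled.AudioFileFormat;
    import javax.sound.sampled.AudioFormat;
    import javax.sound.sampled.AudioInputStream;
    import javax.sound.sampled.AudioSystem;

    public class RateProbe {
        public static void main(String[] args) throws Exception {
            // Raw little-endian PCM16 mono bytes, exactly as sent to the WebSocket
            byte[] pcm = Files.readAllBytes(new File("WebSocketAudioSent.raw").toPath());
            for (int rate : new int[] { 8000, 16000, 24000, 48000 }) {
                AudioFormat fmt = new AudioFormat(rate, 16, 1, true, false);
                AudioInputStream in = new AudioInputStream(
                        new ByteArrayInputStream(pcm), fmt, pcm.length / fmt.getFrameSize());
                // Whichever probe file plays at normal pitch reveals the true rate
                AudioSystem.write(in, AudioFileFormat.Type.WAVE, new File("probe_" + rate + ".wav"));
            }
        }
    }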

I’m working with a SIP VoIP SDK, so the audio goes through that as well.
However, when I listen to the audio, everything sounds okay to me.

I am so stumped and wish I could resolve this seemingly simple issue.

I will post my code here as it might be a good starting point to help me with my issue.

Keep in mind it’s Java, so you may want to use ChatGPT to convert it to a more familiar language or to have it summarize my code.

Thank you so much for your help! :hugs:

import java.awt.event.KeyEvent;
import java.awt.event.KeyListener;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.WebSocket;
import java.net.http.WebSocket.Listener;
import java.util.Base64;
import java.util.concurrent.CompletionStage;
import java.util.concurrent.CountDownLatch;
import javax.sound.sampled.*;
import javax.swing.JFrame;
import org.json.JSONArray;
import org.json.JSONObject;
import io.github.cdimascio.dotenv.Dotenv;
import webphone.*;

public class WebSocketClient {
    private static final String URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview";
    private static final String AUTHORIZATION = "Bearer " + getApiKey();
    private static final String OPENAI_BETA = "realtime=v1";
    private static CountDownLatch latch = new CountDownLatch(1);
    public static WebSocket webSocket;
    private static webphone wobj;
    private static boolean isWebSocketConnected = false;
    private static ByteArrayOutputStream audioOutputStream = new ByteArrayOutputStream();
    private static ByteArrayOutputStream webSocketAudioReceived = new ByteArrayOutputStream();
    private static ByteArrayOutputStream webSocketAudioSent = new ByteArrayOutputStream();
    private static ByteArrayOutputStream phoneAudioReceived = new ByteArrayOutputStream();
    private static ByteArrayOutputStream phoneAudioSent = new ByteArrayOutputStream();

    public static void main(String[] args) {
        initializeSIP();
        startMediaStreaming();
        setupKeyListener();
        try {
            latch.await(); // Keep the main thread alive
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }

    // Load API key from environment variables
    private static String getApiKey() {
        Dotenv dotenv = Dotenv.load();
        return dotenv.get("OPENAI_API_KEY");
    }

    // Connect to WebSocket if not already connected
    public static void connectToWebSocket() {
        if (!isWebSocketConnected) {
            HttpClient client = HttpClient.newHttpClient();
            webSocket = client.newWebSocketBuilder()
                    .header("Authorization", AUTHORIZATION)
                    .header("OpenAI-Beta", OPENAI_BETA)
                    .buildAsync(URI.create(URL), new WebSocketListener())
                    .join();
            isWebSocketConnected = true;
        }
    }

    // Reconnect to WebSocket
    private static void reconnect() {
        System.out.println("Reconnecting.");
        isWebSocketConnected = false;
        connectToWebSocket();
    }

    // Initialize SIP settings
    private static void initializeSIP() {
        try {
            wobj = new webphone(0);
            MyNotificationListener listener = new MyNotificationListener();
            wobj.API_SetNotificationListener(listener);

            wobj.API_SetParameter("loglevel", 1);
            wobj.API_SetParameter("logtoconsole", true);
            wobj.API_SetParameter("serveraddress", "10.0.0.15");
            wobj.API_SetParameter("username", "AIPhone");
            wobj.API_SetParameter("password", "9aGi28axZrgbtimA");
            wobj.API_SetParameter("useaudiodevicerecord", false); // Disable recording from local audio device
            wobj.API_SetParameter("sendmedia_mode", 2); // Use API_GetMedia for media streaming
            wobj.API_SetParameter("sendmedia_atype", 3); // PCM 16-bit
            wobj.API_SetParameter("sendmedia_mtype", 1); // Audio only
            wobj.API_SetParameter("sendmedia_dir", 1); // Incoming only

            wobj.API_Start();
            Thread.sleep(200);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    // Start media streaming in a separate thread
    private static void startMediaStreaming() {
        new Thread(() -> {
            while (true) {
                byte[] mediaData = wobj.API_GetMedia();
                if (mediaData != null && mediaData.length > 0) {
                    streamAudioToWebSocket(mediaData);
                    try {
                        phoneAudioReceived.write(mediaData);
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                }
                try {
                    Thread.sleep(10); // Sleep to avoid busy waiting
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
            }
        }).start();
    }

    private static void sendTextToWS(String text) {
        if (isWebSocketConnected) {
            webSocket.sendText(new JSONObject()
                    .put("type", "response.create")
                    .put("response", new JSONObject()
                            .put("modalities", new JSONArray().put("text").put("audio"))
                            .put("instructions", text))
                    .toString(), true);
        }
    }

    // Stream audio data to WebSocket
    private static void streamAudioToWebSocket(byte[] audioBytes) {
        if (isWebSocketConnected) {
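            // Assumes the SIP SDK delivers 16 kHz PCM16; the Realtime API expects 24 kHz pcm16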
            byte[] resampledAudio = WebSocketListener.resampleAudio(audioBytes, 16000, 24000);
            if (resampledAudio == null) return; // resampling failed; skip this chunk
            String base64Audio = Base64.getEncoder().encodeToString(resampledAudio);
            webSocket.sendText(new JSONObject()
                    .put("type", "input_audio_buffer.append")
                    .put("audio", base64Audio)
                    .toString(), true);
            try {
                audioOutputStream.write(resampledAudio);
                webSocketAudioSent.write(resampledAudio);
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

    // Save audio streams to files
    private static void saveAudioToFile() {
        try {
            saveAudioStreamToFile(audioOutputStream, "FullConversation.wav", 24000);
            saveAudioStreamToFile(webSocketAudioReceived, "WebSocketAudioReceived.wav", 24000);
            saveAudioStreamToFile(webSocketAudioSent, "WebSocketAudioSent.wav", 24000);
            saveAudioStreamToFile(phoneAudioReceived, "PhoneAudioReceived.wav", 16000); // raw SIP audio, at the 16 kHz rate assumed above
            saveAudioStreamToFile(phoneAudioSent, "PhoneAudioSent.wav", 8000); // AI audio is resampled to 8 kHz before streaming to the phone
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    // Save a specific audio stream to a file
    private static void saveAudioStreamToFile(ByteArrayOutputStream audioStream, String fileName, float sampleRate)
            throws IOException {
        byte[] audioData = audioStream.toByteArray();
        AudioFormat format = new AudioFormat(sampleRate, 16, 1, true, false);
        ByteArrayInputStream bais = new ByteArrayInputStream(audioData);
        AudioInputStream audioInputStream = new AudioInputStream(bais, format,
                audioData.length / format.getFrameSize());
        File wavFile = new File(fileName);
        AudioSystem.write(audioInputStream, AudioFileFormat.Type.WAVE, wavFile);
        audioInputStream.close();
        audioStream.reset();
    }

    // WebSocket listener class
    public static class WebSocketListener implements Listener {
        private StringBuilder messageBuffer = new StringBuilder();
        private ByteArrayOutputStream audioBuffer = new ByteArrayOutputStream();
        private String lastItemId = null;

        @Override
        public void onOpen(WebSocket webSocket) {
            System.out.println("Connected to WebSocket.");
            webSocket.request(1);
            // Configure the session before requesting the first response
            webSocket.sendText(new JSONObject()
                .put("type", "session.update")
                .put("session", new JSONObject()
                    .put("voice", "alloy")
                    .put("input_audio_format", "pcm16")
                    .put("output_audio_format", "pcm16")
                    .put("input_audio_transcription", new JSONObject()
                        .put("model", "whisper-1")
                    )
                ).toString(), true);
            webSocket.sendText(new JSONObject()
                .put("type", "response.create")
                .put("response", new JSONObject()
                    .put("modalities", new JSONArray().put("text").put("audio"))
                    .put("instructions", "Assist the user.")
                ).toString(), true);
        }

        @Override
        public CompletionStage<?> onText(WebSocket webSocket, CharSequence data, boolean last) {
            messageBuffer.append(data);
            if (last) {
                try {
                    JSONObject event = new JSONObject(messageBuffer.toString());
                    if (event.has("type")) {
                        String eventType = event.getString("type");
                        switch (eventType) {
                            case "response.audio.delta":
                                String itemId = event.getString("item_id");
                                if (!itemId.equals(lastItemId)) {
                                    lastItemId = itemId;
                                    audioBuffer.reset(); // Clear the buffer for new item_id
                                }
                                String base64Audio = event.getString("delta");
                                byte[] audioBytes = Base64.getDecoder().decode(base64Audio);
                                audioBuffer.write(audioBytes);
                                streamAudioToPhone(audioBytes);
                                try {
                                    webSocketAudioReceived.write(audioBytes);
                                } catch (IOException e) {
                                    e.printStackTrace();
                                }
                                break;
                            case "conversation.item.input_audio_transcription.completed":
                                String transcript = event.getString("transcript");
                                System.out.println("Transcription completed: " + transcript);
                                break;
                            case "conversation.item.input_audio_transcription.failed":
                                JSONObject error = event.getJSONObject("error");
                                System.out.println("Transcription failed:");
                                error.keys().forEachRemaining(key -> {
                                    System.out.println(key + ": " + error.get(key));
                                });
                                break;
                            case "response.text.delta":
                                String deltaText = event.getString("delta");
                                System.out.println("Text delta received: " + deltaText);
                                break;
                            case "error":
                            JSONObject errorDetails = event.getJSONObject("error");
                            String errorType = errorDetails.getString("type");
                            String errorCode = errorDetails.optString("code", "N/A");
                            String errorMessage = errorDetails.getString("message");
                            String errorParam = errorDetails.optString("param", "N/A");
                            String errorEventId = errorDetails.optString("event_id", "N/A");

                            System.out.println("Error occurred:");
                            System.out.println("Type: " + errorType);
                            System.out.println("Code: " + errorCode);
                            System.out.println("Message: " + errorMessage);
                            System.out.println("Param: " + errorParam);
                            System.out.println("Event ID: " + errorEventId);
                            default:
                                //System.out.println("Unknown event type: " + eventType);
                                break;
                        }
                    }
                } catch (Exception e) {
                    e.printStackTrace();
                } finally {
                    messageBuffer.setLength(0); // Clear the buffer
                }
            }
            webSocket.request(1);
            return null;
        }

        @Override
        public void onError(WebSocket webSocket, Throwable error) {
            error.printStackTrace();
            reconnect();
        }

        @Override
        public CompletionStage<?> onClose(WebSocket webSocket, int statusCode, String reason) {
            System.out.println("Connection closed: " + reason);
            saveAudioToFile();
            latch.countDown();
            return null;
        }

        // Resample audio data to a different sample rate
        private static byte[] resampleAudio(byte[] audioData, float fromSampleRate, float toSampleRate) {
            try {
                AudioFormat originalFormat = new AudioFormat(fromSampleRate, 16, 1, true, false);
                AudioInputStream originalStream = new AudioInputStream(
                    new ByteArrayInputStream(audioData), originalFormat, audioData.length / originalFormat.getFrameSize());

                AudioFormat targetFormat = new AudioFormat(toSampleRate, 16, 1, true, false);
                AudioInputStream resampledStream = AudioSystem.getAudioInputStream(targetFormat, originalStream);

                ByteArrayOutputStream resampledOut = new ByteArrayOutputStream();
                byte[] buffer = new byte[1024];
                int bytesRead;
                while ((bytesRead = resampledStream.read(buffer)) != -1) {
                    resampledOut.write(buffer, 0, bytesRead);
                }

                return resampledOut.toByteArray();
            } catch (Exception e) {
                System.out.println("Failed to resample audio.");
                e.printStackTrace();
                return null;
            }
        }

        // Stream audio data to phone
        private static void streamAudioToPhone(byte[] audioBytes) {
            if (isWebSocketConnected) {
                // Resample audio from WebSocket sample rate (24000 Hz) to phone sample rate (8000 Hz)
                byte[] resampledAudio = resampleAudio(audioBytes, 24000, 8000);
                if (resampledAudio != null) {
                    wobj.API_StreamSoundBuff(1, -1, resampledAudio, resampledAudio.length); // Stream resampled audio buffer to phone
                    try {
                        phoneAudioSent.write(resampledAudio); // audio sent to the phone belongs in phoneAudioSent
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                }
            }
        }
    }

    // SIP notification listener class
    public static class MyNotificationListener extends SIPNotificationListener {
        private boolean isCallConnected = false;

        @Override
        public void onStatus(SIPNotification.Status e) {
            if (e.getLine() == -1) return;
    
            if (e.getStatus() == SIPNotification.Status.STATUS_CALL_RINGING && e.getEndpointType() == SIPNotification.Status.DIRECTION_IN) {
                System.out.println("Incoming call from " + e.getPeerDisplayname());
                wobj.API_Accept(e.getLine());
            } else if (e.getStatus() == SIPNotification.Status.STATUS_CALL_CONNECT && e.getEndpointType() == SIPNotification.Status.DIRECTION_IN) {
                if (!isCallConnected) {
                    System.out.println("Incoming call connected");
                    connectToWebSocket();
                    isCallConnected = true;
                }
            } else if (e.getStatus() == SIPNotification.Status.STATUS_CALL_FINISHED) {
                if (isCallConnected) {
                    System.out.println("Call finished");
                    if (webSocket != null) {
                        webSocket.sendClose(WebSocket.NORMAL_CLOSURE, "Call ended");
                        isWebSocketConnected = false;
                    }
                    isCallConnected = false;
                }
            }
        }
    }

    // Setup key listener
    private static void setupKeyListener() {
        JFrame frame = new JFrame();
        frame.setSize(300, 200);
        frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
        frame.addKeyListener(new KeyListener() {
            @Override
            public void keyTyped(KeyEvent e) {}

            @Override
            public void keyPressed(KeyEvent e) {
                if (e.getKeyCode() == KeyEvent.VK_R) {
                    System.out.println("R key pressed");
                    sendTextToWS("Wer bist du?");
                }
            }

            @Override
            public void keyReleased(KeyEvent e) {}
        });
        frame.setVisible(true);
    }
}

For documentation about the SIP SDK:

https://www.mizu-voip.com/Portals/0/Files/documentation/jvoip/

I fixed the error; it was due to an unnecessary sleep in my code. I can’t believe I overlooked such a simple mistake. :nerd_face:
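A sketch of what the fixed polling loop could look like, assuming the unnecessary sleep was the Thread.sleep(10) in startMediaStreaming (an assumption; the exact sleep is not named above):

    // Poll API_GetMedia continuously; sleeping between polls can miss frames,
    // and gaps in the sent audio reach the model as garbled speech.
    new Thread(() -> {
        while (true) {
            byte[] mediaData = wobj.API_GetMedia();
            if (mediaData != null && mediaData.length > 0) {
                streamAudioToWebSocket(mediaData);
            }
        }
    }).start();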

Thanks everyone.

