Sharing my experience with the Realtime API on the backend ☕

I have read several posts on this forum about problems with the Realtime API, in particular these two:

  • Audio cuts off at the end of each AI response.
  • No audio transcript is received from the user.

I want to share how I overcame these problems. A caveat: my experience is based on question-and-answer scenarios, with Java on the backend.

Audio cuts off at the end of each AI response

After you finish speaking, you send a response.create request. The AI then streams audio fragments through several response.audio.delta events and sends a response.audio.done event when it finishes. You play each audio delta as it arrives and stop the speakers when the done event arrives. However, the done event often arrives while buffered audio is still playing, so stopping the speakers immediately cuts off the tail of the response. Adding a small delay before stopping the speakers solved the problem for me.
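If you play audio with javax.sound.sampled, there is also a more deterministic option than a fixed delay: SourceDataLine.drain() blocks until the line's internal buffer has been emptied, so draining before stopping guarantees the tail of the response is played. A minimal sketch of the done-event handler, assuming a running SourceDataLine named speaker and the same event classes as the demo below:

// On response.audio.done: let the buffered tail finish playing, then stop.
realtime.onEvent(ServerEvent.ResponseAudioDone.class, event -> {
    speaker.drain(); // Blocks until all queued audio has been played
    speaker.stop();  // Nothing left in the buffer, so nothing gets cut off
});

The full demo at the end combines the short delay with this drain-then-stop order.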

User audio transcription not received

To be honest, I haven't experienced this problem myself, but perhaps some of you have forgotten to configure the audio transcription model (whisper-1) in the session, or are not handling the user audio transcription events asynchronously; it could also depend on the library you use.
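For reference, and independent of the language or library you use, the session configuration must enable input audio transcription explicitly; otherwise the server never emits the conversation.item.input_audio_transcription.completed event for the user's speech. On the wire, the session.update event looks roughly like this (other session fields omitted):

{
  "type": "session.update",
  "session": {
    "input_audio_transcription": {
      "model": "whisper-1"
    }
  }
}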

Demo code

Below is demo code for a question-and-answer interaction with the Realtime API, using Java and the simple-openai library. The code is commented and overcomes both problems:

import java.util.Arrays;
import java.util.Base64;
import java.util.Scanner;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicBoolean;

import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.DataLine;
import javax.sound.sampled.LineUnavailableException;
import javax.sound.sampled.SourceDataLine;
import javax.sound.sampled.TargetDataLine;

// simple-openai imports; exact package paths may vary by library version
import io.github.sashirestela.openai.SimpleOpenAI;
import io.github.sashirestela.openai.domain.realtime.*;

public class RealtimeDemo {

    private static final int BUFFER_SIZE = 8192; // Size of audio data chunks

    public static void main(String[] args) throws LineUnavailableException {
        var sound = new Sound(); // Initialize audio input/output

        // Initialize OpenAI client with API key and Realtime configuration
        var openAI = SimpleOpenAI.builder()
                .apiKey(System.getenv("OPENAI_API_KEY")) // Get API key from environment variable
                .realtimeConfig(RealtimeConfig.of("gpt-4o-mini-realtime-preview")) // Set the model
                .build();

        // Configure the Realtime session
        var session = RealtimeSession.builder()
                .modality(Modality.AUDIO) // Enable audio input/output
                .modality(Modality.TEXT) // Enable text input/output
                .instructions("Respond with short, direct sentences.") // Initial instructions for the AI
                .voice(RealtimeSession.VoiceRealtime.ECHO) // Set the AI voice
                .outputAudioFormat(RealtimeSession.AudioFormatRealtime.PCM16) // Set output audio format
                .inputAudioTranscription(RealtimeSession.InputAudioTranscription.of("whisper-1")) // Set transcription model
                .temperature(0.9) // Set temperature for AI responses
                .build();

        var realtime = openAI.realtime(); // Get the Realtime API instance

        // Event handler for audio deltas (chunks of audio from the AI)
        realtime.onEvent(ServerEvent.ResponseAudioDelta.class, event -> {
            var audioBytes = Base64.getDecoder().decode(event.getDelta()); // Decode the Base64 audio data
            sound.speaker.write(audioBytes, 0, audioBytes.length); // Play the audio chunk
        });

        // Event handler for the end of an audio response
        realtime.onEvent(ServerEvent.ResponseAudioDone.class, event -> {
            delay(1000); // Short delay so late-arriving audio deltas are still written
            sound.speaker.drain(); // Block until all buffered audio has been played
            sound.speaker.stop(); // Stop playback only once the buffer is empty
        });

        // Event handler for the completed transcript of the AI's audio response
        realtime.onEvent(ServerEvent.ResponseAudioTranscriptDone.class, event -> {
            System.out.println(event.getTranscript()); // Print the transcribed text
            askForSpeaking(); // Prompt the user to speak again
        });

        // Event handler for the completion of audio transcription for the user's question
        realtime.onEvent(ServerEvent.ConversationItemAudioTransCompleted.class, event -> {
            System.out.print("Your question was: ");
            System.out.println(event.getTranscript()); // Print the user's transcribed question
        });

        // Establish the real-time connection and send the session configuration
        realtime.connect()
                .thenCompose(v -> realtime.send(ClientEvent.SessionUpdate.of(session)))
                .join();

        System.out.println("Connection established!");
        System.out.println("(Press any key and Return to terminate)");

        Scanner scanner = new Scanner(System.in);
        askForSpeaking(); // Prompt the user to speak

        // Main loop for recording and sending audio
        while (true) {
            sound.microphone.start(); // Start recording
            AtomicBoolean isRecording = new AtomicBoolean(true); // Flag to control recording

            // Asynchronous task for recording and sending audio data
            CompletableFuture<Void> recordingFuture = CompletableFuture.runAsync(() -> {
                byte[] data = new byte[BUFFER_SIZE];
                while (isRecording.get()) {
                    int bytesRead = sound.microphone.read(data, 0, data.length); // Read audio data
                    if (bytesRead > 0) {
                        // Encode only the bytes actually read, not the entire buffer
                        var dataBase64 = Base64.getEncoder()
                                .encodeToString(Arrays.copyOf(data, bytesRead));
                        realtime.send(ClientEvent.InputAudioBufferAppend.of(dataBase64)); // Send audio data
                        delay(10); // Small delay to prevent overwhelming the API
                    }
                }
            });

            var keyPressed = scanner.nextLine(); // Wait for user input (Enter key)
            isRecording.set(false); // Stop recording
            if (keyPressed.isEmpty()) { // If Enter key is pressed
                sound.microphone.stop();
                sound.microphone.drain();
                recordingFuture.join(); // Wait for recording task to finish
                realtime.send(ClientEvent.ResponseCreate.of(null)); // Request the AI's response to the buffered audio
                System.out.println("Waiting for AI response...\n");
                sound.speaker.start(); // Start playback
            } else { // If any other key is pressed
                recordingFuture.join();
                break; // Exit the loop
            }
        }

        scanner.close(); // Close the scanner
        sound.cleanup(); // Clean up audio resources
        realtime.disconnect(); // Disconnect from the API
        openAI.shutDown(); // Shut down SimpleOpenAI
    }

    private static void askForSpeaking() {
        System.out.println("\nSpeak your question (press Return when done):");
    }

    private static void delay(int milliseconds) {
        try {
            Thread.sleep(milliseconds);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Inner class for managing audio input and output
    public static class Sound {

        private static final float SAMPLE_RATE = 24000f;
        private static final int SAMPLE_SIZE_BITS = 16;
        private static final int CHANNELS = 1;
        private static final boolean SIGNED = true;
        private static final boolean BIG_ENDIAN = false;

        private TargetDataLine microphone;
        private SourceDataLine speaker;

        public Sound() throws LineUnavailableException {
            AudioFormat format = new AudioFormat(
                    SAMPLE_RATE,
                    SAMPLE_SIZE_BITS,
                    CHANNELS,
                    SIGNED,
                    BIG_ENDIAN);

            // Initialize microphone
            DataLine.Info micInfo = new DataLine.Info(TargetDataLine.class, format);
            if (!AudioSystem.isLineSupported(micInfo)) {
                throw new LineUnavailableException("Microphone not supported");
            }
            microphone = (TargetDataLine) AudioSystem.getLine(micInfo);
            microphone.open(format);

            // Initialize speaker
            DataLine.Info speakerInfo = new DataLine.Info(SourceDataLine.class, format);
            if (!AudioSystem.isLineSupported(speakerInfo)) {
                throw new LineUnavailableException("Speakers not supported");
            }
            speaker = (SourceDataLine) AudioSystem.getLine(speakerInfo);
            speaker.open(format);
        }

        // Release audio resources
        public void cleanup() {
            microphone.stop();
            microphone.drain();
            microphone.close();

            speaker.stop();
            speaker.drain();
            speaker.close();
        }
    }
}