I have read several posts on this forum about problems with the Realtime API, especially two points:
- Audio cuts off at the end of each AI response.
- No transcript of the user's audio is received.
I want to share how I overcame these problems. One caveat: my experience is based on Q&A scenarios with a Java backend.
Audio cuts off at the end of each AI response
After you finish speaking, you send a response.create request. The AI then streams audio fragments via several response.audio.delta events and sends a response.audio.done event when it finishes. You play each audio delta as it arrives and stop the speaker once the done event arrives. Because handling the buffered audio deltas takes longer than receiving that event, you need to add a small delay (and drain the audio line) before stopping the speaker; otherwise the tail of the response gets cut off. This solved the problem I was experiencing.
User audio transcription not received
To be honest, I haven't experienced this problem, but perhaps some of you forgot to configure the transcription model (whisper-1) in the session, or to handle the user audio transcription events asynchronously; it could also depend on the library you use.
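For reference, these are the two pieces the transcription needs, taken from the demo below. Note that the user's transcript arrives in its own asynchronous event (conversation.item.input_audio_transcription.completed in the raw API), separate from the response events:

var session = RealtimeSession.builder()
    .inputAudioTranscription(RealtimeSession.InputAudioTranscription.of("whisper-1")) // enable user transcription
    // ... other session settings ...
    .build();

// Fires asynchronously once the user's audio has been transcribed
realtime.onEvent(ServerEvent.ConversationItemAudioTransCompleted.class, event -> {
    System.out.println("Your question was: " + event.getTranscript());
});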
Demo code
Below is demo code for a question-and-answer interaction based on the Realtime API, using Java and the simple-openai library. The commented code overcomes both problems:
import java.util.Arrays;
import java.util.Base64;
import java.util.Scanner;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicBoolean;

import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.DataLine;
import javax.sound.sampled.LineUnavailableException;
import javax.sound.sampled.SourceDataLine;
import javax.sound.sampled.TargetDataLine;

// Plus the simple-openai classes used below (SimpleOpenAI, RealtimeConfig, RealtimeSession,
// Modality, ClientEvent, ServerEvent); their import paths depend on your simple-openai version.

public class RealtimeDemo {
private static final int BUFFER_SIZE = 8192; // Size of audio data chunks
public static void main(String[] args) throws LineUnavailableException {
var sound = new Sound(); // Initialize audio input/output
// Initialize OpenAI client with API key and Realtime configuration
var openAI = SimpleOpenAI.builder()
.apiKey(System.getenv("OPENAI_API_KEY")) // Get API key from environment variable
.realtimeConfig(RealtimeConfig.of("gpt-4o-mini-realtime-preview")) // Set the model
.build();
// Configure the Realtime session
var session = RealtimeSession.builder()
.modality(Modality.AUDIO) // Enable audio input/output
.modality(Modality.TEXT) // Enable text input/output
.instructions("Respond with short, direct sentences.") // Initial instructions for the AI
.voice(RealtimeSession.VoiceRealtime.ECHO) // Set the AI voice
.outputAudioFormat(RealtimeSession.AudioFormatRealtime.PCM16) // Set output audio format
.inputAudioTranscription(RealtimeSession.InputAudioTranscription.of("whisper-1")) // Set transcription model
.temperature(0.9) // Set temperature for AI responses
.build();
var realtime = openAI.realtime(); // Get the Realtime API instance
// Event handler for audio deltas (chunks of audio from the AI)
realtime.onEvent(ServerEvent.ResponseAudioDelta.class, event -> {
var audioBytes = Base64.getDecoder().decode(event.getDelta()); // Decode the Base64 audio data
sound.speaker.write(audioBytes, 0, audioBytes.length); // Play the audio
});
// Event handler for the end of an audio response
realtime.onEvent(ServerEvent.ResponseAudioDone.class, event -> {
delay(1000); // Short delay so any in-flight audio deltas still get written
sound.speaker.drain(); // Block until all queued audio has actually been played
sound.speaker.stop(); // Then stop playback
});
// Event handler for the completed transcript of the AI's audio response
realtime.onEvent(ServerEvent.ResponseAudioTranscriptDone.class, event -> {
System.out.println(event.getTranscript()); // Print the transcribed text
askForSpeaking(); // Prompt the user to speak again
});
// Event handler for the completion of audio transcription for the user's question
realtime.onEvent(ServerEvent.ConversationItemAudioTransCompleted.class, event -> {
System.out.print("Your question was: ");
System.out.println(event.getTranscript()); // Print the user's transcribed question
});
// Establish the real-time connection and send the session configuration
realtime.connect()
.thenCompose(v -> realtime.send(ClientEvent.SessionUpdate.of(session)))
.join();
System.out.println("Connection established!");
System.out.println("(Press any key and Return to terminate)");
Scanner scanner = new Scanner(System.in);
askForSpeaking(); // Prompt the user to speak
// Main loop for recording and sending audio
while (true) {
sound.microphone.start(); // Start recording
AtomicBoolean isRecording = new AtomicBoolean(true); // Flag to control recording
// Asynchronous task for recording and sending audio data
CompletableFuture<Void> recordingFuture = CompletableFuture.runAsync(() -> {
byte[] data = new byte[BUFFER_SIZE];
while (isRecording.get()) {
int bytesRead = sound.microphone.read(data, 0, data.length); // Read audio data
if (bytesRead > 0) {
// Encode only the bytes actually read (the last chunk may be shorter than the buffer)
var dataBase64 = Base64.getEncoder().encodeToString(Arrays.copyOf(data, bytesRead));
realtime.send(ClientEvent.InputAudioBufferAppend.of(dataBase64)); // Send audio data
delay(10); // Small delay to avoid flooding the API
}
}
});
var keyPressed = scanner.nextLine(); // Wait for user input (Enter key)
isRecording.set(false); // Stop recording
if (keyPressed.isEmpty()) { // If Enter key is pressed
sound.microphone.stop();
sound.microphone.drain();
recordingFuture.join(); // Wait for recording task to finish
realtime.send(ClientEvent.ResponseCreate.of(null)); // Ask the AI to generate its response
System.out.println("Waiting for AI response...\n");
sound.speaker.start(); // Start playback
} else { // If any other key is pressed
recordingFuture.join();
break; // Exit the loop
}
}
scanner.close(); // Close the scanner
sound.cleanup(); // Clean up audio resources
realtime.disconnect(); // Disconnect from the API
openAI.shutDown(); // Shut down SimpleOpenAI
}
private static void askForSpeaking() {
System.out.println("\nSpeak your question (press Return when done):");
}
private static void delay(int milliseconds) {
try {
Thread.sleep(milliseconds);
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
}
}
// Inner class for managing audio input and output
public static class Sound {
private static final float SAMPLE_RATE = 24000f;
private static final int SAMPLE_SIZE_BITS = 16;
private static final int CHANNELS = 1;
private static final boolean SIGNED = true;
private static final boolean BIG_ENDIAN = false;
private TargetDataLine microphone;
private SourceDataLine speaker;
public Sound() throws LineUnavailableException {
AudioFormat format = new AudioFormat(
SAMPLE_RATE,
SAMPLE_SIZE_BITS,
CHANNELS,
SIGNED,
BIG_ENDIAN);
// Initialize microphone
DataLine.Info micInfo = new DataLine.Info(TargetDataLine.class, format);
if (!AudioSystem.isLineSupported(micInfo)) {
throw new LineUnavailableException("Microphone not supported");
}
microphone = (TargetDataLine) AudioSystem.getLine(micInfo);
microphone.open(format);
// Initialize speaker
DataLine.Info speakerInfo = new DataLine.Info(SourceDataLine.class, format);
if (!AudioSystem.isLineSupported(speakerInfo)) {
throw new LineUnavailableException("Speakers not supported");
}
speaker = (SourceDataLine) AudioSystem.getLine(speakerInfo);
speaker.open(format);
}
// Release audio resources
public void cleanup() {
microphone.stop();
microphone.drain();
microphone.close();
speaker.stop();
speaker.drain();
speaker.close();
}
}
}