Playing audio in JS sent from realtime API

Is there any documentation on how to actually play (in JS) the audio deltas sent from Realtime API?

I’m sending each audio delta to my frontend, converting each one to an ArrayBuffer, and trying to decode it with audioContext.decodeAudioData(arrayBuffer), but I’m getting an error saying it can’t decode.

Thanks

2 Likes

Same here. I created a buffer out of each delta and then concatenated them, then I saved the file, but it just won’t play.

1 Like

I’m gonna try concatenating the deltas before the ArrayBuffer conversion and see what happens. Surprised it’s so hard to find a working example of this.

1 Like

Basically, you’re trying to get these “audio deltas” (little chunks of sound) sent by the API to actually play on the browser, right? The issue you’re hitting comes from the fact that the audio isn’t in a format that your browser can decode easily, like a regular WAV or MP3. Instead, it’s probably in something more compressed, like Opus or Ogg, and the browser’s AudioContext.decodeAudioData() can’t handle that directly.

So the fix here is: you’ll need to decode the audio first before the browser can play it. There are libraries out there, like Opus.js or Aurora.js, that can help with this. They take the compressed audio and turn it into a format your browser can handle, like PCM (the basic audio data browsers use).

Once you’ve decoded it, you can pass the data to the AudioContext and it’ll play smoothly. It’s like you’re translating the audio into a language your browser understands.

In short:

  1. Decode the audio delta using a library.
  2. Play it with AudioContext after decoding.

I used to do OGG in IMVU; I sold music rings. :rabbit::honeybee:

1 Like

I have the audio_output_format set to ‘pcm16’. So shouldn’t I be able to play this (once converted to ArrayBuffer) in audioContext with decodeAudioData(ArrayBuffer)?

Thanks

4 Likes

Hey, I got it working. The key for me was that neither decodeAudioData() nor the ‘audio-decode’ npm library can work with raw PCM. You need to wrap it in a WAV container.

  1. (audio delta from OpenAI) Base64 PCM16 → binary string… you can use atob()
  2. Binary string → ArrayBuffer (ChatGPT can write you a function)
  3. Create a wavHeader ArrayBuffer (ChatGPT can write you a function… I used a 24000 samplingRate)
  4. Concat the wavHeader ArrayBuffer + the PCM16 ArrayBuffer created in step 2
  5. The ArrayBuffer from step 4 is what I send via WebSocket to my client (make sure the ws ‘type’ of the client connection is ‘arraybuffer’)
  6. Decode what I receive on the client (the resulting ArrayBuffer from step 4)
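Steps 1–4 can be sketched like this. The helper names are mine, and the 44-byte header layout assumes the canonical WAV format with 16-bit mono PCM at 24 kHz (what the Realtime API sends for ‘pcm16’):

```javascript
// Step 1 + 2: Base64 delta -> binary string -> ArrayBuffer
function base64ToArrayBuffer(base64) {
  const binary = atob(base64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return bytes.buffer;
}

// Step 3: build a canonical 44-byte WAV header for raw PCM
function makeWavHeader(dataLength, sampleRate = 24000, channels = 1, bitsPerSample = 16) {
  const header = new ArrayBuffer(44);
  const view = new DataView(header);
  const byteRate = sampleRate * channels * bitsPerSample / 8;
  const writeString = (offset, s) => {
    for (let i = 0; i < s.length; i++) view.setUint8(offset + i, s.charCodeAt(i));
  };
  writeString(0, 'RIFF');
  view.setUint32(4, 36 + dataLength, true);  // file size minus the first 8 bytes
  writeString(8, 'WAVE');
  writeString(12, 'fmt ');
  view.setUint32(16, 16, true);              // fmt chunk size
  view.setUint16(20, 1, true);               // audio format 1 = PCM
  view.setUint16(22, channels, true);
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, byteRate, true);
  view.setUint16(32, channels * bitsPerSample / 8, true); // block align
  view.setUint16(34, bitsPerSample, true);
  writeString(36, 'data');
  view.setUint32(40, dataLength, true);      // PCM payload length
  return header;
}

// Step 4: concatenate header + PCM into one ArrayBuffer
function wrapPcmInWav(base64Delta) {
  const pcm = base64ToArrayBuffer(base64Delta);
  const header = makeWavHeader(pcm.byteLength);
  const wav = new Uint8Array(44 + pcm.byteLength);
  wav.set(new Uint8Array(header), 0);
  wav.set(new Uint8Array(pcm), 44);
  return wav.buffer;
}
```

The resulting ArrayBuffer is something decodeAudioData() should accept, since it now looks like a regular WAV file.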

Hope this helps. Let me know if you have issues

4 Likes

You’re a genius :heart: :partying_face:
It works wonders! :rocket: Thank you very much. I guess the wavHeader was the secret ingredient we didn’t know about!

1 Like

Figured it out.

to listen:

export async function startRealtimeListening() {
    try {
        // Start capturing audio
        const stream = await navigator.mediaDevices.getUserMedia({
            audio: {
                sampleRate: 16000,
                channelCount: 1,
                echoCancellation: true,
            },
        });

        const audioContext = new AudioContext({sampleRate: 16000});
        const source = audioContext.createMediaStreamSource(stream);
        const processor = audioContext.createScriptProcessor(4096, 1, 1);

        processor.onaudioprocess = (event) => {
            const audioData = event.inputBuffer.getChannelData(0);

            const int16Buffer = convertFloat32ToInt16(audioData);
            Game.realtimeClient.appendInputAudio(int16Buffer);  // Send audio data to RealtimeClient
        };

        source.connect(processor);
        processor.connect(audioContext.destination);
        Game.isListening = true;
    } catch (error) {
        console.error(error);
        Game.isListening = false;
    }
}

export function convertFloat32ToInt16(buffer) {
    let l = buffer.length;
    const buf = new Int16Array(l);
    while (l--) {
        buf[l] = Math.max(-1, Math.min(1, buffer[l])) * 0x7FFF; // clamp to [-1, 1] before scaling to Int16
    }
    return buf.buffer;
}

to speak:


class AudioQueueManager {
    audioQueue = [];
    isPlaying = false;
    pitchFactor = 0.5; // Default pitch factor for the voices. You can make it sound godlike if run at 0.2

    constructor() {
        makeAutoObservable(this);
    }

    // Function to set the pitch factor
    setPitchFactor(factor) {
        this.pitchFactor = factor;
    }

    // Function to add audio data to the queue
    addAudioToQueue(audioData) {
        this.audioQueue.push(audioData);
        this.playNext(); // Start playing if not already playing
    }

    // Function to play the next audio chunk in the queue
    async playNext() {
        if (this.isPlaying || this.audioQueue.length === 0) return;

        this.isPlaying = true;

        const audioData = this.audioQueue.shift(); // Get the next audio chunk
        await this.playAudio(audioData); // Play the audio

        this.isPlaying = false;
        this.playNext(); // Play the next audio in the queue
    }

    // Function to play a single audio chunk with pitch adjustment
    playAudio(audioBuffer) {
        return new Promise((resolve) => {
            // Create an AudioContext (note: creating one per chunk is wasteful, and
            // some browsers cap the number of concurrent contexts; reusing one is better)
            const audioContext = new (window.AudioContext || window.webkitAudioContext)();

            // Convert Int16Array (PCM16) to Float32Array
            const float32Array = new Float32Array(audioBuffer.length);
            for (let i = 0; i < audioBuffer.length; i++) {
                float32Array[i] = audioBuffer[i] / 0x7FFF; // Normalize to -1.0 to 1.0
            }

            // Create an AudioBuffer at the context's native rate (often 44.1/48 kHz);
            // the API sends 24 kHz PCM, so the pitchFactor above likely compensates for the mismatch
            const audioBufferObj = audioContext.createBuffer(1, float32Array.length, audioContext.sampleRate);
            audioBufferObj.copyToChannel(float32Array, 0); // Copy PCM data to the buffer

            // Create a BufferSource to play the audio
            const source = audioContext.createBufferSource();
            source.buffer = audioBufferObj;
            source.playbackRate.value = this.pitchFactor; // Adjust pitch if necessary
            source.connect(audioContext.destination);

            source.onended = () => {
                resolve(); // Resolve when the playback ends
            };

            source.start(0); // Start playback
        });
    }
}

// Create an instance of the audio queue manager
const audioQueueManager = new AudioQueueManager();

// Function to handle the conversation update events
function handleConversationUpdated(event) {
    const { item, delta } = event;

    if (delta?.audio) {
        audioQueueManager.addAudioToQueue(delta.audio); // Add incoming audio chunk to the queue
    }
}
1 Like

Hi, could you post your code?
I followed the instructions but was not able to get it working.
I’m using native ws implementation and picking up response.audio.delta event.

else if (data.type === "response.audio.delta") {
    audioChunks.push(data.delta);
    decodeAudio(data.delta);
}
1 Like

I got it to play back, but there’s a constant clicking sound. Perhaps there’s a mismatched sample rate, but I have tried several things, and it’s still there.

Could you take a look at my code and see if you spot something? Thank you!

From the server I send

const audioDelta = {
    event: 'media',
    streamSid: streamSid,
    media: { payload: Buffer.from(response.delta, 'base64').toString('base64') }
};
connection.send(JSON.stringify(audioDelta));

Then on client:

if (data.media.payload) {
    const audioData = new Uint8Array(atob(data.media.payload).split("").map(c => c.charCodeAt(0)));
    const wavData = createWavData(audioData, 24000);

    const int16Array = new Int16Array(wavData.buffer);

    audioQueue.push(int16Array);

    playNext();
}
1 Like

It looks like you’re trying to decode the Base64 encoded PCM16 sent to you by the Realtime API. I’d take another look at steps 1 - 4.

  1. Decode from Base64 to binary string using atob()
  2. Convert from binary string to array buffer
  3. Create a wav header array buffer
  4. Concatenate wavHeaderArrayBuffer + audioArrayBuffer

Decode the resulting ArrayBuffer from step 4
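One common pitfall worth checking, assuming a standard 44-byte WAV header: if you build an Int16Array over the whole WAV buffer, the header bytes get interpreted as samples, which produces an audible click at the start of every chunk. Either hand the full WAV buffer to decodeAudioData(), or skip the header when you want raw samples (the helper name here is illustrative):

```javascript
// Skip the 44-byte WAV header so only PCM samples end up in the Int16Array.
// (Assumes the canonical header size; a WAV with extra chunks would need real parsing.)
function pcmSamplesFromWav(wavBuffer) {
  const headerBytes = 44;
  const sampleCount = (wavBuffer.byteLength - headerBytes) / 2; // 2 bytes per 16-bit sample
  return new Int16Array(wavBuffer, headerBytes, sampleCount);
}
```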

2 Likes

Using a ScriptProcessorNode is deprecated; AudioWorklet is its replacement.

I recommend cloning the OpenAI Console demo app. They have working audio streaming code in the libs folder.

1 Like

Can someone help me write a function to resample the audio from 24 kHz to 8 kHz PCM and vice versa? I am currently using an ffmpeg library with code generated by ChatGPT, but it’s not perfect. There is some latency and some noise.

There is probably a more direct way too.
@aistuff posted some steps which are helpful, but does someone have an implementation?

Here is the code

const ffmpeg = require('fluent-ffmpeg');
const { Readable, PassThrough } = require('stream');

function resampleWithFfmpeg(inputBuffer, inputFrequency, outputFrequency) {
  return new Promise((resolve, reject) => {
    const inputBuffer2 = Buffer.from(inputBuffer, 'base64');
    const inputStream = Readable.from(inputBuffer2);
    let outputBuffer = Buffer.alloc(0);
    const outputStream = new PassThrough();

    // Collect output chunks as they arrive
    outputStream.on('data', (chunk) => {
      outputBuffer = Buffer.concat([outputBuffer, chunk]);
    });

    const timeout = setTimeout(() => {
      reject(new Error('Audio processing timeout'));
      outputStream.destroy();
    }, 60000);

    outputStream.on('end', () => clearTimeout(timeout));

    ffmpeg(inputStream)
      .inputFormat('s16le')            // raw 16-bit little-endian PCM in
      .inputOptions([
        `-ar ${inputFrequency}`,       // dynamic input sample rate
        '-ac 1',
      ])
      .outputFormat('s16le')
      .audioFrequency(outputFrequency) // dynamic output sample rate
      .audioChannels(1)                // mono output
      .on('error', (err) => {
        console.error('FFmpeg error:', err.message);
        reject(new Error(`FFmpeg error: ${err.message}`));
      })
      .on('end', () => {
        resolve(outputBuffer.toString('base64'));
      })
      .writeToStream(outputStream);
  });
}
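If spawning ffmpeg per chunk is the source of the latency, a dependency-free option is a naive linear-interpolation resampler in plain JS. This is a sketch only: it has no anti-aliasing filter, so real downsampling to 8 kHz will sound rougher than ffmpeg or sox, but it runs in-process with no subprocess overhead:

```javascript
// Naive linear-interpolation resampler for 16-bit mono PCM.
// No low-pass filtering, so downsampling can alias; fine as a quick sketch.
function resamplePcm16(input, inputRate, outputRate) {
  const ratio = inputRate / outputRate;
  const outLength = Math.floor(input.length / ratio);
  const output = new Int16Array(outLength);
  for (let i = 0; i < outLength; i++) {
    const pos = i * ratio;              // fractional position in the input
    const idx = Math.floor(pos);
    const frac = pos - idx;
    const a = input[idx];
    const b = input[Math.min(idx + 1, input.length - 1)];
    output[i] = Math.round(a + (b - a) * frac); // interpolate between neighbors
  }
  return output;
}
```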

Use sox; I generated this code with Claude.

const sox = require('sox-stream');
const { PassThrough } = require('stream');
const { Buffer } = require('buffer');

/**
 * Resamples PCM audio, with a volume adjustment to prevent clipping
 * @param {string} base64String - Base64 encoded PCM audio data
 * @param {number} targetSampleRate - Output sample rate in Hz
 * @param {number} originalSampleRate - Input sample rate in Hz
 * @param {Object} options - Conversion options
 * @param {number} options.bits - Bit depth of input/output (default: 16)
 * @param {number} options.channels - Number of channels (default: 1)
 * @param {number} options.volumeAdjustment - Volume adjustment in dB (default: -3)
 * @returns {Promise<string>} Base64 encoded resampled PCM audio data
 */
async function resampleAudio(base64String, targetSampleRate, originalSampleRate, {
    bits = 16,
    channels = 1,
    volumeAdjustment = -3
} = {}) {
    if (!base64String || typeof base64String !== 'string') {
        throw new Error('Invalid input: base64String must be a non-empty string');
    }

    return new Promise((resolve, reject) => {
        try {
            // Decode base64 to Buffer
            const inputBuffer = Buffer.from(base64String, 'base64');

            // Create streams
            const inputStream = new PassThrough();
            const outputStream = new PassThrough();

            // Configure sox transform with volume adjustment and rate conversion
            const transform = sox({
                input: {
                    type: 'raw',
                    rate: originalSampleRate,
                    channels: channels,
                    bits: bits,
                    encoding: 'signed-integer',
                },
                output: {
                    type: 'raw',
                    rate: targetSampleRate,
                    channels: channels,
                    bits: bits,
                    encoding: 'signed-integer',
                },
                // Add volume adjustment before rate conversion
                effects: [
                    ['vol', `${volumeAdjustment}dB`],  // Reduce volume to prevent clipping
                    ['rate', '-v', '-L', `${targetSampleRate}`]      // High-quality rate conversion
                ]
            });

            // Set up error handlers
            const handleError = (err) => {
                // Ignore rate warning messages about clipping
                if (err.message && err.message.includes('sox WARN rate')) {
                    return;
                }
                cleanup();
                reject(new Error(`Sample rate conversion failed: ${err.message}`));
            };

            inputStream.on('error', handleError);
            transform.on('error', handleError);
            outputStream.on('error', handleError);

            // Collect output chunks
            const chunks = [];
            outputStream.on('data', chunk => chunks.push(chunk));

            outputStream.on('end', () => {
                try {
                    const resampledBuffer = Buffer.concat(chunks);
                    const outputBase64String = resampledBuffer.toString('base64');
                    cleanup();
                    resolve(outputBase64String);
                } catch (err) {
                    handleError(err);
                }
            });

            // Cleanup function to remove listeners
            const cleanup = () => {
                inputStream.removeAllListeners();
                transform.removeAllListeners();
                outputStream.removeAllListeners();
            };

            // Start the processing pipeline
            inputStream.end(inputBuffer);
            inputStream
                .pipe(transform)
                .pipe(outputStream);

        } catch (err) {
            reject(new Error(`Failed to initialize sample rate conversion: ${err.message}`));
        }
    });
}