Audio Transcription API chunking_strategy option

simonl · May 14, 2025, 1:18pm

I was checking the audio transcription API reference and found a new chunking_strategy option. Since the API uses form data instead of JSON, I am wondering what the form data should look like when chunking_strategy is set to server_vad. Do we encode the entire object as a JSON string, or do we send individual fields in the form data?

Thanks!

_j · May 14, 2025, 2:09pm

BTW, this parameter relates not to the input, but to how the audio is processed.

The input is split server-side, and then each chunk is run separately.

The documentation doesn’t say whether this is just an efficiency measure or whether you’d get back something different. Since it’s only returning text, I’m thinking the text is just appended for the transcript.

First, “auto” doesn’t work for chunking_strategy: another case of wrong documentation.

With this option, I simply get no text content, just a JSON response with an empty text field. Over and over with different options, nothing. The options are received, because setting threshold out of range returns a 400 Bad Request.

{"text":""}

Here’s how I constructed the call, where even commenting out the last parameter response_format makes no difference:

import os
import asyncio
import httpx
import aiofiles

async def async_transcribe_audio(input_file_path: str, output_file_path: str):
    url = "https://api.openai.com/v1/audio/transcriptions"
    headers = {
        "Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY')}"
    }
    
    chunking=r"""{
"type": "server_vad",
"prefix_padding_ms": 200,
"silence_duration_ms": 200,
"threshold": 0.1
}"""
    # Using httpx to handle file upload without loading the whole file into memory
    async with httpx.AsyncClient(timeout=600) as client:
        files = {
            'file': (input_file_path.split('/')[-1], open(input_file_path, 'rb'), 'audio/mpeg'),
            'model': (None, 'gpt-4o-transcribe'),
            #'language': (None, 'en'),
            #'prompt': (None, "Welcome to our radio show."),
            'response_format': (None, 'json'),
            #'temperature': (None, '0.2')
            'chunking_strategy': (None, chunking),
        }
        try:
            response = await client.post(url, headers=headers, files=files)
            response.raise_for_status()  # Will raise an exception for HTTP error responses
            transcription = response.json()
            response_body = response.text
        except Exception as e:
            print(f"An API error occurred: {e}")
            raise
        finally:
            files['file'][1].close()  # Ensure we close the file

Maybe it is one more thing for OpenAI to shut off with ID verification.

simonl · May 14, 2025, 2:39pm

@_j
Thank you for your input! I think I understand what the parameter does. I am asking how the chunking_strategy parameter needs to be formatted, because the API request uses Content-Type: multipart/form-data instead of Content-Type: application/json. I see you are calling the endpoint directly with httpx in Python. Can you get the server_vad setting to work on your end with the Python script you provided?

I figured that if plain auto doesn’t work, maybe it should be {"type": "auto"} instead? Or maybe it needs to be a JSON string, i.e. "auto"?
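
To make the candidates concrete, here is a minimal sketch of the three literal values being discussed for the chunking_strategy form field. The function name and labels are made up for illustration, and nothing here says which one (if any) the server accepts.

```python
import json

def candidate_form_values() -> dict:
    """Return the literal strings that would be sent as the form field value."""
    return {
        # The bare word, exactly as typed
        "bare_string": "auto",
        # The word encoded as a JSON string, so the quotes are part of the value
        "json_string": json.dumps("auto"),
        # An object with a type key, encoded as JSON text
        "json_object": json.dumps({"type": "auto"}),
    }

for label, value in candidate_form_values().items():
    print(f"{label}: {value}")
```

Printing the values side by side shows exactly which characters end up in the multipart body for each guess.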

_j · May 14, 2025, 2:52pm

The API documentation says [ str | object ] - but it doesn’t take the string “auto”.

Your idea:
"error": { "message": "Invalid value: 'auto'. Value must be 'server_vad'."

As I said, the call is constructed and received without error, sending over my audio that works without chunking_strategy added.

Nothing is returned for the “text” value. No bill for tokens.

Also tried a single line JSON object, in case something after API validation needs that.

Maybe it’s not implemented yet.

more:

…now “auto” is validating - and also returning nothing instead of a transcript.
Someone at OpenAI eavesdropping?

sps · May 16, 2025, 1:27pm

Hi @simonl

The chunking_strategy param is supposed to be used when streaming transcriptions with gpt-4o-transcribe or gpt-4o-mini-transcribe.

As described in the API Reference, when set to auto, the server first normalizes loudness and then uses voice activity detection (VAD) to choose boundaries.
If unset, the audio is transcribed as a single block.

Thus if you want to stream audio transcription chunks, you have to set it to auto when setting stream=True.

It’s not yet implemented in the python helper library, so you’ll have to use it like in the following code:

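from openai import OpenAI

client = OpenAI()

# Open the audio in a context manager so the handle is closed after streaming
with open("/path/to/file/speech.mp3", "rb") as audio_file:
    stream = client.audio.transcriptions.create(
        model="gpt-4o-mini-transcribe",
        file=audio_file,
        response_format="text",
        stream=True,
        # Not yet a named argument in the SDK, so pass it through extra_body
        extra_body={"chunking_strategy": "auto"},
    )

    # Each event is a streamed piece of the transcription
    for event in stream:
        print(event)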
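
If you want the full transcript rather than a printout of each event, a rough sketch like the following could join the pieces. It assumes the streamed events expose a type of "transcript.text.delta" and a delta attribute, as the event printouts elsewhere in this thread suggest. The collect_transcript helper and the fake events are made up for illustration so the snippet runs without an API call.

```python
from types import SimpleNamespace
from typing import Iterable

def collect_transcript(events: Iterable) -> str:
    """Join the delta text of all transcript.text.delta events, in order."""
    parts = []
    for event in events:
        # Skip anything that is not a text delta (for example the final done event)
        if getattr(event, "type", None) == "transcript.text.delta":
            parts.append(event.delta)
    return "".join(parts)

# Stand-in events for demonstration; a real run would pass the SDK stream instead
fake_events = [
    SimpleNamespace(type="transcript.text.delta", delta="Houston"),
    SimpleNamespace(type="transcript.text.delta", delta=" on"),
    SimpleNamespace(type="transcript.text.delta", delta=" two"),
    SimpleNamespace(type="transcript.text.done", text="Houston on two"),
]
print(collect_transcript(fake_events))
```

With a real stream you would call collect_transcript(stream) in place of the print loop.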

simonl · May 16, 2025, 2:30pm

Thank you! If I want to use the server_vad type for chunking, how should I pass in the parameter now? Should I serialize the object into a JSON string and pass it in extra_body? It seems @_j was doing the same, but it wasn’t working for him.

I am using gpt-4o-mini-transcribe and want to implement the chunking strategy in a Swift library, since I am working on an iOS App. It would be good to know the exact parameters to pass in for the multipart/form-data POST body for the API.
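
One encoding that might be worth trying, and that this thread does not confirm, is flattening the object into bracketed field names (chunking_strategy[type] and so on), which is a common way for multipart encoders to represent nested objects. The sketch below only shows what those field names and values would look like. The flatten_form_fields helper is hypothetical, and whether the endpoint expects this layout is an assumption.

```python
from typing import Any, List, Tuple

def flatten_form_fields(name: str, value: Any) -> List[Tuple[str, str]]:
    """Flatten a nested dict into (field_name, string_value) pairs with bracket notation."""
    if isinstance(value, dict):
        pairs: List[Tuple[str, str]] = []
        for key, inner in value.items():
            # Recurse so deeper nesting becomes name[outer][inner]
            pairs.extend(flatten_form_fields(f"{name}[{key}]", inner))
        return pairs
    if isinstance(value, bool):
        # Form values are text, so spell booleans in lowercase
        return [(name, "true" if value else "false")]
    return [(name, str(value))]

fields = flatten_form_fields("chunking_strategy", {
    "type": "server_vad",
    "prefix_padding_ms": 200,
    "silence_duration_ms": 200,
    "threshold": 0.1,
})
for field_name, field_value in fields:
    print(f"{field_name} = {field_value}")
```

Each printed pair would become one text part in the multipart/form-data body, which should be straightforward to mirror in Swift if this layout turns out to be what the server wants.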

_j · May 16, 2025, 3:10pm

It is simply not implemented and not working.

Streaming SSE returns the same empty object. Nothing.

Doesn’t matter if using “auto” for their chunking method or an object for your customization. Nothing.

data: {"type":"transcript.text.done","text":""}

data: [DONE]

Feel free to get your own audio version of “nothing” as it will not cost you any tokens and nothing is being done:

import os
import asyncio
import httpx
import aiofiles
import json

response_body = ""

async def async_transcribe_audio(input_file_path: str, output_file_path: str):
    global response_body  # for diagnosis at REPL console
    url = "https://api.openai.com/v1/audio/transcriptions"
    headers = {
        "Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY')}"
    }
    chunking = json.dumps({
        "type": "server_vad",
        "prefix_padding_ms": 200,
        "silence_duration_ms": 200,
        "threshold": 0.1
    })
    files = {
        'file': (os.path.basename(input_file_path), open(input_file_path, 'rb'), 'audio/mpeg'),
        'model': (None, 'gpt-4o-transcribe'),
        'language': (None, 'en'),
        'prompt': (None, "Welcome to our radio show."),
        'response_format': (None, 'json'),
        'temperature': (None, '0.2'),
        #'chunking_strategy': (None, chunking),
        'chunking_strategy': (None, "auto"),
        'stream': (None, 'true'), # This should return a stream of SSE events??
    }
    async with httpx.AsyncClient(timeout=600) as client:
        try:
            async with client.stream("POST", url, headers=headers, files=files) as response:
                response.raise_for_status()

                async for line in response.aiter_lines():
                    response_body += line + '\n'

        except httpx.HTTPStatusError as exc:
            print(f"HTTP error occurred: {exc}")
            response_body += f"\nHTTP error occurred: {exc}\n{exc.response.text}\n"
        except Exception as e:
            print(f"An API error occurred: {e}")
            response_body += f"\nAPI error occurred: {e}\n"
        finally:
            files['file'][1].close()

    # Save the appended response body asynchronously for logging quality of SSE
    try:
        async with aiofiles.open(output_file_path, "w") as file:
            await file.write(response_body)
        print(f"--- RAW SSE stream saved to '{output_file_path}'.")
    except Exception as e:
        print(f"Output file error: {e}")

    '''
    # Extract the transcribed text - process the complete captured stream
    # or do it with file context manager...
    transcribed_text = transcription['text']  # Adjusted for actual API response
    
    # Save the transcribed text to a file asynchronously
    try:
        async with aiofiles.open(output_file_path, "w") as file:
            await file.write(transcribed_text)
        print(f"--- Transcribed text successfully saved to '{output_file_path}'.")
    except Exception as e:
        print(f"Output file error: {e}")
    '''


async def main():
    input_file_path = "audio1.mp3"
    output_file_path = input_file_path + "-transcript.txt"
    await async_transcribe_audio(input_file_path, output_file_path)

if __name__ == "__main__":
    asyncio.run(main())
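
To inspect the raw SSE capture afterwards, a small parser along these lines could turn the data: lines into dictionaries. It is only a sketch that assumes the stream looks like the two data: lines quoted earlier in this post (JSON payloads followed by a [DONE] marker). The iter_sse_events name is made up for illustration.

```python
import json
from typing import Iterable, Iterator

def iter_sse_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the decoded JSON payload of each data: line until [DONE] is seen."""
    for line in lines:
        line = line.strip()
        # Ignore blank separators and any non-data fields
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        yield json.loads(payload)

# The exact response reported above, used here as sample input
raw = 'data: {"type":"transcript.text.done","text":""}\n\ndata: [DONE]\n'
events = list(iter_sse_events(raw.splitlines()))
final_text = next(
    (e["text"] for e in events if e.get("type") == "transcript.text.done"),
    None,
)
print(repr(final_text))
```

An empty final_text with no delta events before it is the "nothing" case described in this post.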

sps · May 16, 2025, 7:16pm

Seems to be working
[Image: a Visual Studio Code window with a simple Python script (captioned by AI)]

ETA: Here’s the output with extra_body={"chunking_strategy": {"type":"server_vad"}}

TranscriptionTextDeltaEvent(delta='Houston', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' on', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' two', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=',', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=" I've", type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' got', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' a', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' question', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' about', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' Star', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta='liner', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta='.', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' Houston', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta="'s", type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' with', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' you', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=',', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' But', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta='ch', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta='.', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' Go', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' ahead', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta='.', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=" There's", type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' a', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' strange', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' noise', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' coming', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' through', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' the', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' speaker', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=',', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' and', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' I', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=" didn't", type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' know', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' if', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' you', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' could', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' connect', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' into', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' the', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' Star', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta='liner', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' and', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' let', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' me', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' key', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' the', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' mic', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' and', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' let', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' you', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' hear', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' it', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta='.', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' I', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=" don't", type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' know', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=" what's", type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' making', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' it', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=',', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' but', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' I', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=" don't", type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' know', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' if', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=" it's", type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' something', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=" that's", type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' connected', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' between', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' here', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' and', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' there', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' making', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' that', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' happen', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta='.', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' But', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' anyway', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=',', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' can', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' you', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' do', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' that', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta='?', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' We', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' can', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' configure', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' that', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=',', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' But', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta='ch', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta='.', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' Give', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' us', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' a', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' minute', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' and', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=" I'll", type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' call', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' you', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' back', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' when', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=" it's", type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' ready', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta='.', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' Station', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=',', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' Houston', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' on', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' two', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta='.', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=" We're", type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' configured', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' for', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' audio', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' via', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' hard', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' line', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' and', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' CST', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' if', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' you', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' want', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' to', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' give', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' us', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' a', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' call', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta='.', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' Okay', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=',', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=" I'm", type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' in', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' Star', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta='liner', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta='.', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' How', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' do', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' you', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' read', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta='?', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' Five', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' by', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' five', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta='.', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' How', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' me', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta='?', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' Okay', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=',', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=" I'm", type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' going', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' to', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' key', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' the', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' mic', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' up', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' next', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' to', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' the', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' speaker', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta='.', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' Copy', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta='.', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' Hear', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' that', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta='?', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' At', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' negative', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=',', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' But', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta='ch', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta='.', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' We', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' did', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' not', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' hear', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' anything', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta='.', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' All', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' right', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=',', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' But', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta='ch', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=',', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' that', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' one', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' came', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' through', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta='.', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' It', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' was', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' kind', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' of', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' like', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' a', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' pul', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta='sing', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' noise', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=',', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' almost', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' like', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' a', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' sonar', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' ping', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta='.', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=" I'll", type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' do', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' it', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' one', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' more', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' time', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' and', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=" I'll", type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' let', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' y', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta="'all", type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' scratch', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' your', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' heads', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' and', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' see', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' if', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' you', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' can', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' figure', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' out', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=" what's", type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' going', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' on', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta='.', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' Here', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' we', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' go', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta='.', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' All', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' right', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta='.', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' All', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' right', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=',', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' over', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' to', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' you', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta='.', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' Tell', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' us', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' how', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' you', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' figured', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' it', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' out', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta='.', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' Yep', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=',', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' good', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' recording', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta='.', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' Thanks', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=',', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' But', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta='ch', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta='.', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' We', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' will', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' pass', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' it', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' on', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' to', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' the', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' team', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' and', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' let', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' you', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' know', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' what', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' we', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' find', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta='.', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' And', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' But', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta='ch', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=',', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' just', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' to', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' make', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' sure', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=" I'm", type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' on', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' the', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' same', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' page', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=',', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' this', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' is', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' eman', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta='ating', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' from', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' the', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' speaker', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' in', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' Star', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta='liner', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta='.', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' You', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=" don't", type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' notice', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' anything', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' else', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=',', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' any', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' other', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' noises', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=',', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' any', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' other', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' weird', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' configs', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' in', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' there', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta='?', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' Great', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=',', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' thank', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta=' you', type='transcript.text.delta', logprobs=None)
TranscriptionTextDeltaEvent(delta='.', type='transcript.text.delta', logprobs=None)
TranscriptionTextDoneEvent(text="Houston on two, I've got a question about Starliner. Houston's with you, Butch. Go ahead. There's a strange noise coming through the speaker, and I didn't know if you could connect into the Starliner and let me key the mic and let you hear it. I don't know what's making it, but I don't know if it's something that's connected between here and there making that happen. But anyway, can you do that? We can configure that, Butch. Give us a minute and I'll call you back when it's ready. Station, Houston on two. We're configured for audio via hard line and CST if you want to give us a call. Okay, I'm in Starliner. How do you read? Five by five. How me? Okay, I'm going to key the mic up next to the speaker. Copy. Hear that? At negative, Butch. We did not hear anything. All right, Butch, that one came through. It was kind of like a pulsing noise, almost like a sonar ping. I'll do it one more time and I'll let y'all scratch your heads and see if you can figure out what's going on. Here we go. All right. All right, over to you. Tell us how you figured it out. Yep, good recording. Thanks, Butch. We will pass it on to the team and let you know what we find. And Butch, just to make sure I'm on the same page, this is emanating from the speaker in Starliner. You don't notice anything else, any other noises, any other weird configs in there? Great, thank you.", type='transcript.text.done', logprobs=None)

_j · May 16, 2025, 8:13pm

For gpt-4o-transcribe and gpt-4o-mini-transcribe, the only supported format is json.

You are disobeying the API documentation by setting "response_format":"text"! You rebel.

And yet you still get typed events when streaming, with a JSON object collecting the full text as the final event. The API is equally disobedient: it ignores a response_format that it never rejected…

For my 60-second input file, which I have run over and over through Whisper in the past and also through gpt-4o-transcribe normally, it seems this VAD splitter endpoint and method simply will NOT run and return a transcript of the audio.

The only thing I was able to get back for the 1:00 clip of the radio show “Car Talk” with chunking “auto” was the prompt text that was sent. And even that came only when the audio was sent as wav as a workaround; sending the mp3 version gives me an event but still nothing transcribed from this audio file.

I had to find some Feynman lectures in order to get any output from this method…
Which, it turns out, returns without “stream” being required…
And then it goes nuts: splitting the audio input gives it that many more opportunities to repeat the sent text prompt instead of transcribing the audio, I guess.

In order to be able to talk, we just have to agree that we're talking roughly about the same thing.
And I know that you know
As much about time as I need you to know
Welcome to our radio show.
Welcome to our radio show.
Welcome to our radio show.
Another subtlety involved
was already mentioned.
Welcome to our radio show.
Welcome to our radio show.
It turns out that the motion of atoms, that that idea is also false.
Welcome to our radio show.
Welcome to our radio show.
So that doesn't work either, and that's another subtlety that we'll have to get around in quantum mechanics.
Welcome to our radio show.
As we are going to do, we first learn to see what the problems are before the complications.
And then we'll be in a better position to correct it.
Welcome to our radio show.
So we'll take a simple point of view
About time and space you know what it means in a rough way. If you've driven a car, you know what a speed means and so on.
Section 8.2
Speed
Nevertheless, there are still some subtleties.
Welcome to our radio show.
The Greeks were never able to describe motion.
Well, they could do this all right.
But they couldn't describe problems involving the velocity.
The subtlety comes when you try to figure out what you mean by the speed.
The Greeks got very confused about this and a new branch of mathematics had to be discovered beyond the
Geometry and algebra of the Greeks and Arabs.
Welcome to our radio show.
Welcome to our radio show.
Welcome to our radio show.
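
If you want to measure how bad the echo is in output like the above, you can separate the repeated prompt lines from the real transcript after the fact. This is only a rough sketch: it assumes the prompt sent was exactly the string "Welcome to our radio show.", and `split_prompt_echoes` is a made-up helper name, not anything from the API or SDK.

```python
def split_prompt_echoes(transcript: str, prompt: str):
    """Separate lines that merely repeat the prompt from the other lines.

    Hypothetical helper for illustration; compares whole lines only,
    ignoring case and surrounding whitespace.
    """
    target = prompt.strip().casefold()
    kept, echoes = [], 0
    for line in transcript.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # skip blank lines
        if stripped.casefold() == target:
            echoes += 1  # line is just the prompt repeated back
        else:
            kept.append(stripped)
    return kept, echoes


# A few lines copied from the output above.
sample = """Another subtlety involved
was already mentioned.
Welcome to our radio show.
Welcome to our radio show.
It turns out that the motion of atoms, that that idea is also false."""

kept, echoes = split_prompt_echoes(sample, "Welcome to our radio show.")
print(f"{echoes} echoed prompt lines, {len(kept)} transcript lines kept")
```

This only catches whole-line echoes. If the prompt text were spliced into the middle of a chunk, it would slip through.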

wav conversion

data: {"type":"transcript.text.delta","delta":"Welcome"}

data: {"type":"transcript.text.delta","delta":" to"}

data: {"type":"transcript.text.delta","delta":" our"}

data: {"type":"transcript.text.delta","delta":" radio"}

data: {"type":"transcript.text.delta","delta":" show"}

data: {"type":"transcript.text.delta","delta":"."}

data: {"type":"transcript.text.done","text":"Welcome to our radio show."}

data: [DONE]
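
For anyone reading the raw stream rather than going through the SDK, here is a minimal sketch of collecting those `data:` lines into text. It assumes each line carries one JSON object with the `type`, `delta` and `text` fields shown above, and that the stream ends with `data: [DONE]`. `collect_sse_transcript` is just an illustrative name.

```python
import json


def collect_sse_transcript(lines):
    """Join delta events and pick up the final text from raw SSE lines.

    Hypothetical helper; assumes the event shapes seen in the output above.
    """
    deltas = []
    final_text = None
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # blank separator lines and anything else are ignored
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        event = json.loads(payload)
        if event.get("type") == "transcript.text.delta":
            deltas.append(event.get("delta", ""))
        elif event.get("type") == "transcript.text.done":
            final_text = event.get("text", "")
    return "".join(deltas), final_text


# The stream from the wav attempt above, as a list of lines.
stream = [
    'data: {"type":"transcript.text.delta","delta":"Welcome"}',
    '',
    'data: {"type":"transcript.text.delta","delta":" to"}',
    '',
    'data: {"type":"transcript.text.delta","delta":" our"}',
    '',
    'data: {"type":"transcript.text.delta","delta":" radio"}',
    '',
    'data: {"type":"transcript.text.delta","delta":" show"}',
    '',
    'data: {"type":"transcript.text.delta","delta":"."}',
    '',
    'data: {"type":"transcript.text.done","text":"Welcome to our radio show."}',
    '',
    'data: [DONE]',
]

joined, final_text = collect_sse_transcript(stream)
print(joined == final_text)
```

Comparing the joined deltas against the `transcript.text.done` text is a cheap sanity check that no event was dropped along the way.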

Not quite on my “would recommend” list.

sps · May 16, 2025, 8:38pm

Can you share the file so I can test it?
