Invalid 'input_audio_transcription.prompt': string too long

Why is the prompt limit for the gpt-4o-transcribe realtime API only 1024 characters?
The error is thrown when I create a transcription session via /v1/realtime/transcription_sessions:

  "input_audio_transcription": {
    "model": "gpt-4o-transcribe",
    "language": null,
    "prompt": ""
  },

I can only pass a 1024-character string, far less than the 16,000-token context window supported by gpt-4o-transcribe.

Any reason why?

The reason why is to protect you - from yourself.

A prompt is not a place for commands or behaviors.

The parameter comes from the Whisper series of models, where it is meant to be lead-up text that is not reproduced in the output: the transcript immediately preceding the audio. This gives the model context that improves the text it produces as a continuation.

I expect that, for gpt-4o, it is similarly wrapped in a container along with some text stating its purpose. And thus it doesn’t work reliably - the same way that when you say “continue this” to the model, there is a 50/50 chance it doesn’t continue.
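
If you do want to feed the preceding transcript in as the prompt, here is a minimal Python sketch of keeping it under the limit by retaining only the most recent text. The 1024 figure is simply the limit reported in this thread, and the helper names `tail_prompt` and `transcription_config` are made up for illustration. It only builds the payload fragment shown in the question and does not call the API.

```python
# Character limit as observed in this thread; not an official constant.
PROMPT_LIMIT = 1024


def tail_prompt(previous_text: str, limit: int = PROMPT_LIMIT) -> str:
    """Return at most `limit` trailing characters of the prior transcript.

    The most recent text is kept, since the prompt is meant to be the
    transcript immediately before the audio.
    """
    text = previous_text.strip()
    if len(text) <= limit:
        return text
    tail = text[-limit:]
    # If the cut landed in the middle of a word, drop the partial first word.
    if not text[-limit - 1].isspace():
        first_space = tail.find(" ")
        if first_space != -1:
            tail = tail[first_space + 1:]
    return tail.lstrip()


def transcription_config(previous_text: str) -> dict:
    """Build the payload fragment from the question with a clamped prompt."""
    return {
        "input_audio_transcription": {
            "model": "gpt-4o-transcribe",
            "language": None,
            "prompt": tail_prompt(previous_text),
        }
    }
```

Cutting from the front rather than the end is a design choice: the text closest to the new audio is the part most likely to help the continuation.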

@_j thank you for the reply.
My next question is: does realtime gpt-4o-transcribe reuse existing audio or transcription results? From my tests it looks like it does not. I have cases where it transcribes the audio into a totally different language.

Every API call you make will be independent, a new instance, stateless. Were it not, imagine the confusion when I send 100 API calls in parallel.

Even when producing a “chat” with an entity, to make every question and every interaction not appear to be the fresh start it is, we must send back the previous turns of conversation as a pattern that appears to be continuing.

Thus the prompt for speech-to-text models is just that: a preloading of observed speech-as-text that the AI continues its output from.

Multimodal gpt-4o operates differently: it observes the entire context window it has been given, through an attention mechanism, and completes it with intelligence, especially when prompted or fine-tuned. So although this model is undocumented except by name, we can assume its instruction is similar to “you repeat this back, but as generated text”, with confusion still on the table.