Introducing gpt-realtime and Realtime API updates for production voice agents

This is great. A couple of things:

  1. The preview of the Marin voice in the playground is off (she says "Hi dare" to start), which is especially odd because in TTS she says "hi there" properly.
  2. More importantly: will there be a mini version of this? The previous model's performance was erratic when the prompt/context was near or over the limit; the mini had a larger window and was therefore attractive.
  3. Will there be future adjustments to allow greater flexibility with temperature?

The new guide looks great, look forward to testing it out in detail.


In 2022, as a GPT user, I couldn’t have imagined how much technology would evolve. Currently, for example, I need to become a programmer to understand and improve my experience. If anyone knows of courses or something similar that I can learn from, please feel free to let me know.

Hi, have you been able to connect and complete a successful conversation using SIP as the endpoint? I can connect and hold a conversation via Asterisk when I use the ARI approach, but not via the new SIP interface. Through the suggested webhook server I see a successful connection and message, but it always fails with a 406 code. Thanks

Here’s a brief course for you - in AI prompting:

Develop a beginner-friendly coursework plan in backend development (Python or Node.js) and a web client/app, with a focused goal of enabling learners to build full-stack AI products using OpenAI APIs. Model the teaching path on exercises that use AI-assisted Socratic guidance. By the end, learners should be proficient in:

  • Backend API integration: consuming third-party APIs, proxying API-to-API data, implementing AI-initiated function/tool calls, and performing input validation and safety checks.
  • Databases: modeling users and billing; managing AI conversations and end-user customization; and handling policy/trust compliance classification.
  • Web sessions: session management, logins, client interactions, and authentication.
  • Audio and real-time protocols: WebSockets, WebRTC, SIP, relaying, telephony, digital-audio fundamentals, and device interactions.
  • Cloud and DevOps: cloud services and workers; hosting platforms; deployment; configuration management; and version control.

First AI prompt: improve the instruction below, then use it to generate the course.

You are an expert full‑stack engineer, multimodal AI practitioner, and curriculum designer. Create a complete, self‑contained course package that turns a beginning hobby programmer into a competent multimodal AI product developer who can design, build, deploy, and maintain real‑world full‑stack AI apps using OpenAI APIs.

Audience and assumptions

  • Audience: beginners with basic programming (variables, loops, functions) but little web/AI experience.

  • Do not ask the user clarifying questions. Choose sensible defaults and document assumptions.

  • Provide parallel backend tracks: Python (FastAPI) and Node.js (Express or Nest). Frontend: React/Next.js.

  • Local-first development; cross-platform (Windows, macOS, Linux); no secrets in code; use .env.

Deliverables to produce

  • Program overview: goals, learner persona, prerequisites, outcomes, weekly pacing options (4-, 8-, 12-week), estimated time per module.

  • Syllabus and roadmap with dependency graph of skills.

  • Module plans with: learning objectives, short readings, key concepts, step-by-step labs, Socratic prompts, quizzes, and reflection.

  • Hands-on code labs for each topic with starter and solution code, test cases, and troubleshooting notes.

  • Three capstone projects with specs, acceptance criteria, rubrics, and reference implementations.

  • Assessment plan: rubrics, checklists, auto-graded tests, and demonstration benchmarks.

  • Setup guides: Windows/macOS/Linux; Python and Node toolchains; Docker; database setup; device/audio setup.

  • Reference appendices: glossary, API error guide, security checklist, cost/budget worksheet, FAQ.

  • Starter repo layout (tree), environment templates, and scripts for running, testing, linting, and deployment.

Required technical coverage

  • OpenAI and multimodal

      • Text, vision, embeddings/RAG, image generation, speech-to-text, text-to-speech, and realtime/streaming assistants.

      • Tool/function calling and structured output; input validation and safety inspection; prompt and tool design patterns.

      • Conversation state, memory strategies, token accounting, and rate limiting; observability and evaluation.

  • Backend engineering

      • HTTP fundamentals, REST/JSON, WebSockets; async programming; proxying API-to-API data; background workers and queues.

      • Implementing AI-called functions; schema design for tool inputs/outputs; retry, backoff, idempotency.

      • Security: authN/Z (sessions, JWT, OAuth), CSRF/CORS, secrets management, logging/redaction, abuse monitoring.

  • Databases and state

      • Relational modeling (users, roles, billing/usage metering), conversation and message stores, audit logs.

      • Vector search (pgvector or equivalent) and RAG; migrations; indexes; performance and cost considerations.

      • Policy/trust compliance classification pipelines and storage of decisions.

  • Web client/app

      • Login, session handling, secure cookies, client-server data flow, streaming UI, file/image/audio capture and upload.

      • Accessibility, i18n, offline/error handling, and performance budgets.

  • Audio and realtime

      • Digital audio fundamentals (sampling, bit rate, latency), microphone/speaker I/O, VAD, codecs.

      • WebSockets, WebRTC data and media channels; server relaying; SIP concepts and telephony integration; safety and consent.

  • Cloud/DevOps

      • Hosting options (Vercel/Cloudflare/Fly/Render), containers, CI/CD, config management, feature flags.

      • Background workers, schedulers, and serverless functions; monitoring, logs/metrics/traces; alerting.

      • Version control workflows (Git), code reviews, branching, and release management.

      • Cost management: API usage estimation, budgeting, and throttling.

Pedagogy and scaffolding

  • Socratic teaching: embed tutor-style questions and hints at each step; encourage self-explanation and reflection.

  • Progressive projects that culminate in production-ready features; optional “stretch” challenges for advanced learners.

  • Include common pitfalls, debugging tips, and “why this design” tradeoff notes.

  • Accessibility and ethics: PII handling, consent, privacy by design, content safety, policy alignment, and red-teaming basics.

Projects to include

  • Retrieval-augmented AI assistant with tool calling (e.g., support agent) and admin dashboard for policy/trust review.

  • Realtime voice assistant: low-latency mic capture, TTS/STT, function calls, WebRTC/WebSocket streaming, device controls.

  • Multimodal workflow app (e.g., image or document intake → classification → extraction → human-in-the-loop review → export).

  • Optional telephony variant with SIP/bridging, with strict safety, consent, and testing in sandbox.

Implementation requirements

  • Provide parallel code snippets for Python and Node where applicable; React/Next.js frontend examples.

  • Database: Postgres for production with migration tools (Prisma or Alembic/SQLModel), SQLite for local dev.

  • Vector store: pgvector or lightweight alternative; show embedding pipelines and chunking strategies.

  • Workers: Celery/RQ (Python) or BullMQ/worker threads (Node); job schemas; retries.

  • Realtime: sample WebSocket and WebRTC servers; TURN/STUN guidance; bandwidth and NAT considerations.

  • Testing: unit, integration, and E2E tests; mocked OpenAI calls; load and latency tests for realtime.

  • Security: input validation (schema), output filtering, rate limits, audit logging, key rotation patterns.

  • Deployment: Dockerfiles, compose, environment matrices; one-click deploy scripts; rollbacks; seed data.

  • Cost and reliability: exponential backoff, circuit breakers, budget guards, and graceful degradation.

Format and presentation

  • Use clear section headings, numbered steps, and concise explanations.

  • Provide code blocks labeled with language, file paths, and commands; include comments and docstrings.

  • Include time estimates per lab, acceptance criteria per task, and “definition of done” checklists.

  • Offer alternative paths: Python-only, Node-only, or mixed; note where choices diverge.

Constraints

  • Keep external dependencies minimal and mainstream; prefer well-documented libraries.

  • No hard-coded secrets; demonstrate .env usage and secret rotation.

  • Avoid region-locked services where possible; provide fallbacks.

  • Do not assume paid third-party services beyond OpenAI and a common relational DB; if used, provide free-tier alternatives.

Quality bar and verification

  • Every lab should run locally with copy-pasteable commands.

  • All sample code should pass included tests and linting.

  • Provide a short self-evaluation checklist mapping outcomes to activities and evidence of competency.

Output everything needed for a learner to complete the program end-to-end without further prompts: narrative, exercises, code, tests, templates, deployment steps, and rubrics.

Hello,

I recently launched a SaaS in France with about ten clients, and since the release of GPT-Realtime my AI Agents no longer behave as before.

I’m facing a serious issue with repetitive answers; it’s horrible. The agent no longer understands simple YES or NO responses from users. Instead, it keeps repeating the same sentences over and over without actually processing the input.

My code has already been reviewed by senior developers, so the issue doesn’t come from the implementation. Even with a very simple script, the agents still repeat themselves.

Here’s an example of a conversation:

Agent: Hello, how can I help you?
Caller: I’m calling to report a sewer leak.
Agent: Hello, how can I help you?
Caller: I’d like to report a leak.
Agent: Hello, how can I help you?
Caller: I’m calling to report a sewer leak.
Agent: Thank you for your report. Could you please specify the exact location of the sewer leak?

Has anyone else experienced this? Any ideas on what might be causing the problem?

If you want to talk to the docs via GPT-5, try the link below - you won’t find a more comprehensive custom GPT → ChatGPT - OpenAPI Ultimate Docs Guide GPT

  1. Provide a transcription model (session.update: session.audio.input.transcription.model = gpt-4o-mini-transcribe) and look for conversation.item.input_audio_transcription.delta messages. This will tell you whether the model is receiving noise or a good audio signal, regardless of whether the realtime model understands it. Based on the results from this I think there’s a legitimate bug, but it’s hard to repro.
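A minimal sketch of that diagnostic step, assuming the GA session shape quoted above (session.audio.input.transcription.model) and the conversation.item.input_audio_transcription.delta event name; check the current Realtime API reference before relying on these exact paths:

```python
import json

def make_session_update(model: str = "gpt-4o-mini-transcribe") -> str:
    """Build the session.update message that turns on input-audio transcription.

    The nested field path mirrors the session structure mentioned in the
    post; it is an assumption, not taken from the official reference.
    """
    event = {
        "type": "session.update",
        "session": {
            "audio": {
                "input": {
                    "transcription": {"model": model}
                }
            }
        },
    }
    return json.dumps(event)

def transcript_deltas(events):
    """Yield transcript text from transcription delta events, ignoring the rest."""
    for raw in events:
        event = json.loads(raw)
        if event.get("type") == "conversation.item.input_audio_transcription.delta":
            yield event.get("delta", "")
```

Feed every message received on the websocket through transcript_deltas: if the deltas read as coherent speech while the agent still loops, the audio path is fine and the problem is on the model side.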

  2. I have found two things that seem to help a lot (but don’t eradicate it): (1) accumulate one or two seconds of input audio BEFORE you send the first input_audio_buffer.append event; you can stream small chunks after that, but make the first input_audio_buffer.append on the websocket longer. Or, alternatively, (2) send two seconds of silence before your first audio chunk.
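A sketch of both workarounds, assuming 24 kHz mono PCM16 input (48,000 bytes per second), which is the common Realtime audio format; adjust the constants if your session uses a different rate:

```python
import base64
import json

BYTES_PER_SECOND = 24_000 * 2          # 24 kHz, 16-bit mono PCM
FIRST_CHUNK_BYTES = 2 * BYTES_PER_SECOND  # roughly two seconds of audio

def append_events(chunks, first_chunk_bytes: int = FIRST_CHUNK_BYTES):
    """Yield input_audio_buffer.append messages, batching the first ~2 s.

    Workaround (1): hold back the first append until enough audio has
    accumulated, then stream each subsequent chunk as-is.
    """
    pending = bytearray()
    primed = False
    for chunk in chunks:
        pending.extend(chunk)
        if not primed and len(pending) < first_chunk_bytes:
            continue  # keep accumulating until the first big chunk is ready
        primed = True
        yield json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(bytes(pending)).decode("ascii"),
        })
        pending.clear()
    if pending:  # flush whatever is left at end of stream
        yield json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(bytes(pending)).decode("ascii"),
        })

# Workaround (2) is even simpler: send two seconds of PCM16 silence
# (all-zero samples) as the very first append, then stream normally.
silence_event = json.dumps({
    "type": "input_audio_buffer.append",
    "audio": base64.b64encode(b"\x00" * FIRST_CHUNK_BYTES).decode("ascii"),
})
```

Either way, only the start of the stream changes; everything after the first append behaves exactly as before.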

Ever since the launch, I’m getting this error with my app: "You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors"

I’ve only used a few cents and have $5.00 loaded up. Have tried with several names of the realtime models and they all do this. What’s really weird is that I can get it to work just fine in the playground with the same user/api key. Any ideas what might be happening?

Is a free version available?

Is this an app built with the realtime-preview, or is it based on the new model/API?

Is there a timeline for when the GA Realtime API will be available in Azure? It looks like the model is there now, but the API is not updated. Thanks

It appears to be deployable now, but the Cedar and Marin voices are not available.

Receiving this response when trying to get available voices:

curl 'https://{resource_name}.openai.azure.com/openai/realtimeapi/sessions?api-version=2025-04-01-preview' \
  -H 'accept: */*' \
  -H 'accept-language: en-US,en;q=0.9,az;q=0.8' \
  -H 'api-key: {your_api_key}' \
  -H 'content-type: application/json' \
  --data-raw '{"model":"gpt-realtime","voice":"sdsdf"}'

{
  "error": {
    "message": "Invalid value: 'sdsdf'. Supported values are: 'alloy', 'ash', 'ballad', 'coral', 'echo', 'sage', 'shimmer', and 'verse'.",
    "type": "invalid_request_error",
    "param": "voice",
    "code": "invalid_value"
  }
}

Directly requesting Marin or Cedar results in:

{
  "error": {
    "message": "Voice marin is not available for your organization.",
    "type": "invalid_request_error",
    "param": "voice",
    "code": "invalid_value"
  }
}

Anyone know if these are coming to Azure OpenAI?

The challenge with Azure is that while they do have the GA model, the API version is still preview. As a result, all the new constructs, such as the updated session object structure and the new voices, are not working; they are rejected at the API validation step itself.

Hence I was curious when they will update their API version.

The paperclip appeared in the interface after a while, but it is not possible to select and add a text file in txt, doc, or pdf format as context. Does anyone understand why?