[REALTIME API] - FEEDBACK - We Built a Star Trek Medical Computer on the Realtime API, It Works 30% of the Time

TL;DR: We’ve spent ~6 months building something that, when it fires on all cylinders, makes doctors stop mid-sentence and say “wait… it just did that?” A passive AI medical scribe that listens, extracts, and structures clinical data in real time — no typing, no clicking, no dictation. The doctor just… talks to their patient; the AI is there to surface info and extract structured data.

  • But the Realtime API’s instability is the wall between “incredible demo” and “deployed product.”

What We Built

A HIPAA-compliant passive AI assistant that sits in the background of a doctor-patient encounter. It listens. It extracts structured medical data through tool calls — orders, procedures, diagnoses, medications, clinical observations… whatever the doctor needs, all configurable by them. It can answer questions about the patient’s chart mid-conversation. It can offload complex clinical reasoning to deeper models on the fly.

  • tool_choice: "required"
  • output_modalities: ["text"]
  • No audio output.
  • No chit-chat.
  • Tools only.
    • A continue_waiting tool handles cycles with nothing to extract (rough config sketch below).
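For the curious, here’s a trimmed sketch of the session config. The tool names and schemas below are simplified stand-ins for our real ones, and continue_waiting is our own no-op tool, not an API feature:

```python
# Trimmed sketch of our session.update payload (illustrative, not our real
# config; the production version carries many more tools and full schemas).
session_update = {
    "type": "session.update",
    "session": {
        "type": "realtime",
        "output_modalities": ["text"],  # text only, never audio out
        "tool_choice": "required",      # tools only, no chit-chat
        "tools": [
            {
                "type": "function",
                "name": "extract_medication",  # illustrative tool
                "description": "Record a medication mentioned in the encounter.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "name": {"type": "string"},
                        "dose": {"type": "string"},
                    },
                    "required": ["name"],
                },
            },
            {
                "type": "function",
                "name": "continue_waiting",  # our no-op escape hatch
                "description": "Call when the current audio has nothing to extract.",
                "parameters": {"type": "object", "properties": {}},
            },
        ],
    },
}
```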

When it works — and I cannot stress this enough — it is magic. A cardiologist walks into a room, taps a button, has a 20-minute conversation about LVH and medication management, and walks out with structured, organized clinical data ready for the chart. We’ve had sessions where the AI caught medication discrepancies the doctor hadn’t noticed. We’ve had it surface relevant lab trends mid-conversation before the doctor even asked.

That’s the 30%. Here’s the other 70%.


Issue 1: Silent Session Initialization Failures

~1 in 3-4 sessions. Connection opens, green lights everywhere, but the API never processes audio. No errors, no disconnects — just silence. The only way to detect it is the absence of expected events. We’ve built watchdog timers and automatic retry logic, but even with retries it sometimes just won’t start.
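For anyone building something similar, here’s the shape of our watchdog, simplified. The event names are real Realtime API server events; connect_session and stream_audio are placeholders for our own plumbing:

```python
import asyncio
import json

WATCHDOG_SECS = 10  # how long we wait for the first sign of life

async def wait_for_event(ws, wanted_types):
    # Consume server events until one proves audio is being processed.
    async for raw in ws:
        if json.loads(raw).get("type") in wanted_types:
            return

async def run_with_watchdog(connect_session, stream_audio, max_retries=3):
    for _ in range(max_retries):
        ws = await connect_session()
        sender = asyncio.create_task(stream_audio(ws))
        try:
            await asyncio.wait_for(
                wait_for_event(ws, {"input_audio_buffer.speech_started",
                                    "conversation.item.created"}),
                timeout=WATCHDOG_SECS,
            )
            return ws, sender  # session is alive, hand it off
        except asyncio.TimeoutError:
            sender.cancel()    # silent failure: green lights, no processing
            await ws.close()
    raise RuntimeError("session never started processing audio")
```

Even with this in place, some sessions burn through every retry and never start.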

You can’t ask a doctor to “try again” while a patient is sitting in front of them, waiting for the system to initialize.


Issue 2: Tool Selection Death Spirals

Once the AI starts calling continue_waiting, it frequently gets stuck — 15+ consecutive calls with zero extractions during active medical conversation with clear, extractable content. Corrective injections often make it worse; the AI over-indexes on the reminder rather than returning to its job.

We’ve iterated on this extensively. The tool selection behavior is fundamentally inconsistent.
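The mitigation that has worked least badly for us, roughly sketched below. The threshold and the rotate-the-session response are our own engineering, nothing API-level:

```python
MAX_IDLE_CALLS = 6  # tuned by trial and error, nothing principled

class SpiralBreaker:
    """Detect continue_waiting death spirals during active conversation."""

    def __init__(self):
        self.idle_streak = 0

    def on_tool_call(self, name: str) -> bool:
        """Return True when the caller should rotate to a fresh session."""
        if name == "continue_waiting":
            self.idle_streak += 1
        else:
            self.idle_streak = 0  # any real extraction resets the streak
        return self.idle_streak >= MAX_IDLE_CALLS
```

Rotating beats injecting: every corrective message we tried just became the new thing the model obsessed over.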


Issue 3: Text Responses Despite tool_choice: "required"

The AI periodically generates conversational text (“I understand, let me know…”) despite tool_choice: "required". It wastes processing cycles and triggers cascading correction loops that feed into Issue 2.

This seems like a straightforward bug.
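Our current handling, sketched. We gave up on corrective injections and just drop the stray text on the floor:

```python
import json

def handle_response_done(event: dict) -> list[dict]:
    """Keep only function calls from a response.done event; drop text."""
    calls = []
    for item in event.get("response", {}).get("output", []):
        if item.get("type") == "function_call":
            calls.append(item)
        elif item.get("type") == "message":
            # "I understand, let me know..." despite tool_choice: "required".
            print("dropped stray text turn:", json.dumps(item)[:120])
    return calls
```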


Issue 4: Quality Degradation Over Session Length

Beyond 10-15 minutes, extraction accuracy drops noticeably. More idle calls, missed data, less precise tool arguments. Medical encounters run 20-40 minutes routinely. This is a critical gap for any real-world clinical deployment.
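A workaround we’ve sketched (the API shouldn’t require this): rotate to a fresh session before the falloff, seeding it with a summary of extractions so far. Here open_session, summarize_so_far, and audio_chunks are placeholders for our own code:

```python
import time

ROTATE_SECS = 10 * 60  # hand off before the observed 10-15 minute falloff

async def rotating_scribe(open_session, summarize_so_far, audio_chunks):
    summary = ""  # carried across sessions as seed context
    ws = await open_session(seed_context=summary)
    started = time.monotonic()
    async for chunk in audio_chunks:
        if time.monotonic() - started > ROTATE_SECS:
            summary = await summarize_so_far()  # extractions so far, as text
            await ws.close()
            ws = await open_session(seed_context=summary)
            started = time.monotonic()
        await ws.send(chunk)  # same encounter, fresh session
    await ws.close()
```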


Why This Matters

We’re not building a toy. This is active, HIPAA-compliant clinical infrastructure being tested with real patients, real encounters, real cardiologists. The workflow is validated. The doctors who’ve experienced it working are asking us when they can have it every day.

We have the architecture. We have the clinical integration. We have the audio pipeline, the prompts, the data extraction, the chart integration, the reasoning offloads. Everything around the Realtime API works. The Realtime API itself is the bottleneck.

We’re :pinching_hand: this close to deploying something that fundamentally changes how clinical documentation works. We just need the engine to be as reliable as the machine we built around it.

Happy to share session logs or debug traces if useful to the engineering team.

4 Likes

I feel your pain on this, having seen the experience grow worse as time progresses. There is only a 32K input token limit. Have you experimented with calling conversation.item.delete on old items? perhaps with a separate model flagging what can be deleted, and perhaps even introducing a consolidation of several items and deleting them? That is on my TODO list to tackle this problem
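Something like this is the shape of it (untested; MAX_ITEMS is a crude stand-in for real token accounting against the 32K cap):

```python
import json

MAX_ITEMS = 40  # crude proxy; a smarter version would count tokens

class ItemPruner:
    def __init__(self):
        self.item_ids = []

    def track(self, item_id):
        # Call from your conversation.item.created handler.
        self.item_ids.append(item_id)

    async def prune(self, ws):
        # Delete oldest items first. conversation.item.delete is a real
        # client event; everything else here is sketch.
        while len(self.item_ids) > MAX_ITEMS:
            oldest = self.item_ids.pop(0)
            await ws.send(json.dumps({
                "type": "conversation.item.delete",
                "item_id": oldest,
            }))
```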

1 Like

Hey @multitechvisions, appreciate you bringing this to our attention. We’re going to dig into this and see what’s driving it. We’ll share an update once we learn more.

3 Likes

Hi

I am currently working on a medical program and may have a solution to your API timeout issue.

It’s not a timeout issue: it’s that the model just… doesn’t work sometimes. It’s unstable and unreliable.

Does it seem like it loses track of the overall conversation, “hallucinating”?
If that is the case, I may have a solution, but it depends on your hardware and software capabilities :thinking:

1 Like

No, we engineered that away eons ago. Sure, there’s the occasional thing here or there; it’s probably nearly impossible to avoid that entirely given the nature of these stochastic models, but with enough checks and balances you can catch it and smooth it out.

No the problem is literally that the model just won’t respond sometimes, won’t behave like it’s supposed to.

  • We get server events back
  • Everything appears to be working
  • Even when we force the model to respond, it responds (and even includes content from the conversation it’s listening to)

But it won’t call tools or respond normally, based on verbal cues.

If I rerun the same audio through the system, it’ll work that time. Try a third time and maybe it works; try a fourth and maybe it doesn’t; the fifth works; the sixth, who knows…

We’ve been dealing with this for a long time, engineering refinement after refinement to try to solve this problem, and it comes down to the model itself.

When there is no difference between session A and session B, literally token-for-token identical inputs for both text and audio, yet one works and one doesn’t… it has to be the model.

Hopefully the dev team can figure it out!

1 Like

They are gaslighting you in this thread. The model is routinely degraded and customers are routinely throttled. That’s what you’re experiencing, and OpenAI knows it. It’s not hallucinating; that’s an antiquated way of thinking about this, and probably a planned scapegoat by OpenAI (and Anthropic).

My experience was not dissimilar: two weeks of absolute magic with codex and 5.4. Then I made the mistake of paying $200 for the Pro plan. Quite literally, as soon as I did that, performance of all models became so bad as to be unusable. A $200/mo cortisol factory.

I’m considering doing a chargeback.

Hi,
Pragmatically speaking,

I haven’t gone through the whole thread yet so forgive me if this has already been mentioned…

Don’t use the AI to record like this. Use a simple digital recorder so you have a copy that can be re-inserted if things go south.

This alone seems like it would solve most of your problems until the AI fully trains up on your workflow.

The tech is just stable enough that you’ll get the results you need with a few nudges… and you can split the digital file into smaller chunks, process one chunk to produce context, then insert the next chunk to extend that context, and repeat until the session is fully conceptualized.
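Roughly like this, where transcribe_and_extract stands in for whatever pipeline you already have:

```python
CHUNK_SECS = 120  # arbitrary slice size; tune to your encounters

def replay_in_chunks(audio: bytes, bytes_per_sec: int, transcribe_and_extract):
    context = ""
    step = CHUNK_SECS * bytes_per_sec
    for start in range(0, len(audio), step):
        chunk = audio[start:start + step]
        # Each pass sees a summary of everything before it.
        context = transcribe_and_extract(chunk, prior_context=context)
    return context  # the fully "conceptualized" session
```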

Hi,

Thanks for sharing the detailed feedback on your Realtime API implementation. Have you had a chance to test with our latest Realtime model, gpt-realtime-1.5?

Model details: https://developers.openai.com/api/docs/models/gpt-realtime-1.5

To help our team investigate the silent initialization, tool-calling, and longer-session reliability issues you described, please contact support@openai.com with as much detail as possible, including:

  • 3-5 example sess_... session IDs for failing sessions
  • Exact timestamps and timezone for each failure
  • The model name you were using
  • Your session.update and response.create payloads, with secrets removed
  • Client-side WebSocket event logs around the failure
  • Any relevant tool-call logs, especially around continue_waiting
  • Whether the same audio succeeds if replayed through a new session

Those details will let Support route the issue with enough evidence for deeper investigation.