TL;DR: We’ve spent ~6 months building something that, when it fires on all cylinders, makes doctors stop mid-sentence and say “wait… it just did that?” A passive AI medical scribe that listens, extracts, and structures clinical data in real time: no typing, no clicking, no dictation. The doctor just… talks to their patient, and the AI is there to surface info and extract structured data.
But the Realtime API’s instability is the wall between “incredible demo” and “deployed product.”
What We Built
A HIPAA-compliant passive AI assistant that sits in the background of a doctor-patient encounter. It listens. It extracts structured medical data through tool calls: orders, procedures, diagnoses, medications, clinical observations… whatever the doctor needs, all configurable by them. It can answer questions about the patient’s chart mid-conversation. It can offload complex clinical reasoning to deeper models on the fly.
The session is configured for extraction only:
- `tool_choice: "required"`
- `output_modalities: ["text"]`
- No audio output.
- No chit-chat.
- Tools only.
- A `continue_waiting` tool handles cycles with nothing to extract.
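For context, the session setup looks roughly like this on our side. Treat it as a minimal sketch rather than our exact payload: field names like `output_modalities` vs. `modalities` vary between Realtime API versions, and the extraction tool shown here is a simplified stand-in for the doctor-configurable tool set.

```typescript
// Minimal sketch of the extraction-only session setup (assumed field names;
// the real payload is larger and version-dependent).
const sessionUpdate = {
  type: "session.update",
  session: {
    output_modalities: ["text"],   // no audio back into the room
    tool_choice: "required",       // the model must answer with a tool call
    instructions:
      "You are a passive clinical scribe. Never address the patient or doctor. " +
      "Extract structured data via tools only.",
    tools: [
      {
        type: "function",
        name: "extract_clinical_data",   // illustrative name
        description: "Record structured orders, diagnoses, meds, observations.",
        parameters: {
          type: "object",
          properties: {
            kind: { type: "string", enum: ["order", "diagnosis", "medication", "observation"] },
            detail: { type: "string" },
          },
          required: ["kind", "detail"],
        },
      },
      {
        type: "function",
        name: "continue_waiting",
        description: "Call when the current audio contains nothing to extract.",
        parameters: { type: "object", properties: {} },
      },
    ],
  },
};

// ws is the open Realtime API WebSocket for the session.
declare const ws: WebSocket;
ws.send(JSON.stringify(sessionUpdate));
```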
When it works — and I cannot stress this enough — it is magic. A cardiologist walks into a room, taps a button, has a 20-minute conversation about LVH and medication management, and walks out with structured, organized clinical data ready for the chart. We’ve had sessions where the AI caught medication discrepancies the doctor hadn’t noticed. We’ve had it surface relevant lab trends mid-conversation before the doctor even asked.
That’s the 30%. Here’s the other 70%.
Issue 1: Silent Session Initialization Failures
~1 in 3-4 sessions. Connection opens, green lights everywhere, but the API never processes audio. No errors, no disconnects — just silence. The only way to detect it is the absence of expected events. We’ve built watchdog timers and automatic retry logic, but even with retries it sometimes just won’t start.
You can’t ask a doctor to “try again” while a patient is sitting in front of them, waiting for all these systems to initialize.
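For what it’s worth, the watchdog is conceptually just this; a simplified sketch, with the 10-second threshold and the reconnect hook as illustrative assumptions rather than our production values:

```typescript
// Sketch of the session watchdog: if we are streaming audio but the server
// sends no events for too long, assume a silent init failure and reconnect.
const WATCHDOG_MS = 10_000; // illustrative threshold

function armWatchdog(ws: WebSocket, onDead: () => void): () => void {
  let timer = setTimeout(onDead, WATCHDOG_MS);

  const reset = () => {
    clearTimeout(timer);
    timer = setTimeout(onDead, WATCHDOG_MS);
  };

  // Any server event (speech detection, transcripts, responses, errors)
  // counts as a sign of life and resets the timer.
  ws.addEventListener("message", reset);

  // Disarm hook for clean shutdowns.
  return () => {
    clearTimeout(timer);
    ws.removeEventListener("message", reset);
  };
}

// Usage: tear down and retry the whole session when the watchdog fires.
// const disarm = armWatchdog(ws, () => reconnectSession());
```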
Issue 2: Tool Selection Death Spirals
Once the AI starts calling `continue_waiting`, it frequently gets stuck: 15+ consecutive calls with zero extractions during an active medical conversation full of clear, extractable content. Corrective injections often make it worse; the AI over-indexes on the reminder rather than returning to its job.
We’ve iterated on this extensively. The tool selection behavior is fundamentally inconsistent.
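The guard we keep iterating on looks roughly like the sketch below: count consecutive `continue_waiting` calls and inject one terse corrective item past a threshold. The threshold, message text, and event shape are assumptions for illustration, not a claim about what finally works.

```typescript
// Sketch of the death-spiral guard. Counts consecutive continue_waiting calls
// and injects a short corrective system message once the streak gets too long.
const MAX_IDLE_CALLS = 5; // illustrative
let idleStreak = 0;

// Called whenever a function call from the model completes.
function onToolCall(ws: WebSocket, name: string) {
  if (name !== "continue_waiting") {
    idleStreak = 0;               // real extraction: reset the streak
    return;
  }
  idleStreak += 1;
  if (idleStreak >= MAX_IDLE_CALLS) {
    // Keep the correction terse; long reminders are what the model
    // over-indexes on in our experience.
    ws.send(JSON.stringify({
      type: "conversation.item.create",
      item: {
        type: "message",
        role: "system",
        content: [{
          type: "input_text",
          text: "Resume extracting clinical data from the ongoing conversation.",
        }],
      },
    }));
    idleStreak = 0;
  }
}
```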
Issue 3: Text Responses Despite tool_choice: "required"
The AI periodically generates conversational text (“I understand, let me know…”) despite tool_choice: "required". It wastes processing cycles and triggers cascading correction loops that feed into Issue 2.
This seems like a straightforward bug.
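Our current workaround is to swallow the stray text instead of correcting it, roughly as sketched below. Event names follow the Realtime API naming as we understand it and may differ across versions; `recordExtraction` is a hypothetical downstream handler.

```typescript
// Sketch: drop stray assistant text instead of feeding it back as a
// correction, which in our experience just amplifies the loop from Issue 2.
function handleServerEvent(raw: string) {
  const event = JSON.parse(raw);

  // A completed function call is the only output we act on.
  if (event.type === "response.output_item.done" && event.item?.type === "function_call") {
    recordExtraction(event.item.name, JSON.parse(event.item.arguments));
    return;
  }

  // Text output should not appear with tool_choice: "required"; log it and
  // drop it rather than sending another corrective message.
  if (event.type.startsWith("response.") && event.type.includes("text")) {
    console.warn("Unexpected text output from Realtime API", event.type);
  }
}

// Hypothetical downstream handler that writes structured data to the chart.
declare function recordExtraction(tool: string, args: unknown): void;
```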
Issue 4: Quality Degradation Over Session Length
Beyond 10-15 minutes, extraction accuracy drops noticeably. More idle calls, missed data, less precise tool arguments. Medical encounters run 20-40 minutes routinely. This is a critical gap for any real-world clinical deployment.
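One mitigation we have been experimenting with is session rotation: close the Realtime session well before the degradation window, seed the next session’s instructions with a compact summary of what has already been extracted, and keep streaming. A rough sketch, with the interval and helper functions as assumptions:

```typescript
// Sketch of session rotation: restart the Realtime session on an interval and
// seed the new session with a summary of extractions so far, so the model
// never reasons over 30+ minutes of raw context. All values illustrative.
const ROTATE_AFTER_MS = 10 * 60 * 1000; // stay under the observed 10-15 min cliff

interface Extraction { kind: string; detail: string; }

async function runEncounter(openSession: (instructions: string) => Promise<WebSocket>) {
  const extractions: Extraction[] = [];
  let carryOver = "";

  // Each loop iteration is one short-lived Realtime session.
  while (encounterStillActive()) {
    const instructions =
      "Passive clinical scribe. Tools only. Already captured this visit:\n" + carryOver;
    const ws = await openSession(instructions);

    await streamAudioFor(ws, ROTATE_AFTER_MS, extractions); // pipe mic audio, collect tool calls
    ws.close();

    // Compact summary carried into the next session.
    carryOver = extractions.map(e => `- ${e.kind}: ${e.detail}`).join("\n");
  }
  return extractions;
}

// Hypothetical helpers owned by the rest of the pipeline.
declare function encounterStillActive(): boolean;
declare function streamAudioFor(ws: WebSocket, ms: number, sink: Extraction[]): Promise<void>;
```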
Why This Matters
We’re not building a toy. This is active, HIPAA-compliant clinical infrastructure being tested with real patients, real encounters, real cardiologists. The workflow is validated. The doctors who’ve experienced it working are asking us when they can have it every day.
We have the architecture. We have the clinical integration. We have the audio pipeline, the prompts, the data extraction, the chart integration, the reasoning offloads. Everything around the Realtime API works. The Realtime API itself is the bottleneck.
We’re this close to deploying something that fundamentally changes how clinical documentation works. We just need the engine to be as reliable as the machine we built around it.
Happy to share session logs or debug traces if useful to the engineering team.