Batch API with o3-deep-research spawning duplicates

Looking for help understanding what we’re seeing on the Batch API with o3-deep-research-2025-06-26.

What we submitted

  • One day of legitimate state-research work across 7 different state slugs

  • Each state: 4 separate POST /v1/batches, each batch containing exactly 1 line targeting /v1/responses

  • Total legitimate request lines submitted across the whole day: 27 (7 × 4 would be 28, but one slug had only 3)

  • Our server logs the route invocation each time and confirms 27 distinct submissions, with no client-side retries
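
For reference, here is a minimal sketch of how one such single-line batch input could be built. The envelope keys (custom_id, method, url, body) follow the documented Batch input format, and the body values are the ones quoted in this post. The helper name, the custom_id scheme, the placeholder prompt and the web_search_preview tool entry are illustrative assumptions, not our exact production code.

```python
import json

MODEL = "o3-deep-research-2025-06-26"


def build_batch_line(slug: str, call_index: int, prompt: str) -> str:
    """Return one JSONL line targeting /v1/responses (hypothetical helper)."""
    request = {
        # custom_id scheme is an assumption made for this sketch
        "custom_id": f"{slug}-call-{call_index}",
        "method": "POST",
        "url": "/v1/responses",
        "body": {
            "model": MODEL,
            "input": prompt,
            "max_output_tokens": 100000,
            "max_tool_calls": 30,
            "background": False,
            # assumed tool config; the real one is not reproduced in this post
            "tools": [{"type": "web_search_preview"}],
        },
    }
    # json.dumps emits a single line, which is what a JSONL file needs
    return json.dumps(request)


line = build_batch_line("wyoming", 1, "placeholder prompt")
print(line)
```

Each batch input file in our case would contain exactly one line like this, so request_counts.total should be 1 for every batch.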

What we observed in the OpenAI dashboard

For the most recently affected slug alone, the Logs → Responses view shows 629 entries today, all with the same JSONL body that our 4 batches for that slug submitted (same prompt text, same max_output_tokens: 100000, same max_tool_calls: 30, same tool config, background: false, metadata: {}). About 80 of them are full completions with returned content; the rest are 429s from after we hit the per-model TPD cap.

So 4 batch lines became ~80 billed Deep Research completions, plus several hundred additional /v1/responses log entries that didn’t return content. The same pattern played out earlier in the day across the other 6 state slugs.

Today’s spend on the model: $47.12, which is consistent with dozens of completed Deep Research runs, not the 27 lines we submitted.

What we have ruled out from our side

  • We have exactly two /v1/responses submission paths in our code: a direct background: true call and the batch JSONL above. Every one of the mystery resp_… objects has background: false, so the direct path is not the source.

  • GET /v1/batches?limit=100 returns 27 batches for the day total. No extra batches on the affected slug. No older Wyoming batches. So fan-out is not “duplicate submissions we forgot about”.

  • Server-side logs show exactly one route invocation per slug for the day. No client-side retry loop.

  • Webhooks and crons in the codebase are read-only against OpenAI.

Other oddities

  • Each batch’s request_counts reads {total: 1, completed: 0, failed: 0}, despite the dashboard showing many /v1/responses executions associated with that batch’s JSONL body. So the counters don’t reflect what actually ran.

  • output_file_id and error_file_id are both null on all the batches in question, so the work isn’t surfacing through normal batch output channels even though we’re being billed for it.

  • 4 batches for one slug have been stuck in cancelling for 11+ hours after POST /v1/batches/{id}/cancel returned HTTP 200. They never transitioned to cancelled. The other 23 batches show status completed despite having no output file.

  • The spawned resp_… objects carry no link back to a parent batch — metadata: {} on every one of them — so there’s no way from a response in the logs to figure out which batch produced it.
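
The checks behind these observations can be sketched as a small filter over batch objects shaped like the ones GET /v1/batches returns. The field names (status, cancelling_at, output_file_id, error_file_id) are the ones on the Batch object; the function name, the one-hour threshold and the sample IDs are made up for illustration.

```python
from typing import Iterable


def flag_batches(batches: Iterable[dict], now: int, stuck_after_s: int = 3600) -> dict:
    """Split batches into stuck-in-cancelling and completed-without-output."""
    stuck, silent = [], []
    for b in batches:
        if b.get("status") == "cancelling":
            started = b.get("cancelling_at")  # Unix seconds
            if started is not None and now - started > stuck_after_s:
                stuck.append(b["id"])
        elif b.get("status") == "completed":
            # A completed batch normally exposes an output or an error file
            if b.get("output_file_id") is None and b.get("error_file_id") is None:
                silent.append(b["id"])
    return {"stuck_cancelling": stuck, "completed_without_output": silent}


# Illustrative sample data, not real batch IDs
sample = [
    {"id": "batch_a", "status": "cancelling", "cancelling_at": 1000},
    {"id": "batch_b", "status": "completed", "output_file_id": None, "error_file_id": None},
    {"id": "batch_c", "status": "completed", "output_file_id": "file_x", "error_file_id": None},
]
report = flag_batches(sample, now=1000 + 11 * 3600)
print(report)
```

Run against our real list, a filter like this would be expected to put 4 batches in the first bucket and 23 in the second.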

Questions for the community / OpenAI staff

1. Under what circumstances does a single batch line with total: 1 produce many /v1/responses executions? Is there an internal retry / fan-out policy on the batch worker?

2. Why do batch request_counts not reflect the actual number of /v1/responses executions performed for that batch?

3. Is it expected that a completed batch with no output_file_id still ran billable executions whose outputs were not delivered through the batch API?

4. Should per-line /v1/responses objects inherit the batch envelope’s metadata so they’re attributable? Right now the response object stands alone with no parent reference, which makes incidents like this very hard to diagnose.

Substantial new evidence after rotating the API key. The bug reproduces cleanly on the new key, with no 429s involved, and is much more serious than my original report suggested. The previous “429 retry loop” framing was wrong. Below is what just happened.

Setup

  • Old sk-proj-…0xIW2XgA key rotated this afternoon; all spend stopped briefly.

  • On the new key I submitted exactly one batch to verify the fix: batch_6a0334d99ce481908a3c9cc7e9a4399c, slug wyoming, call 1, 1 line, endpoint: /v1/responses, o3-deep-research-2025-06-26.

  • Fresh quota window. No 429s. No client retries.

What happened

That single batch line produced 3 separate, successful, billed Deep Research completions in under 30 minutes.

All three have status: completed, error: null, incomplete_details: null. All three have metadata: {} because the batch envelope’s metadata does not propagate into spawned response objects.

The smoking gun is the dispatch timing

  • Dispatch #2 was created at 14:23:11, about 4 minutes before Dispatch #1 completed at 14:27:13. So #2 was not dispatched in response to #1 failing or timing out; #1 was still running successfully.

  • Dispatch #3 was created at 14:29:11, 1 minute 58 seconds after Dispatch #1 had already completed and presumably reported back. So #3 was not dispatched because the batch worker thought #1 was missing.

This is spontaneous duplication of a successful, in-flight or already-completed batch line.
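
The gaps can be recomputed directly from the three timestamps quoted above. This is a small sketch; the helper name is mine and it assumes all three times fall on the same day.

```python
from datetime import datetime, timedelta


def gap(earlier: str, later: str) -> timedelta:
    """Time elapsed between two same-day HH:MM:SS timestamps."""
    fmt = "%H:%M:%S"
    return datetime.strptime(later, fmt) - datetime.strptime(earlier, fmt)


# Dispatch #2 created -> Dispatch #1 completed
overlap = gap("14:23:11", "14:27:13")
# Dispatch #1 completed -> Dispatch #3 created
after_done = gap("14:27:13", "14:29:11")

print("#2 started this long before #1 finished:", overlap)
print("#3 started this long after #1 finished:", after_done)
```

A positive first gap means #2 was created while #1 was still in flight; a positive second gap means #3 was created after #1 had already finished.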

Billing impact for this one batch

  • Token cost (50% batch discount applied): $2.75

  • Tool cost (107 web searches × $0.025, no discount): $2.68

  • Total billed: ~$5.43

  • Expected cost with a single dispatch: ~$1.81

  • ~3x overspend on a single batch line
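
The arithmetic behind these figures, as a short sketch. The single-dispatch estimate assumes the three dispatches cost roughly the same, which is my approximation rather than something the billing page states.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


def cents(value: Decimal) -> Decimal:
    """Round to whole cents, half up."""
    return value.quantize(CENT, rounding=ROUND_HALF_UP)


token_cost = Decimal("2.75")         # after the 50% batch discount
searches = 107
price_per_search = Decimal("0.025")  # no batch discount on tool calls
dispatches = 3

tool_cost = searches * price_per_search
total = token_cost + tool_cost
# Assumption: each dispatch cost about one third of the total
per_dispatch = total / dispatches

print("tool cost:", cents(tool_cost))
print("total billed:", cents(total))
print("estimated single dispatch:", cents(per_dispatch))
```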

Multiplied across the 27 batches I submitted earlier today (which were also fanning out, then masked by a 429 retry loop on top), this explains today’s full $47.12 spend without any bug in my code.

Cancel behavior is still broken too

  • Cancel issued at 14:50:42 UTC, req_6d6d39b89ebf4f388d282ba21791081d, returned HTTP 200 with status: cancelling.

  • Four more Arizona batches cancelled in the same minute (req_6e2b9f00c36d4746988bb300533083d2, req_a6de1175c3ed42dca7108a83ddf84c36, req_0cece57e96ac446ab6115df8418941a4, req_4823a5e3065e4261a4d59aa4f3e901be).

  • All 5 still in cancelling with no cancelled_at more than 20 minutes later. Same stuck-cancel pattern as the morning batches.

Other persistent oddities

  • The Wyoming batch’s request_counts reads {total: 1, completed: 0, failed: 0} despite 3 fully-billed completions executing under it. Counters are not tracking reality.

  • output_file_id is null. We were billed ~$5.43 for work whose outputs are not delivered through the documented batch retrieval path. To recover the markdown we have to pull each resp_… individually by ID.

  • All three resp_… objects show background: false and metadata: {}, with no field linking them back to the parent batch. The connection only exists in OpenAI’s internal logs.
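
The manual recovery path looks roughly like the sketch below. The retrieval function is passed in so the snippet runs without network access; the helper name and the fake responses are illustrative only, and the status and output_text attribute names are what I believe the Python SDK's response object exposes.

```python
from types import SimpleNamespace
from typing import Callable, Iterable


def collect_outputs(response_ids: Iterable[str], retrieve: Callable[[str], object]) -> dict:
    """Fetch each response by ID and keep the text of the completed ones."""
    outputs = {}
    for rid in response_ids:
        resp = retrieve(rid)
        if getattr(resp, "status", None) == "completed":
            outputs[rid] = getattr(resp, "output_text", "")
    return outputs


# Stand-in for the API: fake responses keyed by made-up IDs
fake_store = {
    "resp_1": SimpleNamespace(status="completed", output_text="# Report A"),
    "resp_2": SimpleNamespace(status="failed", output_text=""),
}
recovered = collect_outputs(["resp_1", "resp_2"], fake_store.__getitem__)
print(recovered)
```

With the official Python SDK, something like client.responses.retrieve could presumably be passed as the retrieve argument. The IDs themselves still have to be copied out of the dashboard logs by hand, since nothing on the batch object lists them.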

Requests

1. Engineering escalation: a single batch line producing multiple billed completions, on a clean key with no 429s, is a serious correctness bug. Please escalate to the Batch API team.

2. Force-cancel these 5 batches: batch_6a0334d99ce481908a3c9cc7e9a4399c, batch_6a033b6ac91c81909999c3a124aafc4b, batch_6a033b6ab20c8190aa2f6b5311d1a0ad, batch_6a033b6abfbc8190bb2803d7e629713c, batch_6a033b6a12c8819096db4a2036e5e5cd. They are stuck in cancelling.

3. Credit today’s o3-deep-research-2025-06-26 spend down to the legitimate single-dispatch cost of the 28 lines submitted. Current spend is $47.12; legitimate cost is at most ~$50 if every line had succeeded once, but in reality many lines produced no usable output through the batch path, so the appropriate refund covers any execution beyond the first per submitted line.

4. Bug to file: the spawned /v1/responses objects should inherit the batch envelope’s metadata. Right now a customer hitting this bug has no way to attribute spawned responses to their batch, which makes triage take hours.