Hosted `shell` Continuations Require Missing `shell_call_output`

I found a reproducible issue with the Responses API when using the hosted shell tool together with previous_response_id.

A first response can complete successfully with a hosted shell call, but a direct continuation using only previous_response_id fails with:

Error code: 400 - {'error': {'message': 'No tool output found for shell call call_...', 'type': 'invalid_request_error', 'param': 'input', 'code': None}}

The surprising part is that the first response itself is already marked completed, and the assistant message includes the shell result, but the response payload often does not include a shell_call_output item.

Why This Looks Like a Bug

The API appears to require a shell_call_output item for continuation, while also sometimes not returning that item in the first response payload.

This creates an inconsistent contract:

  1. The hosted shell tool executes server-side.
  2. The first response is completed.
  3. The assistant can describe the shell output in natural language.
  4. But continuing from that response via previous_response_id can fail because the server says the shell output is missing.

Environment

  • Model: gpt-5.2
  • Responses API
  • Hosted tool: shell
  • Tested with Python SDK versions:
    • openai 2.24.0
    • openai 2.26.0
  • Result was the same on both versions for the main repro.

Main Reproduction

Request 1

Create a response with hosted shell:

from openai import AsyncOpenAI

client = AsyncOpenAI()

resp1 = await client.responses.create(
    model="gpt-5.2",
    input="Use the shell tool once to run: printf first_turn. Then briefly report the output.",
    tools=[{"type": "shell", "environment": {"type": "container_auto"}}],
    reasoning={"effort": "medium", "summary": "detailed"},
    include=["reasoning.encrypted_content"],
    background=True,
)

Poll until terminal with client.responses.retrieve(resp1.id).
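For completeness, this is roughly how I poll (a sketch; the helper name and the terminal-status set are mine, not from the SDK):

import asyncio

# Minimal background-polling helper (sketch). Terminal statuses per the
# Responses API docs; adjust the set if your SDK version differs.
async def wait_until_terminal(client, response_id, interval=2.0):
    while True:
        resp = await client.responses.retrieve(response_id)
        if resp.status in ("completed", "failed", "cancelled", "incomplete"):
            return resp
        await asyncio.sleep(interval)

resp1 = await wait_until_terminal(client, resp1.id)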

Observed Response 1 Shape

In multiple runs, the first completed response looked like this structurally:

{
  "status": "completed",
  "output": [
    {"type": "reasoning"},
    {"type": "shell_call", "status": "completed", "call_id": "call_..."},
    {"type": "reasoning"},
    {"type": "message"}
  ]
}
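A quick way to confirm this on the SDK object (a sketch; it assumes the typed output items expose a type attribute):

# Which item types came back on the completed response?
types = [item.type for item in resp1.output]
print(types)                         # e.g. ['reasoning', 'shell_call', 'reasoning', 'message']
print("shell_call_output" in types)  # False in the failing runs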

Notably absent:

  • shell_call_output

This is despite the assistant message already describing the shell result.

Request 2

Now continue directly from the first response:

resp2 = await client.responses.create(
    model="gpt-5.2",
    previous_response_id=resp1.id,
    input="Now answer with exactly: second turn worked",
    tools=[{"type": "shell", "environment": {"type": "container_auto"}}],
    reasoning={"effort": "medium", "summary": "detailed"},
    include=["reasoning.encrypted_content"],
    background=True,
)

Actual Result

This fails immediately with:

Error code: 400 - {'error': {'message': 'No tool output found for shell call call_...', 'type': 'invalid_request_error', 'param': 'input', 'code': None}}

Expected Result

If the hosted shell call completed server-side and the first response is already terminal, then either:

  1. previous_response_id continuation should work without any extra client-side tool-output replay, or
  2. the first response should always include the required shell_call_output item so the client can replay it deterministically.

Verified Workaround

The continuation works if I manually inject a shell_call_output input item in the next request:

resp2 = await client.responses.create(
    model="gpt-5.2",
    previous_response_id=resp1.id,
    input=[
        {
            "type": "shell_call_output",
            "call_id": "call_from_resp1",
            "status": "completed",
            "output": [
                {
                    "stdout": "first_turn",
                    "stderr": "",
                    "outcome": {"type": "exit", "exit_code": 0},
                }
            ],
        },
        {
            "role": "user",
            "content": "Now answer with exactly: second turn worked.",
        },
    ],
    reasoning={"effort": "medium", "summary": "detailed"},
    background=True,
)

This succeeds.
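For a less hard-coded replay, the call_id can be pulled from resp1 directly; only the stdout has to be reconstructed out-of-band, since the whole problem is that resp1 does not carry it (a sketch; attribute names follow the shapes shown above):

# Build the replay item from resp1 (sketch). The call_id is on the
# shell_call output item; the stdout is NOT in resp1 (that is the bug),
# but here it is known because the prompt pinned the command.
shell_call = next(item for item in resp1.output if item.type == "shell_call")
replay_item = {
    "type": "shell_call_output",
    "call_id": shell_call.call_id,
    "status": "completed",
    "output": [
        {"stdout": "first_turn", "stderr": "", "outcome": {"type": "exit", "exit_code": 0}}
    ],
}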

Additional Observation: Inconsistent shell_call_output Presence

After introducing manual shell_call_output replay in a chain, later hosted shell responses sometimes started including shell_call_output items automatically in their returned output.

So there seem to be two inconsistent behaviors:

  1. Some completed hosted-shell responses return only shell_call + message.
  2. Other completed hosted-shell responses return both shell_call and shell_call_output.

That inconsistency makes it difficult to know whether the client is expected to replay tool output or whether the server should already be carrying it forward.

Includes Tested

I also tested all documented include values that are compatible with reasoning models:

[
    "file_search_call.results",
    "web_search_call.results",
    "web_search_call.action.sources",
    "message.input_image.image_url",
    "computer_call_output.output.image_url",
    "code_interpreter_call.outputs",
    "reasoning.encrypted_content",
]

This did not fix the issue.

Related But Separate Issue

I am also investigating a separate 400 error in a larger workflow that mentions a missing reasoning item.

At the moment, I have not minimized that second issue to a standalone hosted-shell repro. In my local tests, once I manually replay shell_call_output, multi-turn hosted-shell chains can continue successfully and retain memory of earlier shell outputs.

So this report is specifically about the reproducible hosted shell continuation problem where:

  • the first response completes,
  • but continuation via previous_response_id fails unless the client manually reconstructs and submits shell_call_output.

Minimal Expected Contract

For hosted shell plus previous_response_id, one of these should be true consistently:

  1. hosted shell execution state is fully preserved server-side, so direct continuation works, or
  2. the API always returns the exact shell_call_output item needed for replay in the next request.

Right now, neither appears reliable enough.

Local Artifacts Collected

I collected raw response payloads during testing, including:

  • initial first-response payloads without shell_call_output
  • all-compatible-includes payloads
  • successful manual-shell_call_output workaround payloads
  • longer manual replay chains

If useful, I can also provide raw JSON examples.


I suspect that previous_response_id is more bad-idea technical debt that won’t see functional updates.

You’ll likely want to use, and report against, a Conversations API conversation ID instead.

Unlike a previous response ID (the mechanism available at the release of the Responses API), a conversation ID is created when you want it created, is mutable, and takes a single “delete” to make it go away.

Response ID faults:

  • is actually a “chain” of response IDs that have to be maintained forever without cleanup, or you will damage the conversation history.
  • has a very poor GET method, with separate shapes and API calls for “sent input” and “AI response”.
  • will not generate encrypted reasoning, so you can never truly get your call “exported”.
  • is dangerous for user data: to fulfill any promise of not retaining (and potentially leaking) user conversations, you’d have to maintain your own database of every ID, or walk the chain completely, and then make a series of deletions.
  • is non-budgetable: at the context limit, your choice is 1) non-functioning chats that return errors, or 2) breaking context-window caching with every new input (as old turns are discarded by “auto” truncation) - and only after running up to the model’s maximum input context and a maximum bill per turn.

It could have been judged unacceptable immediately, back when it was introduced a year ago.

A conversation ID is a bit more palatable as hosted conversation state, except that it still requires “store”: true for it not to fail silently, causing the same user-data retention issues - but at least then you can immediately auto-delete response IDs that are completely unwanted.
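For what it’s worth, the lifecycle looks roughly like this (a sketch; it assumes the conversation parameter on responses.create and the documented create/delete endpoints):

# Create a conversation, attach responses to it, delete it when done (sketch).
conv = await client.conversations.create()

resp = await client.responses.create(
    model="gpt-5.2",
    conversation=conv.id,  # hosted state lives on the conversation
    input="hello",
    store=True,            # required for conversation state, per the caveat above
)

# One delete removes the conversation's stored state:
await client.conversations.delete(conv.id)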


Thank you, will give this a try!


I reproduced a closely related variant locally using the Conversations API instead of previous_response_id, and it shows the same underlying failure mode.

What I tried:

  1. Created a new conversation with client.conversations.create()
  2. Sent a first background response in that conversation using hosted shell with container_auto.
  3. Waited until the first response reached completed.
  4. Sent a second response using the same conversation ID, not previous_response_id (see the sketch below).
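A sketch of those steps (same parameters as the earlier repro, with conversation in place of previous_response_id):

conv = await client.conversations.create()

resp1 = await client.responses.create(
    model="gpt-5.2",
    conversation=conv.id,
    input="Use the shell tool once to run: printf first_turn. Then briefly report the output.",
    tools=[{"type": "shell", "environment": {"type": "container_auto"}}],
    background=True,
)
# ... poll resp1 until terminal, as in the first repro ...

resp2 = await client.responses.create(
    model="gpt-5.2",
    conversation=conv.id,  # same conversation, no previous_response_id
    input="Now answer with exactly: second turn worked",
    background=True,
)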

Observed behavior:

  • The first response completed successfully.
  • Its returned output contained:
    • a user message
    • reasoning items
    • a shell_call
    • a final assistant message describing the shell result
  • But the persisted conversation items did not include a corresponding shell_call_output.
  • The second response then failed with the same class of error:
    No tool output found for shell call call_...

So this does not seem limited to previous_response_id.
I saw the same issue when continuing from stored conversation state.

What fixed it:

  • Before sending the second turn, I deleted the shell-related items from the conversation (a cleanup sketch follows this list):
    • shell_call
    • shell_call_output
    • local_shell_call
    • local_shell_call_output
  • After removing the shell tool artifacts, the second turn succeeded and the conversation still retained the plain user/assistant transcript.
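A minimal sketch of that cleanup (the items.list / items.delete method shapes are my assumption from the documented conversation-items endpoints):

# Shell artifacts to strip before the next turn (sketch):
SHELL_TYPES = {
    "shell_call",
    "shell_call_output",
    "local_shell_call",
    "local_shell_call_output",
}

items = await client.conversations.items.list(conv.id)
for item in items.data:
    if item.type in SHELL_TYPES:
        # Method signature assumed; see the conversation-items DELETE endpoint.
        await client.conversations.items.delete(item.id, conversation_id=conv.id)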

Interpretation:

  • The hosted shell execution clearly completed.
  • The assistant message had the shell result.
  • But the conversation state was left in a form that the next turn could not replay.
  • That strongly suggests an API/state persistence bug or inconsistent contract around hosted shell tool continuation, rather than simple client misuse.

If useful, I can also share the exact repro script and the conversation item shapes I observed before and after cleanup.
