Function calling with a recursive variant in an `anyOf` erroneously collapses to the simplest schema-valid variant

I’ve noticed a degenerate pattern in function calling, which I’ll illustrate with a minimal example. When I give the model a schema containing an `anyOf` whose variants include complex (here, recursive) ones, it consistently collapses to the simplest variant that satisfies the schema, even when the result is logically incoherent with the request.

To illustrate with a toy example: suppose an agent has a tool, “insert_content”, that inserts structured content into a text document. There are two types of content the agent can insert: a paragraph (a simple text field) or a list (a container whose items each hold further content, either paragraphs or other lists)*. The tool takes an array of Content. The schema for this tool is shown here.

{
  "name": "insert_content",
  "description": "Insert content into a document",
  "strict": true,
  "parameters": {
    "$defs": {
      "BulletList": {
        "additionalProperties": false,
        "description": "An unordered bullet list",
        "properties": {
          "items": {
            "description": "The list items",
            "items": {
              "$ref": "#/$defs/ListItem"
            },
            "type": "array"
          },
          "kind": {
            "const": "bullet_list",
            "type": "string"
          }
        },
        "required": [
          "items",
          "kind"
        ],
        "type": "object"
      },
      "Content": {
        "anyOf": [
          {
            "$ref": "#/$defs/Paragraph"
          },
          {
            "$ref": "#/$defs/BulletList"
          }
        ],
        "description": "A type of content that can be inserted"
      },
      "ListItem": {
        "additionalProperties": false,
        "description": "A single item in a bullet list",
        "properties": {
          "content": {
            "description": "Content inside the list item. Usually a paragraph, but can contain nested lists.",
            "items": {
              "$ref": "#/$defs/Content"
            },
            "type": "array"
          }
        },
        "required": [
          "content"
        ],
        "type": "object"
      },
      "Paragraph": {
        "additionalProperties": false,
        "description": "A paragraph containing text with optional formatting",
        "properties": {
          "kind": {
            "const": "paragraph",
            "type": "string"
          },
          "text": {
            "description": "Text with inline markup",
            "type": "string"
          }
        },
        "required": [
          "kind",
          "text"
        ],
        "type": "object"
      }
    },
    "additionalProperties": false,
    "properties": {
      "content": {
        "description": "Array of all content to insert",
        "items": {
          "$ref": "#/$defs/Content"
        },
        "type": "array"
      }
    },
    "required": [
      "content"
    ],
    "title": "ProposeContentInsertion",
    "type": "object"
  }
}
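For reference, here is a rough Python sketch of the kind of arguments I would expect for a bullet-list request, together with a small hand-written checker that mirrors the schema above. The payload text, the variable `expected_arguments`, and the helper names (`is_content`, `is_list_item`, `is_valid_arguments`) are made up for illustration; this is not a full JSON Schema validator, only a structural check for this one schema.

```python
# Hypothetical example of arguments that use the BulletList variant,
# including one nested list. The text values are invented.
expected_arguments = {
    "content": [
        {
            "kind": "bullet_list",
            "items": [
                {"content": [{"kind": "paragraph", "text": "First point"}]},
                {
                    "content": [
                        {"kind": "paragraph", "text": "Second point"},
                        {
                            "kind": "bullet_list",
                            "items": [
                                {"content": [{"kind": "paragraph", "text": "Nested detail"}]}
                            ],
                        },
                    ]
                },
            ],
        }
    ]
}


def is_content(node):
    """True if node matches the Paragraph or BulletList definition."""
    if not isinstance(node, dict):
        return False
    if node.get("kind") == "paragraph":
        return set(node) == {"kind", "text"} and isinstance(node["text"], str)
    if node.get("kind") == "bullet_list":
        return (
            set(node) == {"kind", "items"}
            and isinstance(node["items"], list)
            and all(is_list_item(item) for item in node["items"])
        )
    return False


def is_list_item(node):
    """True if node matches the ListItem definition (recursing into Content)."""
    return (
        isinstance(node, dict)
        and set(node) == {"content"}
        and isinstance(node["content"], list)
        and all(is_content(child) for child in node["content"])
    )


def is_valid_arguments(args):
    """True if args matches the top-level parameters object."""
    return (
        isinstance(args, dict)
        and set(args) == {"content"}
        and isinstance(args["content"], list)
        and all(is_content(child) for child in args["content"])
    )


print(is_valid_arguments(expected_arguments))
```

Both this nested payload and a paragraphs-only payload pass the check, which is the crux of the problem: the schema alone cannot distinguish the output I want from the collapsed one.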

No matter what I request (e.g. “Insert a short bullet-list study guide about OpenAI”), the tool call always emits only Paragraph items and never a BulletList. Even if I follow up with “You chose to emit paragraphs instead of a bullet list. Strictly insert a bullet list using BulletList content, not raw top-level paragraphs”, it still never inserts a BulletList. How can I reliably get the behavior I expect?

With the schema above, the issue should reproduce readily in the Chat sandbox with the Responses API; I have verified it with GPT-4.1 and newer models.
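To make the collapse easy to spot when reproducing, something like the following sketch could be run over the raw arguments string of each tool call. It assumes the arguments arrive as a JSON string; the helper names (`count_kinds`, `collapsed_to_paragraphs`) and the sample string are hypothetical, not actual model output.

```python
import json
from collections import Counter


def count_kinds(arguments_json):
    """Count every "kind" value found anywhere in the tool call arguments."""
    counts = Counter()

    def walk(node):
        if isinstance(node, dict):
            kind = node.get("kind")
            if isinstance(kind, str):
                counts[kind] += 1
            for value in node.values():
                walk(value)
        elif isinstance(node, list):
            for value in node:
                walk(value)

    walk(json.loads(arguments_json))
    return counts


def collapsed_to_paragraphs(arguments_json):
    """True if the call contains paragraphs but no bullet_list at any depth."""
    counts = count_kinds(arguments_json)
    return counts["paragraph"] > 0 and counts["bullet_list"] == 0


# Invented sample shaped like the collapsed output described above.
sample = json.dumps(
    {
        "content": [
            {"kind": "paragraph", "text": "- Point one"},
            {"kind": "paragraph", "text": "- Point two"},
        ]
    }
)
print(count_kinds(sample), collapsed_to_paragraphs(sample))
```

Running this across several prompts gives a quick tally of how often BulletList is ever selected.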

*In this toy example, it would be simpler for the tool to output something like Markdown and sanitize it downstream, but in my actual use case the possible outputs are more complex and require structured output.