Yet another gpt-image-1 pricing issue

Hi!

Note: I’ve read most of the threads about gpt-image-1 pricing, but I’m still stuck reconciling my usage with the exported billing report.

I’m using the Responses API to generate an image and classify results. Here’s the Python call:

response = await self.client.responses.parse(
    model='gpt-4.1',
    input=[
        {"role": "system", "content": dedent(self.system_prompt)},
        {
            "role": "user",
            "content": [
                {"type": "input_text", "text": user_prompt},
                {"type": "input_image", "image_url": image_url},
            ],
        },
    ],
    tools=[{
        "type": "image_generation",
        "background": "transparent",
        "input_fidelity": "low",
        "quality": "medium",
        "output_format": "png",
        "moderation": "low",
        "size": "1024x1024"
    }],
    text_format=ClassificationOutput
)

My local counts:

  1. system_prompt+user_prompt: 891 tokens (by tiktoken, o200k_base)
  2. input image, 1024x768: 129*2 + 65 = 323 input tokens to gpt-image-1 (low input fidelity, two tiles)
  3. pydantic json ClassificationOutput: ~300 tokens (by tiktoken, o200k_base)
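
For reference, here is the arithmetic behind those local counts as a small sketch. The per-tile and base constants are the ones I assumed in item 2, not confirmed billing rules, and the function name is only illustrative.

```python
# Constants from my estimate in item 2 (assumptions, not confirmed billing rules).
TOKENS_PER_TILE = 129
BASE_IMAGE_TOKENS = 65


def estimate_image_input_tokens(tiles: int) -> int:
    """Estimate input-image tokens as per-tile cost times tile count plus a base cost."""
    return TOKENS_PER_TILE * tiles + BASE_IMAGE_TOKENS


text_tokens = 891      # system_prompt + user_prompt, counted with tiktoken
image_tokens = estimate_image_input_tokens(tiles=2)  # 1024x768 treated as two tiles
schema_tokens = 300    # approximate size of the ClassificationOutput JSON

local_total = text_tokens + image_tokens + schema_tokens
print(image_tokens)  # 323
print(local_total)   # 1514
```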

Billing export (corresponding entries):

num_model_requests  model                   input_tokens  output_tokens  batch  service_tier
1.0                 gpt-image-1-2025-04-23  470.0         1056.0         FALSE  default
1.0                 gpt-4.1-2025-04-14      4490.0        486.0          FALSE  default
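
To make the mismatch concrete, this sketch subtracts my local estimates from the billed input tokens. How the inputs are split between the two models here is my own guess (image tokens to gpt-image-1, text plus schema to gpt-4.1), which may well be the part I have wrong.

```python
# Billed values copied from the export rows above.
billed_input = {
    "gpt-image-1-2025-04-23": 470,
    "gpt-4.1-2025-04-14": 4490,
}

# My local estimates; the per-model attribution below is an assumption.
local = {"text": 891, "image": 323, "schema": 300}

# Assumed: only the input image is charged to gpt-image-1.
gap_image_model = billed_input["gpt-image-1-2025-04-23"] - local["image"]

# Assumed: the prompts and the output schema are charged to gpt-4.1.
gap_text_model = billed_input["gpt-4.1-2025-04-14"] - (local["text"] + local["schema"])

print(gap_image_model)  # 147 input tokens on gpt-image-1 I cannot explain
print(gap_text_model)   # 3299 input tokens on gpt-4.1 I cannot explain
```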

Questions

  1. Am I correct that this single call triggers both an image generation (gpt-image-1) and then a separate classification pass on gpt-4.1?

  2. How should I derive the input token counts for each model?

    • For image generation, I expected input_tokens = input_image_tokens + input_text_tokens, but my text tokens alone (891) already exceed the billed 470.

    • For the gpt-4.1 classification, even if the generated image is passed back at high quality, I would expect far fewer than 4490 input tokens.

What am I misunderstanding or doing wrong when estimating these numbers? Any guidance on how the Responses API attributes token usage between gpt-4.1 and gpt-image-1, and on how to reproduce the billed counts, would be much appreciated.