On the Responses API endpoint, you must use a “chat” AI model: one trained to hold a conversation, produce “assistant” turns, and follow your instructions.
Then you enable the tools you want the AI model to be able to use.
Where the “Image generation tool” documentation page says:
…it is not referring to the names of the image generation models within the tool. The text refers to the list of recommended AI models that appeared immediately before, such as gpt-4o or gpt-5: “chat” models that are also capable of calling internal tools on the Responses API. The text is saying that you have a more limited selection of AI models available if you want “chat, with image creation” (behaving the way ChatGPT does for free users). (The interloping image model names should be removed from that part of the documentation.)
If you go to the API prompts playground (and ensure you have picked “responses” in the kebab menu drop-down), then you can use internal hosted tools:
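In code, the playground setup above corresponds to a request body like the following minimal sketch (the model name and prompt are illustrative, not requirements):

```python
# A minimal Responses API request body that enables the hosted image
# generation tool. "gpt-4o" is one example of a chat model that can
# call internal tools; swap in whichever supported model you use.
request_body = {
    "model": "gpt-4o",
    "input": "Create an image of a cute sea monster",
    "tools": [
        {"type": "image_generation"},  # enable the hosted image tool
    ],
}
```

With the official Python SDK, a dict like this maps directly onto `client.responses.create(**request_body)`.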
Within the “image gen” tool, you then have the tool configuration options, also documented when you fully expand the API reference, including the image creation models:
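As a sketch, those configuration options sit directly on the tool object itself; the parameter names below follow the expanded API reference, but treat the specific values as illustrative:

```python
# The image generation tool with configuration options set inline.
# "gpt-image-1" is one of the selectable image models inside the tool;
# size and quality values here are examples, not defaults you must use.
image_tool = {
    "type": "image_generation",
    "model": "gpt-image-1",   # image model used inside the tool
    "size": "1024x1024",
    "quality": "high",
}
```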
If the user then says something where that added image creation tool is useful, such as “create an image of a cute sea monster” (rather than “calculate pi to 100 digits”), the AI may call the tool.
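When the tool is called, the response’s output list contains an item of type `image_generation_call` whose `result` field is a base64-encoded image string. A sketch of extracting it, using a mocked response shape for illustration:

```python
import base64

# Mocked Responses output: an image tool call followed by a text message.
mock_output = [
    {
        "type": "image_generation_call",
        "result": base64.b64encode(b"\x89PNG fake image bytes").decode(),
    },
    {
        "type": "message",
        "content": [{"type": "output_text", "text": "Here is your sea monster."}],
    },
]

# Decode every image the tool produced in this turn.
images = [
    base64.b64decode(item["result"])
    for item in mock_output
    if item["type"] == "image_generation_call"
]
```

With a real response object you would iterate `response.output` the same way and write the decoded bytes to a file.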
First, be familiar with using Chat Completions and Responses, parsing streams and events as model responses arrive, and delivering responses to users. Then you can move on to enabling tools and receiving their unique events.
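Those unique events arrive interleaved with ordinary text deltas when streaming. A sketch of routing them by type; the event type strings follow the documented streaming events for the image tool, but the event dicts here are mocked and the field names are assumptions:

```python
# Route a streamed Responses event by its "type" field. The image tool
# emits its own event types, e.g. partial-image previews while the
# image is still being generated.
def handle_event(event: dict) -> str:
    etype = event.get("type", "")
    if etype == "response.output_text.delta":
        return "text:" + event["delta"]
    if etype == "response.image_generation_call.partial_image":
        return "partial-image #%d" % event["partial_image_index"]
    if etype == "response.completed":
        return "done"
    return "ignored:" + etype
```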
Note: what is highly undocumented is how this internal tool uses vision and the chat context: many or all images in the chat history are sent to the image tool model, a copy of the whole chat and not just an AI-written prompt. With gpt-image-1.5 seemingly billing permanently for “input_fidelity”: “high” at up to 6000+ tokens, on top of the “vision” price and the “image input” price, a chat containing images even unrelated to the latest picture can run an “input” bill of $0.06 for every recently-discussed image before you even pay for a generated image output (which can also be billed when the generation is refused).
If you want only prompt-based images and controlled costs: make your own function.
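One way to do that (a sketch; the tool name and handler below are illustrative, not an official pattern): expose your own function tool, so the chat model hands you only a short prompt string, and you call the Images API yourself with exactly that prompt. No chat history or prior images are forwarded.

```python
# A function tool the chat model can call instead of the hosted image
# tool. The name "draw_image" and its schema are our own choices.
draw_tool = {
    "type": "function",
    "name": "draw_image",
    "description": "Create one image from a short text prompt.",
    "parameters": {
        "type": "object",
        "properties": {
            "prompt": {"type": "string", "description": "Scene description"},
        },
        "required": ["prompt"],
    },
}

def run_draw_image(arguments: dict) -> str:
    """Handle the model's function call; only arguments['prompt'] leaves the chat."""
    prompt = arguments["prompt"]
    # Here you would call the Images API yourself, e.g.
    # client.images.generate(model="gpt-image-1", prompt=prompt),
    # and return/display the result. Returning the prompt here shows
    # the only data that gets sent to the image model.
    return prompt
```

Because you control the handler, you decide exactly what text reaches the image model and which image model and quality tier you pay for.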