How to efficiently include image inputs in a multi-turn chat?

Hello there!

I am working on a conversational agent that lets users upload images at any point in the conversation. Ideally, the agent should be able to “see” what’s inside the image when it is uploaded and retain some context about it in subsequent messages.

Conversations with this agent are usually not extremely long, but still substantial, at around 10-30 turns.

My question is: given that images are fairly token-heavy, what is the most efficient approach to giving the agent this context without passing the multimodal input on every turn?

If you have any experience with this, I would be glad to hear your thoughts!
Thank you very much 🙂


If you don’t want to always supply the images inline, one option is to create a function that looks back through the conversation for images whenever the user’s prompt seems to refer to, or require, a recent image.

In my chatbot I have implemented both strategies, and you can switch between them in settings.
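
A rough sketch of the look-back idea in Python (not my exact code; the keyword heuristic and placeholder text are just illustrations for whatever trigger you prefer, e.g. a tool/function call decided by the model):

```python
# Keep the history text-only and re-attach the most recent image
# only when the new prompt seems to need it.

IMAGE_HINTS = ("image", "picture", "photo", "screenshot", "diagram")

def strip_images(messages):
    """Return a copy of the history with image parts replaced by a text placeholder."""
    cleaned = []
    for msg in messages:
        content = msg.get("content")
        if isinstance(content, list):
            parts = []
            for part in content:
                if part.get("type") == "image_url":
                    parts.append({"type": "text", "text": "[image previously shared]"})
                else:
                    parts.append(part)
            cleaned.append({**msg, "content": parts})
        else:
            cleaned.append(msg)
    return cleaned

def latest_image(messages):
    """Find the most recently uploaded image part, if any."""
    for msg in reversed(messages):
        content = msg.get("content")
        if isinstance(content, list):
            for part in content:
                if part.get("type") == "image_url":
                    return part
    return None

def build_request_messages(history, user_text):
    """Assemble the next turn, re-attaching the latest image only on demand."""
    messages = strip_images(history)
    new_content = [{"type": "text", "text": user_text}]
    if any(hint in user_text.lower() for hint in IMAGE_HINTS):
        image = latest_image(history)
        if image is not None:
            new_content.append(image)
    messages.append({"role": "user", "content": new_content})
    return messages
```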


Images at "detail": "low" cost only 85 tokens for gpt-4o, gpt-4.1, or gpt-4.5, and 65 tokens for o-series models.
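
For example, a minimal sketch with the official Python SDK (the model name and image URL are placeholders):

```python
from openai import OpenAI

client = OpenAI()

# Low-detail image input with the Chat Completions API; "detail": "low"
# caps the image at a small fixed token cost regardless of its resolution.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/photo.jpg",  # placeholder URL
                        "detail": "low",
                    },
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```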

You can also switch to the Responses API and avoid re-uploading or re-fetching existing images on each turn by supplying the previous response ID to manage conversation state.
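
A minimal sketch of that pattern with the Python SDK (again, the model name and URL are placeholders):

```python
from openai import OpenAI

client = OpenAI()

# First turn: send the image once.
first = client.responses.create(
    model="gpt-4o",
    input=[
        {
            "role": "user",
            "content": [
                {"type": "input_text", "text": "Here is a photo of my setup."},
                {
                    "type": "input_image",
                    "image_url": "https://example.com/photo.jpg",  # placeholder URL
                    "detail": "low",
                },
            ],
        }
    ],
)

# Later turns: reference the stored conversation state instead of
# re-sending the image with every request.
followup = client.responses.create(
    model="gpt-4o",
    previous_response_id=first.id,
    input="What color is the object in that photo?",
)
print(followup.output_text)
```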


Vision capabilities have come a long way, and you could very well just use the "detail": "low" setting with either gpt-4o or gpt-4.1. Here’s some more info on the pricing, and you can also use the neat calculator on the main pricing page under “How is pricing calculated for images?” (it’s all the way at the bottom).

I’ve tried it on some dense .pdf screenshots, and it’s done a surprisingly good job of converting even schematics like flow charts into markdown text with all the content intact.


Thank you very much, guys. I’ll go with the low-detail setting!
