How to efficiently include image inputs in a multi-turn chat?

Hello there!

I am working on a conversational agent that lets users upload images at any point in the conversation. Ideally, the agent should be able to “see” what’s inside the image when it is uploaded and retain some context about it in subsequent messages.

Conversations with this agent are usually not extremely long, but still substantial, at around 10-30 turns.

My question is: given that images are fairly token-heavy, what is the most efficient approach to giving the agent this context without passing the multimodal input on every turn?

If you have any experience with this, I would be glad to hear your thoughts!
Thank you very much 🙂


If you don’t want to always supply the images inline, one option is to create a function that looks back through the conversation for images whenever the user’s prompt seems to refer to, or require, a recent image.

In my chatbot I have implemented both strategies, and you can switch between them in settings.
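
A rough sketch of the look-back idea in Python (not my exact code; the keyword heuristic and placeholder text are just illustrations for whatever trigger you prefer, e.g. a tool/function call decided by the model):

```python
# Keep the history text-only and re-attach the most recent image
# only when the new prompt seems to need it.

IMAGE_HINTS = ("image", "picture", "photo", "screenshot", "diagram")

def strip_images(messages):
    """Return a copy of the history with image parts replaced by a text placeholder."""
    cleaned = []
    for msg in messages:
        content = msg.get("content")
        if isinstance(content, list):
            parts = []
            for part in content:
                if part.get("type") == "image_url":
                    parts.append({"type": "text", "text": "[image previously shared]"})
                else:
                    parts.append(part)
            cleaned.append({**msg, "content": parts})
        else:
            cleaned.append(msg)
    return cleaned

def latest_image(messages):
    """Find the most recently uploaded image part, if any."""
    for msg in reversed(messages):
        content = msg.get("content")
        if isinstance(content, list):
            for part in content:
                if part.get("type") == "image_url":
                    return part
    return None

def build_request_messages(history, user_text):
    """Assemble the next turn, re-attaching the latest image only on demand."""
    messages = strip_images(history)
    new_content = [{"type": "text", "text": user_text}]
    if any(hint in user_text.lower() for hint in IMAGE_HINTS):
        image = latest_image(history)
        if image is not None:
            new_content.append(image)
    messages.append({"role": "user", "content": new_content})
    return messages
```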


Images at "detail": "low" cost only 85 tokens for gpt-4o, gpt-4.1, or gpt-4.5, and 65 tokens for o-series models.
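
For example, a minimal sketch with the official Python SDK (the model name and image URL are placeholders):

```python
from openai import OpenAI

client = OpenAI()

# Low-detail image input with the Chat Completions API; "detail": "low"
# caps the image at a small fixed token cost regardless of its resolution.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/photo.jpg",  # placeholder URL
                        "detail": "low",
                    },
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```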

You can also switch to the Responses API and avoid re-uploading or re-fetching existing images on each turn by supplying the previous response ID to manage conversation state.
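
A minimal sketch of that pattern with the Python SDK (again, the model name and URL are placeholders):

```python
from openai import OpenAI

client = OpenAI()

# First turn: send the image once.
first = client.responses.create(
    model="gpt-4o",
    input=[
        {
            "role": "user",
            "content": [
                {"type": "input_text", "text": "Here is a photo of my setup."},
                {
                    "type": "input_image",
                    "image_url": "https://example.com/photo.jpg",  # placeholder URL
                    "detail": "low",
                },
            ],
        }
    ],
)

# Later turns: reference the stored conversation state instead of
# re-sending the image with every request.
followup = client.responses.create(
    model="gpt-4o",
    previous_response_id=first.id,
    input="What color is the object in that photo?",
)
print(followup.output_text)
```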


Vision capabilities have come a long way, and you could very well just use the "detail": "low" setting with either gpt-4o or gpt-4.1. Here’s some more info on the pricing, and you can also use the neat calculator on the main pricing page under “How is pricing calculated for images?” (it’s all the way at the bottom).

I’ve tried it on some dense .pdf screenshots, and it’s done a surprisingly good job of converting even schematics like flow charts into markdown text with all the content intact.


Thank you very much, guys. I’ll go with the low-detail setting!
