So I gave up and wrote a function.
With hindsight this has one huge advantage: you can use a cheaper LLM to hold the general conversation and restrict calls to GPT 4 Vision for just image analysis.
UPDATE: looks like Open AI not only released an updated preview model with vision with function calling but also created a new alias for gpt-4-turbo
which has vision!
gpt-4-turbo
More here: https://platform.openai.com/docs/models/continuous-model-upgrades
My solution is still a good one because it keeps the costs down.