Prompt for Image to JSON conversion

I have a process diagram in an image. I convert the image to base64 and attach it to my prompt, along with instructions to interpret the image and return the output in a JSON structure that I provide in the prompt. It works, but not always, and I am looking to improve its accuracy.
What specific things should I look out for when writing a prompt for this kind of conversion?

Welcome to the community!

This is generally a tricky undertaking because of the way the models hallucinate.

I would suggest adding a “thinking” field to the top of your JSON schema, where you tell the model to explicitly reason about the contents of the image before translating it into your specific JSON structure.
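To make that concrete, here is a rough sketch of what such a schema could look like, plus a small check on the reply. I don't know your actual structure, so the field names (`steps`, `connections`, `id`, `label`, `from`, `to`) and the helper `parse_diagram_response` are made up for illustration; only the idea of putting “thinking” first comes from the suggestion above.

```python
import json

# Hypothetical schema for a process diagram; adapt the fields to your own structure.
DIAGRAM_SCHEMA = {
    "type": "object",
    "properties": {
        # Listed first so the model writes its reasoning before the data.
        "thinking": {
            "type": "string",
            "description": "Describe every box, arrow and label you see "
                           "in the image before filling in the other fields.",
        },
        "steps": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "id": {"type": "string"},
                    "label": {"type": "string"},
                },
                "required": ["id", "label"],
            },
        },
        "connections": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "from": {"type": "string"},
                    "to": {"type": "string"},
                },
                "required": ["from", "to"],
            },
        },
    },
    "required": ["thinking", "steps", "connections"],
}


def parse_diagram_response(raw: str) -> dict:
    """Parse the model's reply, run basic sanity checks, and drop the reasoning."""
    data = json.loads(raw)
    missing = [key for key in DIAGRAM_SCHEMA["required"] if key not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    data.pop("thinking")  # only needed while the model is generating
    known_ids = {step["id"] for step in data["steps"]}
    for conn in data["connections"]:
        # An arrow pointing at a step that was never listed is a common sign
        # that the model misread the diagram.
        if conn["from"] not in known_ids or conn["to"] not in known_ids:
            raise ValueError(f"connection references unknown step: {conn}")
    return data
```

The schema dictionary can be serialized with `json.dumps(DIAGRAM_SCHEMA)` and pasted into the prompt. The connection check is only one example of a cheap consistency test; which checks are worth adding depends on the failure modes you actually see.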

The issue is that as soon as the model misrepresents something, it’s very difficult to iron that kink back out again. That makes it hard to use LLM vision in production.

What specific failure modes are you encountering? It’s probably best to work the issues out one by one until you get to an acceptable level of reliability.