Function Calling With Vision, best order?

Hey there o/

I’m building an app that analyzes images and returns structured data about them. I’ve written a prompt that tells the LLM to base its analysis on a few specific criteria.

But I don’t know whether I should place this prompt in the Vision API request or in the function call.

Prompt example: "You should analyze this marketing image with 2 criteria:

  1. Big (Does it have a big affirmation?)
  2. Easy (Does it make it look somewhat easy?)

Then you should provide a grade from 0 to 10 for each."

So I have 2 options:

  1. Send the prompt along with the image so the model analyzes only the criteria I need, THEN use function calling just to get that analysis back as structured data (first sketch below).
  2. Have the model read every detail of the image first, then send that description to the function calling API together with the prompt, so it does the analysis while structuring the response (second sketch below).
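
For reference, option 1 would be a single chat.completions request, roughly like this. It's only a minimal sketch assuming the OpenAI Python SDK; the `record_grades` tool name, its fields, and the image URL are placeholders I made up, not anything from my real app:

```python
from openai import OpenAI

client = OpenAI()

# Placeholder tool schema for the structured grades (names are made up).
tools = [{
    "type": "function",
    "function": {
        "name": "record_grades",
        "description": "Record the 0-10 grade for each criterion.",
        "parameters": {
            "type": "object",
            "properties": {
                "big": {"type": "integer", "description": "Big: does it have a big affirmation? 0-10"},
                "easy": {"type": "integer", "description": "Easy: does it make it look somewhat easy? 0-10"},
            },
            "required": ["big", "easy"],
        },
    },
}]

criteria_prompt = (
    "You should analyze this marketing image with 2 criteria:\n"
    "1. Big (Does it have a big affirmation?)\n"
    "2. Easy (Does it make it look somewhat easy?)\n"
    "Then you should provide a grade from 0 to 10 for each."
)

# Option 1: image, criteria prompt, and tool schema all in one request.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": criteria_prompt},
            {"type": "image_url", "image_url": {"url": "https://example.com/ad.png"}},
        ],
    }],
    tools=tools,
    tool_choice={"type": "function", "function": {"name": "record_grades"}},
)

# The grades come back as JSON in the tool call arguments.
print(response.choices[0].message.tool_calls[0].function.arguments)
```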

The Vision API does not seem to have an assistant role, which would probably be my go-to for this. Also, I want the best-performing solution, not necessarily the cheapest.
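
And for comparison, option 2 would be two separate calls, again just a sketch (it reuses the `tools` schema and `criteria_prompt` from the first sketch):

```python
from openai import OpenAI

client = OpenAI()

# tools and criteria_prompt are the same placeholders defined in the option-1 sketch.

# Step 1: have the vision model describe the image in detail, with no criteria yet.
description = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe every relevant detail of this marketing image."},
            {"type": "image_url", "image_url": {"url": "https://example.com/ad.png"}},
        ],
    }],
).choices[0].message.content

# Step 2: send that description plus the criteria prompt to a text-only
# function-calling request that returns the structured grades.
graded = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": criteria_prompt},
        {"role": "user", "content": f"Image description:\n{description}"},
    ],
    tools=tools,
    tool_choice={"type": "function", "function": {"name": "record_grades"}},
)

print(graded.choices[0].message.tool_calls[0].function.arguments)
```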