Vision Model Fine-tuning Query

I am fine-tuning a vision model to interpret images of geometric shapes, with the results returned as JSON in this form:

{
  "ShapeAnalysis": {
    "shapes": [
      {
        "type": "Lorem Ipsum",  # my shape names are custom; a rectangle might be called "Content placeholder", for example
        "parameters": {
          "orientation": "horizontal",
          "text": "A1. Lorem Ipsum…"  # random text
        }
      },
      {
        "type": "SpecialShape2",
        "parameters": {
          "Steps": 10,
          "text": "Lorem ipsum",
          "type": "Underline"  # or "Box", "Lorem ipsum", etc.
        }
      }
    ]
  }
}
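(Since comments and unquoted values aren't legal JSON, here's a minimal Python sketch that builds the target output above and round-trips it through the `json` module to confirm it parses; the shape names and text are the placeholders from my example.)

```python
import json

# The target output above, built as a Python dict so the serialized form is
# guaranteed to be valid JSON (quoted keys, commas, no comments).
target = {
    "ShapeAnalysis": {
        "shapes": [
            {
                "type": "Content placeholder",  # custom shape name
                "parameters": {
                    "orientation": "horizontal",
                    "text": "A1. Lorem Ipsum…",  # random text
                },
            },
            {
                "type": "SpecialShape2",
                "parameters": {
                    "Steps": 10,
                    "text": "Lorem ipsum",
                    "type": "Underline",  # one of Underline / Box / ...
                },
            },
        ]
    }
}

serialized = json.dumps(target, ensure_ascii=False, indent=2)
# Round-trip: the string the model would emit must parse back to the same dict.
assert json.loads(serialized) == target
print(serialized)
```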

The thing is, stock ChatGPT could definitely interpret these shapes in its own words, but I need the output in this exact format.

Each image can have 5-10 such shapes. Each shape can have 2-3 parameters like text and whatever is relevant for that shape. There are 300 such shapes.
I'm thinking of fine-tuning on 1 million rows, keeping the shape distribution as balanced as I can.
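For reference, each of those rows would look something like this. This is a sketch assuming an OpenAI-style vision fine-tuning chat format (JSONL, one row per line); the system prompt, user text, and image URL are hypothetical placeholders, and the assistant message is the exact JSON string the model should learn to emit.

```python
import json

# One hypothetical training row: image in, strict JSON out.
row = {
    "messages": [
        {
            "role": "system",
            "content": "Return the shapes in the image as JSON only.",
        },
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Analyze this diagram."},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/diagram_001.png"},
                },
            ],
        },
        {
            # The completion is the serialized target JSON, not a dict,
            # so the model is trained on the literal string to reproduce.
            "role": "assistant",
            "content": json.dumps({
                "ShapeAnalysis": {
                    "shapes": [
                        {
                            "type": "Content placeholder",
                            "parameters": {
                                "orientation": "horizontal",
                                "text": "A1. Lorem Ipsum",
                            },
                        }
                    ]
                }
            }),
        },
    ]
}

# Each line of the JSONL training file is one such row.
print(json.dumps(row))
```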

Will this work? Is the data too little? Is the task too complex? Is this use case even a good fit for fine-tuning?

I would love any helpful insights:

How much data is too little, how much is enough, and how much is overkill for my use case? Is my task too complex for a vision model to learn?

Is there anything else I can optimize for?

Thanks