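Obtaining object dimensions and bounding boxes from an AI vision model relies on a capability called grounding: locating the region of an image that corresponds to a given description.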
You can, for example, see how Azure can augment gpt-4-vision with its own vision products.
Other AI vision models, such as MiniGPT-v2 (which Vision-CAIR demonstrates in a Hugging Face Space), can perform grounding and object identification.
Such bounding-box coordinates are needed as a basis for any measurement.
gpt-4-vision on its own may describe the image and can be coaxed into estimating sizes, but those estimates are unlikely to be reliable.
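To illustrate why bounding boxes are the basis for measurement: once a grounding model has returned pixel boxes, converting them to real-world sizes is simple arithmetic, provided something of known size is in the frame. The sketch below is hypothetical and not tied to any particular model's output format. It assumes boxes in pixel coordinates, a reference object (here a standard ID-1 card, 85.6 mm wide) lying roughly in the same plane as the target, and no perspective or lens correction. The `Box` class and `estimate_size` function are illustrative names, not part of any API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in pixel coordinates (hypothetical format)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    @property
    def width(self) -> float:
        return self.x_max - self.x_min

    @property
    def height(self) -> float:
        return self.y_max - self.y_min


def estimate_size(target: Box, reference: Box, reference_width_cm: float) -> tuple[float, float]:
    """Estimate the target's width and height in cm from a reference of known width.

    Assumption: both objects lie roughly in the same plane facing the camera,
    so a single cm-per-pixel scale applies to both (no perspective correction).
    """
    if reference.width <= 0:
        raise ValueError("reference box must have a positive width")
    cm_per_pixel = reference_width_cm / reference.width
    return target.width * cm_per_pixel, target.height * cm_per_pixel


# Hypothetical boxes, as a grounding model might return them.
card = Box(100, 400, 300, 526)   # 200 px wide; a real card is 8.56 cm wide
phone = Box(350, 200, 500, 500)  # 150 px wide, 300 px tall
w, h = estimate_size(phone, card, 8.56)
print(f"phone is roughly {w:.1f} cm x {h:.1f} cm")  # phone is roughly 6.4 cm x 12.8 cm
```

The accuracy of such an estimate depends entirely on how tight the boxes are and how well the same-plane assumption holds, which is why a model that outputs coordinates is more useful here than one that only describes the scene.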