How do I know if my fine-tuned model is actually better than the base model? (For MATH-related use cases)

For my specific use case, I want to create a fine-tuned model from either GPT-3.5 Turbo or GPT-4 with vision and OCR capabilities, so that the model can “understand” whatever math problems it is fed, produce correct answers, and generate presentable graphics to illustrate its points (I’m not expecting much here… well, maybe someone has already done this).

Source: https://openreview.net/pdf?id=E4hK8t7Fts

These are the questions I need answered:

  1. After fine-tuning, how do I evaluate whether the fine-tuned model is actually better IF the results fluctuate? I mean, the bot can be right sometimes and wrong other times, or in the worst case it doesn’t answer explicitly and just apologizes. (See the evaluation sketch after this list.)

  2. Can I take an existing math-focused system (like Wolfram, which as I understand it is a computational engine rather than an LLM) and integrate it with GPT-3.5 or GPT-4? (See the function-calling sketch after this list.)

  3. How can GPT-4 (via the API) produce math-related answers to users’ questions, especially statistical calculations with clear graphical representations? And when users input pictures or handwriting, I’d expect the bot to produce pictures that represent its answer. (See the plotting sketch after this list.)
    So far, I’ve had no luck producing correct and presentable answers, even for primary-grade math/statistics… not to mention my target of PhD-level math.
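
For question 1, here is a minimal sketch of the kind of evaluation harness I have in mind: since outputs fluctuate, run both models over the same fixed test set several times and compare mean accuracy and spread. Everything here is a placeholder I made up for illustration (`TEST_SET`, `grade`, the wrapper callables), not an established benchmark:

```python
import statistics

# Hypothetical test set of (problem, expected_answer) pairs.
TEST_SET = [
    ("What is 17 * 24?", "408"),
    ("What is the mean of 2, 4, 6, 8?", "5"),
]

def grade(model_answer: str, expected: str) -> bool:
    # Naive containment grading; real math grading needs answer
    # normalization (e.g., "408" vs. "408.0") or a symbolic checker.
    return expected in model_answer

def accuracy_over_runs(ask_model, n_runs: int = 5) -> list[float]:
    """Run the whole test set n_runs times; return per-run accuracy.

    ask_model: callable(problem) -> answer string (wraps your API call).
    """
    per_run = []
    for _ in range(n_runs):
        correct = sum(grade(ask_model(q), a) for q, a in TEST_SET)
        per_run.append(correct / len(TEST_SET))
    return per_run

def compare(base_ask, tuned_ask, n_runs: int = 5) -> None:
    base = accuracy_over_runs(base_ask, n_runs)
    tuned = accuracy_over_runs(tuned_ask, n_runs)
    print(f"base:  mean={statistics.mean(base):.3f}  stdev={statistics.stdev(base):.3f}")
    print(f"tuned: mean={statistics.mean(tuned):.3f}  stdev={statistics.stdev(tuned):.3f}")
```

With enough runs you could also apply a significance test (e.g., a paired test per problem) instead of eyeballing the means, and pinning temperature low reduces the fluctuation itself.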
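For question 2, even though Wolfram|Alpha isn’t an LLM, it can be wired in through the chat API’s function calling so GPT delegates the actual computation. A hedged sketch, assuming the `openai` Python SDK (v1+) and Wolfram|Alpha’s Short Answers endpoint; the AppID and `ask` wrapper are placeholders of my own:

```python
import json

import requests
from openai import OpenAI

client = OpenAI()

# Wolfram|Alpha "Short Answers" API (requires your own AppID).
WOLFRAM_URL = "https://api.wolframalpha.com/v1/result"
WOLFRAM_APPID = "YOUR_APPID"  # placeholder

TOOLS = [{
    "type": "function",
    "function": {
        "name": "wolfram_query",
        "description": "Evaluate a math expression or query with Wolfram|Alpha.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

def wolfram_query(query: str) -> str:
    r = requests.get(WOLFRAM_URL, params={"appid": WOLFRAM_APPID, "i": query})
    return r.text

def ask(question: str) -> str:
    messages = [{"role": "user", "content": question}]
    resp = client.chat.completions.create(model="gpt-4", messages=messages, tools=TOOLS)
    msg = resp.choices[0].message
    if msg.tool_calls:  # the model chose to delegate the math
        messages.append(msg)
        for call in msg.tool_calls:
            args = json.loads(call.function.arguments)
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": wolfram_query(args["query"]),
            })
        # Second pass: let GPT phrase the tool result as the final answer.
        resp = client.chat.completions.create(model="gpt-4", messages=messages)
        msg = resp.choices[0].message
    return msg.content
```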
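For question 3, what seems more reliable than expecting the model to output images directly is having it emit plotting code and rendering that yourself. A minimal sketch of that pattern (the prompt and `answer_with_chart` are my own invention, and `exec()` on model output is unsafe outside a sandbox):

```python
import re

from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "Answer the user's math/statistics question. Then include a Python "
    "matplotlib snippet inside a ```python fence that saves a chart "
    "illustrating the answer to 'figure.png'."
)

def answer_with_chart(question: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": question},
        ],
    )
    text = resp.choices[0].message.content
    match = re.search(r"```python\n(.*?)```", text, re.DOTALL)
    if match:
        # WARNING: executing model-generated code; sandbox this in production.
        exec(match.group(1), {})
    return text
```

This way matplotlib does the precise rendering, and the model only has to get the numbers and the code right.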

Thank you so much in advance.

Best wishes,