How to optimize + what do you recommend?

Generally, the way to run an eval is to have your highest-quality model or a human evaluator write the desired answer for each of a series of inputs, so that it sets the bar for both instruction-following and answer quality.

Then you write a critical “judge” prompt for a highest-quality AI model, one able to decide whether an output is satisfactory when compared against the desired answer, or where it falls short. Then judge the judging: spot-check the judge’s verdicts yourself to confirm you agree with them.
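As a rough sketch of that judge step, something like the following could work. Everything here is an assumption for illustration: the `EvalCase` fields, the prompt wording, and the PASS/FAIL reply convention are made up, and `call_judge` is a hypothetical stand-in for whatever function sends text to your judge model and returns its reply.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class EvalCase:
    prompt: str      # input given to the model under test
    reference: str   # desired answer written by a human or top-tier model
    candidate: str   # output produced by the model under test


# Hypothetical judge prompt; tune the wording for your own task.
JUDGE_TEMPLATE = """You are a strict grader.
Question:
{prompt}

Reference answer:
{reference}

Candidate answer:
{candidate}

Reply with PASS or FAIL on the first line, then one sentence of reasoning."""


def judge_case(case: EvalCase, call_judge: Callable[[str], str]) -> bool:
    """Ask the judge model about one case; True means the judge said PASS."""
    reply = call_judge(
        JUDGE_TEMPLATE.format(
            prompt=case.prompt,
            reference=case.reference,
            candidate=case.candidate,
        )
    ).strip()
    first_line = reply.splitlines()[0].strip().upper() if reply else ""
    return first_line.startswith("PASS")


def pass_rate(cases: Sequence[EvalCase], call_judge: Callable[[str], str]) -> float:
    """Fraction of cases the judge marked PASS."""
    if not cases:
        raise ValueError("need at least one eval case")
    return sum(judge_case(c, call_judge) for c in cases) / len(cases)


def judge_agreement(judge_verdicts: Sequence[bool], human_verdicts: Sequence[bool]) -> float:
    """'Judge the judging': fraction of cases where judge and human agree."""
    if not judge_verdicts or len(judge_verdicts) != len(human_verdicts):
        raise ValueError("verdict lists must be non-empty and the same length")
    matches = sum(j == h for j, h in zip(judge_verdicts, human_verdicts))
    return matches / len(judge_verdicts)
```

If the agreement number on a hand-labeled sample is low, that usually points at the judge prompt rather than the model being tested, so it is worth checking before trusting any pass rate.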

That’s how you can automate the “feeling” that the model is not performing well, instead of relying entirely on a human “which is better” comparison (which takes reading comprehension and knowledge of what is actually expected of the AI within the specialization).

Your “messy data” could be something easily worked through linearly, or it could be something that requires taking in the whole input at once and high-quality language understanding, along with domain knowledge. The former, fancy repeating, is where a less expensive model can work for you.
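One way to turn that into a decision is to run each candidate model through the same eval and take the cheapest one that clears your quality bar. The sketch below is only an illustration: the function name, the model names, and the numbers in the usage example are hypothetical placeholders, not measurements.

```python
from typing import Dict, Optional, Tuple


def cheapest_passing(
    results: Dict[str, Tuple[float, float]],
    min_pass_rate: float,
) -> Optional[str]:
    """Pick the lowest-cost model whose eval pass rate meets the bar.

    results maps a model name to (pass_rate, cost), where cost is in
    whatever unit you track (assumed here: dollars per 1,000 requests).
    Returns None if no model is good enough.
    """
    good_enough = [
        (cost, name)
        for name, (rate, cost) in results.items()
        if rate >= min_pass_rate
    ]
    return min(good_enough)[1] if good_enough else None


# Placeholder numbers purely for illustration.
example_results = {
    "small-model": (0.93, 1.0),
    "large-model": (0.98, 10.0),
}
choice = cheapest_passing(example_results, min_pass_rate=0.90)
```

With those placeholder numbers the small model clears a 0.90 bar, so it would be chosen. Raise the bar to 0.95 and only the large one qualifies.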