I would like to evaluate the performance of a fine-tuned LLM I created that summarizes news text. To do so, I need a ground truth set, which I would like to structure as prompt-response pairs.
Each prompt would be some example news text, and the response would be my fine-tuned LLM's summary of it. I want to do this at scale, and was wondering if there were any helpful techniques or industry-leading methods out there for quickly generating 500+ such prompt-response pairs? Many thanks!
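For context, here's roughly the shape of pipeline I'm imagining: iterate over a list of news articles, wrap each in a summarization prompt, call the model, and write the pairs out as JSONL. The `generate_summary` function below is just a placeholder (it returns the first sentence) standing in for the actual call to my fine-tuned model:

```python
import json

def generate_summary(article: str) -> str:
    # Placeholder: swap this for a call to the fine-tuned model
    # (e.g., an inference API call or a local pipeline).
    # Here it just returns the article's first sentence.
    return article.split(".")[0].strip() + "."

def build_pairs(articles, out_path=None):
    """Build prompt-response pairs from a list of news articles.

    Optionally writes them to a JSONL file, one pair per line.
    """
    pairs = []
    for article in articles:
        pairs.append({
            "prompt": f"Summarize the following news article:\n\n{article}",
            "response": generate_summary(article),
        })
    if out_path:
        with open(out_path, "w", encoding="utf-8") as f:
            for pair in pairs:
                f.write(json.dumps(pair, ensure_ascii=False) + "\n")
    return pairs

# Small worked example with two toy articles.
articles = [
    "City council approves new transit plan. The vote passed 7-2 on Tuesday.",
    "Local team wins championship. Fans celebrated downtown into the night.",
]
pairs = build_pairs(articles)
```

So the open question is really about sourcing the 500+ input articles and scaling the generation step, not the file format itself.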