How to efficiently create ground truth sets using GPT-4?

I would like to evaluate the performance of a fine-tuned LLM I created that summarizes news text. To do so, I need a ground truth set, which I would like to structure as prompt-response pairs.

The prompt would be some example news text, and the response would be my fine-tuned LLM’s response to it. I want to do this at scale, and was wondering whether there are any helpful techniques or industry-leading methods for quickly generating 500+ such prompt-response pairs. Much thanks!


There’s this repository by OpenAI you can use: GitHub - openai/evals: Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.

I’ve contributed to it a couple of times, so feel free to ask any questions! Also, OpenAI recently released Evals in the dashboard with a user interface, so that route might be easier:

https://platform.openai.com/evaluations
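For the generation side specifically, a common pattern is to have GPT-4 write the reference summaries and save each article/summary pair as JSONL. Here is a minimal sketch, assuming the official `openai` Python package, an `OPENAI_API_KEY` in the environment, and a list of article strings you load yourself; the model name, system prompt, and batch size are all illustrative choices you should tune:

```python
# Hypothetical sketch: build a ground-truth set of prompt-response
# pairs by asking GPT-4 for a reference summary of each article.
import json
import time

# Illustrative instruction; adjust to match your summarization task.
SYSTEM_PROMPT = "Summarize the following news article in 2-3 sentences."


def to_pair(article_text, summary):
    """Package one article and its reference summary as a pair."""
    return {"prompt": article_text, "response": summary}


def batched(items, size):
    """Yield fixed-size chunks so you can pause between batches."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


def build_ground_truth(articles, client, model="gpt-4", batch_size=20):
    """Call the Chat Completions API once per article, batching
    with a short pause as crude rate limiting."""
    pairs = []
    for batch in batched(articles, batch_size):
        for text in batch:
            resp = client.chat.completions.create(
                model=model,
                messages=[
                    {"role": "system", "content": SYSTEM_PROMPT},
                    {"role": "user", "content": text},
                ],
            )
            pairs.append(to_pair(text, resp.choices[0].message.content))
        time.sleep(1)  # pause between batches to stay under rate limits
    return pairs


def save_jsonl(pairs, path):
    """Write one JSON object per line, the format evals-style tools expect."""
    with open(path, "w") as f:
        for p in pairs:
            f.write(json.dumps(p) + "\n")
```

Usage would look like `pairs = build_ground_truth(my_articles, OpenAI())` followed by `save_jsonl(pairs, "ground_truth.jsonl")`, where `OpenAI` comes from `from openai import OpenAI`. Note that for a true ground truth set you generally want the reference summaries to come from a stronger model (or humans), not from the fine-tuned model you are evaluating.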
