I would like to evaluate the performance of a fine-tuned LLM I created that summarizes news text. To do so, I need a ground truth set, which I would like to structure as prompt-response pairs.
Each prompt would be some example news text, and the response would be my fine-tuned LLM's summary of it. I want to do this at scale, and was wondering if there were any helpful techniques or industry-leading methods out there for quickly generating 500+ such prompt-response pairs? Many thanks!
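For context, here's roughly the shape of pipeline I'm imagining: iterate over a list of news articles, wrap each in a summarization prompt, call the model, and write the pairs out as JSONL. The `generate_summary` function below is just a placeholder (it returns the first sentence) standing in for the actual call to my fine-tuned model:

```python
import json

def generate_summary(article: str) -> str:
    # Placeholder: swap this for a call to the fine-tuned model
    # (e.g., an inference API call or a local pipeline).
    # Here it just returns the article's first sentence.
    return article.split(".")[0].strip() + "."

def build_pairs(articles, out_path=None):
    """Build prompt-response pairs from a list of news articles.

    Optionally writes them to a JSONL file, one pair per line.
    """
    pairs = []
    for article in articles:
        pairs.append({
            "prompt": f"Summarize the following news article:\n\n{article}",
            "response": generate_summary(article),
        })
    if out_path:
        with open(out_path, "w", encoding="utf-8") as f:
            for pair in pairs:
                f.write(json.dumps(pair, ensure_ascii=False) + "\n")
    return pairs

# Small worked example with two toy articles.
articles = [
    "City council approves new transit plan. The vote passed 7-2 on Tuesday.",
    "Local team wins championship. Fans celebrated downtown into the night.",
]
pairs = build_pairs(articles)
```

So the open question is really about sourcing the 500+ input articles and scaling the generation step, not the file format itself.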