OpenAI Evals analogous to Fine Tuning?

Hey all, if I understand the eval process on GitHub correctly, is this process conceptually the same as fine-tuning your own model? I.e., providing a prompt and a response so that the model “learns” what you are trying to do.

Yes, they are conceptually equivalent. If accepted, the data is used to “fine-tune” or “align” the various GPT models offered by OpenAI.


Any chance anyone can direct me to where I can find info on how to actually try it for the first time? I think I’ve installed everything I need, and yet I still can’t find anything that tells me how to actually use it. – DAK

There’s a guide here:

Basically you need to create a file that defines the eval, and then build a separate JSONL file of slightly varied prompt & completion pairs, such as:

```json
{"input": [{"role": "system", "content": "This is an exchange between Merlin, Arthur and Lancelot. Merlin outputs short tests. First output is: '1+2', second is: '3+4', … Arthur outputs the sum of the last two digits that he has seen. Lancelot outputs increasing integers, starting from 0. The sequence in which the actors act: MAMALAMA. Output that exchange using the following syntax (including the trailing comma): Actor:,…"}], "ideal": "Merlin:1+2,Arthur:3,Merlin:3+4,Arthur:7,Lancelot:0,Arthur:7,Merlin:5+6,Arthur:11,"}
```
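In case it helps anyone getting started, here’s a minimal Python sketch of how you might generate a samples file like that programmatically. The file name and the arithmetic prompts are placeholders I made up; the point is that json.dumps guarantees the straight-quoted, valid JSON the framework expects (curly quotes like the ones the forum renders break parsing):

```python
import json

# Each sample is one JSON object per line: a list of chat messages
# under "input" and the expected answer under "ideal".
# These arithmetic samples are made-up placeholders.
samples = [
    {
        "input": [
            {"role": "system", "content": "You are a careful arithmetic assistant."},
            {"role": "user", "content": "What is 3 + 4?"},
        ],
        "ideal": "7",
    },
    {
        "input": [
            {"role": "system", "content": "You are a careful arithmetic assistant."},
            {"role": "user", "content": "What is 5 + 6?"},
        ],
        "ideal": "11",
    },
]

# Write one JSON object per line -- the JSONL format the eval registry points at.
with open("samples.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")
```

The eval itself then gets registered with a small YAML entry (under evals/registry/evals/ in the repo) whose samples_jsonl argument points at this file, and if I remember the CLI correctly you run it with something like `oaieval gpt-3.5-turbo your-eval-name`.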

You can also build a separate set of samples for few-shot prompting; see the sketch below.
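My understanding from the build-eval docs in the repo is that few-shot examples live in their own JSONL file, referenced from the eval’s YAML entry (via few_shot_jsonl and num_few_shot arguments, if I recall the names correctly). A sketch along the same lines, again with a placeholder file name and contents:

```python
import json

# Few-shot examples use the same shape as regular samples; the eval
# harness prepends a number of them to each test prompt.
few_shot = [
    {"input": [{"role": "user", "content": "What is 1 + 2?"}], "ideal": "3"},
    {"input": [{"role": "user", "content": "What is 2 + 2?"}], "ideal": "4"},
]

with open("few_shot.jsonl", "w") as f:
    for example in few_shot:
        f.write(json.dumps(example) + "\n")
```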

I think another purpose of the evals effort is similar to that of people who try to break Linux in order to improve it: the evals I’ve read seem to be probing the system for weaknesses.