OpenAI Evals analogous to Fine Tuning?

Hey all, if I understand the eval process on GitHub correctly, is it conceptually the same as fine-tuning your own model? I.e., providing a prompt and a response so that the model will “learn” what you are trying to do.

Yes, they are conceptually equivalent. If accepted, the data is used to “fine-tune” or “align” the various GPT models offered by OpenAI.


Any chance anyone can direct me to where I can find info on how to actually try it for the first time? I think I’ve installed everything I need, and yet I still can’t find anything that explains how to actually do anything with it. – DAK

There’s a guide here:

Basically you need to create a YAML file that defines the eval, and then build a separate JSONL file of samples, each pairing an “input” prompt with an “ideal” completion, such as:

{"input": [{"role": "system", "content": "This is an exchange between Merlin, Arthur and Lancelot. Merlin outputs short tests. First output is: '1+2', second is: '3+4', … Arthur outputs the sum of last two digits that he had seen. Lancelot outputs an increasing integers number, starting from 0. The sequence in which the actors act: MAMALAMA. Output that exchange using the following syntax (including the trailing comma): Actor:,…"}], "ideal": "Merlin:1+2,Arthur:3,Merlin:3+4,Arthur:7,Lancelot:0,Arthur:7,Merlin:5+6,Arthur:11,"}
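Note that each JSONL line has to be a standalone, valid JSON object (straight quotes, no trailing newline inside the record). A quick sketch for building and sanity-checking such a file — the sample content and file name here are just placeholders, not from any particular eval:

```python
import json

# One eval sample: a chat-style "input" and the exact "ideal" completion.
sample = {
    "input": [
        {"role": "system", "content": "List the first three primes, comma-separated."}
    ],
    "ideal": "2,3,5",
}

# Write one JSON object per line -- that's all the JSONL format is.
with open("samples.jsonl", "w") as f:
    f.write(json.dumps(sample) + "\n")

# Sanity check: every line must parse back as valid JSON on its own.
with open("samples.jsonl") as f:
    for line in f:
        record = json.loads(line)
        assert "input" in record and "ideal" in record
```

If a sample fails the parse step, the usual culprit is curly quotes pasted from a browser or editor.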

You can also make a couple of extra samples to use for few-shot prompting.
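One way to structure that few-shotting — a sketch only, since how (or whether) a given eval class consumes prior turns is an assumption here — is to put worked example exchanges into the “input” message list ahead of the real question:

```python
import json

# A sample whose "input" carries few-shot demonstrations as earlier chat turns.
few_shot_sample = {
    "input": [
        {"role": "system", "content": "Answer with the sum only."},
        # Few-shot demonstrations: user asks, assistant shows the expected style.
        {"role": "user", "content": "1+2"},
        {"role": "assistant", "content": "3"},
        {"role": "user", "content": "3+4"},
        {"role": "assistant", "content": "7"},
        # The actual test prompt the model is graded on.
        {"role": "user", "content": "5+6"},
    ],
    "ideal": "11",
}

# Serialize to a single JSONL line, ready to append to the samples file.
line = json.dumps(few_shot_sample)
```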

I think another purpose of the evals effort is comparable to that of people who try to break Linux in order to improve it. The evals I’ve read seem to be probing the system for weaknesses.