I've been playing around with creating my own fine-tuned models, and I was wondering what the best way is to compare performance between models. I've added validation data, and that's great, but when I pass the models to our testers they want something more automated to run. I've been looking at OpenAI Evals, but I can't get it to run on a fine-tuned model. Is there a way to do that?
Hi!
When starting the eval, we have to specify the model that we want to evaluate:
oaieval gpt-3.5-turbo <eval_name>
Have you tried putting the name of your fine-tuned model here?
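If you're not sure of the exact string to use, the identifier is the `fine_tuned_model` value of a finished fine-tuning job. A quick way to list yours, sketched with the current `openai` Python client (v1.x):

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# Print the model identifiers of finished fine-tuning jobs; these are the
# names you would pass to oaieval in place of "gpt-3.5-turbo".
for job in client.fine_tuning.jobs.list(limit=20):
    if job.fine_tuned_model:  # stays None until the job has succeeded
        print(job.fine_tuned_model)
```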
It seems that the current OpenAI Evals Python module cannot correctly handle fine-tuned model names.
File "evals\evals\record.py", line 343, in __init__
with bf.BlobFile(log_path, "wb") as f:
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "evals\venv\Lib\site-packages\blobfile\_ops.py", line 358, in BlobFile
return default_context.BlobFile(
^^^^^^^^^^^^^^^^^^^^^^^^^
File "evals\venv\Lib\site-packages\blobfile\_context.py", line 1014, in BlobFile
f = ProxyFile(
^^^^^^^^^^^
File "evals\venv\Lib\site-packages\blobfile\_context.py", line 1420, in __init__
super().__init__(local_path, mode=mode)
OSError: [Errno 22] Invalid argument: '/tmp/evallogs/***************ft:gpt-3.5-turbo-1106:organization:my-experiment:---------------.jsonl'
By modifying the Evals module it should be possible to evaluate fine-tuned models; in its current official state, however, the module cannot do so.
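For anyone who does want to patch it locally: the failure is only in how the default log filename is built from the model name, so one approach is to strip the characters that filesystems reject before that path is constructed. A rough sketch, assuming you wire it in wherever your version of the code builds the log path; `sanitize_model_name` is a hypothetical helper and does not exist in the evals repo:

```python
import re

def sanitize_model_name(model_name: str) -> str:
    """Hypothetical helper: replace characters that common filesystems reject
    (the ':' in fine-tuned model names in particular) before the name is used
    in a log path."""
    return re.sub(r'[:<>"/\\|?*]', "_", model_name)

# Example: the colons are the only offending characters here.
print(sanitize_model_name("ft:gpt-3.5-turbo-1106:organization:my-experiment:abc123"))
# -> ft_gpt-3.5-turbo-1106_organization_my-experiment_abc123
```

Alternatively, if your copy of oaieval supports the --record_path option, pointing it at a colon-free filename should avoid the problem without touching the code at all, though I have not verified that across versions.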
Thanks for looking into this @dignity_for_all!
Is this a different error message than when putting in a wrong model name, like GPT-5 for example?
Yes, it is a different error message.
Please let me correct my previous statement:
OSError: [Errno 22] Invalid argument:
This error simply means that characters like the colons in the fine-tuned model name "ft:gpt-3.5-turbo-1106:…" are invalid in a filename.
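To make the cause concrete, here is a minimal repro, assuming a Windows filesystem where ':' is a reserved character in filenames (the run_... prefix simply stands in for the run ID that evals normally prepends):

```python
# The default evals log path embeds the model name verbatim, so on Windows
# the colons in a fine-tuned model name make open() fail before any logging happens.
bad_path = "/tmp/evallogs/run_ft:gpt-3.5-turbo-1106:organization:my-experiment:abc123.jsonl"
try:
    with open(bad_path, "wb"):
        pass
except OSError as err:
    print(err)  # e.g. [Errno 22] Invalid argument: '...'
```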
This error does not occur if the logs are not saved.
If you run evals without saving logs, you can perform the evaluation without any problems.
The specific command is as follows:
oaieval ft:gpt-3.5-turbo-1106:organization:************:---------- <eval_name> --dry-run
This command allows you to perform the evaluation without outputting logs.
[2024-07-01 02:28:33,374] [oaieval.py:275] Found --/-- sampling events with usage data
[2024-07-01 02:28:33,375] [oaieval.py:283] Token usage from -- sampling events:
completion_tokens: ----
prompt_tokens: ----
total_tokens: ----
[2024-07-01 02:28:33,376] [record.py:263] Final report: {'accuracy': ****************, 'boostrap_std': ****************7, 'usage_completion_tokens': -----, 'usage_prompt_tokens': ----, 'usage_total_tokens': ----}. Not writing anywhere.
[2024-07-01 02:28:33,376] [oaieval.py:233] Final report:
[2024-07-01 02:28:33,377] [oaieval.py:235] accuracy: ****************
[2024-07-01 02:28:33,377] [oaieval.py:235] boostrap_std: ****************
[2024-07-01 02:28:33,377] [oaieval.py:235] usage_completion_tokens: ----
[2024-07-01 02:28:33,377] [oaieval.py:235] usage_prompt_tokens: ----
[2024-07-01 02:28:33,378] [oaieval.py:235] usage_total_tokens: ----
As long as you are evaluating a fine-tuned model of gpt-3.5-turbo, no changes to the repository are needed.
I apologize for my earlier incorrect statement; it was based on my inaccurate interpretation of the error.
Hopefully that is helpful to some.