Using Evals with a fine-tuned model

I’ve been playing around with creating my own fine-tuned models, and I was wondering what the best way is to compare performance between models. I’ve added validation data, and that’s great, but when I pass a model to our testers they want something automated they can run repeatedly. I’ve been looking at the OpenAI Evals project, but I can’t get it to run on a fine-tuned model. Is there a way to do that?


When starting an eval, we have to specify the model that we want to evaluate:

oaieval gpt-3.5-turbo <eval_name>

Have you tried putting the name of your fine-tuned model here?


It seems that the current OpenAI Evals Python module cannot correctly handle fine-tuned model names.

File "evals\evals\", line 343, in __init__
  with bf.BlobFile(log_path, "wb") as f:
File "evals\venv\Lib\site-packages\", line 358, in BlobFile
  return default_context.BlobFile(
File "evals\venv\Lib\site-packages\", line 1014, in BlobFile
  f = ProxyFile(
File "evals\venv\Lib\site-packages\", line 1420, in __init__
  super().__init__(local_path, mode=mode)
OSError: [Errno 22] Invalid argument: '/tmp/evallogs/***************ft:gpt-3.5-turbo-1106:organization:my-experiment:

By making modifications to the Evals module, it should be possible to evaluate fine-tuned models. However, in its current official state, the Evals module cannot evaluate them.
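For anyone who wants to patch this locally, the kind of modification meant here is sanitizing the model name before it is used as a log filename. A minimal sketch, assuming nothing about the Evals internals beyond the traceback above (the helper name and character set are my own, not part of the Evals codebase):

```python
import re

# Characters rejected in Windows filenames; ":" is the one a fine-tuned
# model name like "ft:gpt-3.5-turbo-1106:org:exp" introduces.
_INVALID_FILENAME_CHARS = re.compile(r'[<>:"/\\|?*]')

def sanitize_log_name(model_name: str) -> str:
    """Replace filename-hostile characters in a model name with underscores."""
    return _INVALID_FILENAME_CHARS.sub("_", model_name)

print(sanitize_log_name("ft:gpt-3.5-turbo-1106:organization:my-experiment:abc"))
# ft_gpt-3.5-turbo-1106_organization_my-experiment_abc
```

In the Evals code, something like this would be applied to `log_path` before the `bf.BlobFile(log_path, "wb")` call shown in the traceback.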


Thanks for looking into this, @dignity_for_all!

Is this a different error message than the one you get when entering a wrong model name, like GPT-5 for example?

It is a different error message.

Please let me correct my previous statement:

OSError: [Errno 22] Invalid argument:

This error simply means that characters like the colons in the model name "ft:gpt-3.5-turbo-1106:…" are invalid in a filename.
The error does not occur if the logs are not saved.

If you run evals without saving logs, you can perform the evaluation without any problems.

The specific command is as follows:

oaieval ft:gpt-3.5-turbo-1106:organization:************:---------- eval_name --dry-run

This command allows you to perform the evaluation without outputting logs.

[2024-07-01 02:28:33,374] [] Found --/-- sampling events with usage data
[2024-07-01 02:28:33,375] [] Token usage from -- sampling events:
completion_tokens: ----
prompt_tokens: ----
total_tokens: ----
[2024-07-01 02:28:33,376] [] Final report: {'accuracy': ****************, 'boostrap_std': ****************7, 'usage_completion_tokens': -----, 'usage_prompt_tokens': ----, 'usage_total_tokens': ----}. Not writing anywhere.
[2024-07-01 02:28:33,376] [] Final report:
[2024-07-01 02:28:33,377] [] accuracy: ****************
[2024-07-01 02:28:33,377] [] boostrap_std: ****************
[2024-07-01 02:28:33,377] [] usage_completion_tokens: ----
[2024-07-01 02:28:33,377] [] usage_prompt_tokens: ----
[2024-07-01 02:28:33,378] [] usage_total_tokens: ----

As long as you are evaluating a fine-tuned model of gpt-3.5-turbo, no changes to the repository are needed.

I apologize for my incorrect statement. It was based on my inaccurate interpretation.

Hopefully that is helpful to some :slightly_smiling_face:
