Evaluating a finetuned model


I finetuned my first ever model, and I need to test it against its prepared test dataset, and I have zero idea how. Could someone point me in the right direction?


How do you use the API in general?

In your interface, you should have the list of your finetuned models. Each model will have a unique name. Just copy-paste it and tell the API to use it. As an example, in Python, you would call it as

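response = client.chat.completions.create(
    model="YOUR_FINE_TUNED_MODEL_NAME",  # placeholder: paste your fine-tuned model's name here
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ])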

As you can see, it is enough to pass the model's name as the ‘model’ argument of the request.

By the way, that code was taken from the link at the end. Be sure to read it.

Sorry if I was not clear. I meant scores like:

Test Set            Dev Set
weighted Acc (%)    weighted Acc (%)

Is there a way to obtain scores like those? AFAIK this used to be supported by wandb, but it no longer is.

Oh, I get what you’re asking.

If you want to do it with new data, you have to devise a test: a way to decide whether an output of your model is right or wrong (usually by comparing it with what the output should be in known cases).
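To make that concrete, here is a minimal sketch of such a test for a classification-style task. It assumes your test set can be turned into (prompt, expected label) pairs and that an exact match after trimming and lowercasing counts as correct. The names `evaluate` and `predict` are mine, not part of any library, and "balanced accuracy" (the mean of per-class recall) is only one common reading of "weighted accuracy".

```python
from collections import defaultdict
from typing import Callable, Iterable, Tuple


def evaluate(examples: Iterable[Tuple[str, str]],
             predict: Callable[[str], str]) -> dict:
    """Score `predict` on (prompt, expected_label) pairs.

    Returns plain accuracy and class-balanced accuracy (mean per-class
    recall), both as percentages.
    """
    total = 0
    correct = 0
    per_class_total = defaultdict(int)
    per_class_correct = defaultdict(int)

    for prompt, expected in examples:
        # Trim and lowercase both sides so trivial formatting
        # differences are not counted as errors.
        predicted = predict(prompt).strip().lower()
        expected = expected.strip().lower()
        total += 1
        per_class_total[expected] += 1
        if predicted == expected:
            correct += 1
            per_class_correct[expected] += 1

    if total == 0:
        raise ValueError("no examples to evaluate")

    # Recall for each class that appears in the expected labels.
    recalls = [per_class_correct[c] / n for c, n in per_class_total.items()]
    return {
        "accuracy": 100.0 * correct / total,
        "balanced_accuracy": 100.0 * sum(recalls) / len(recalls),
    }
```

In practice, `predict` would be a small function that sends the prompt to your finetuned model with `client.chat.completions.create` and returns `response.choices[0].message.content`. Running `evaluate` once on the test set and once on the dev set would give you the two columns you are after.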

If I remember correctly, while your model is finetuning (and also afterwards) you can follow its performance on the training set from the same page of the OpenAI website. In the fine-tuning interface, when you select a model you can track its training loss. If you also uploaded a test/validation set, you can see the loss on that set as well. Unfortunately, I don't know how OpenAI defines its loss function, so I can't help you reproduce it.
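If you manage to get those per-step losses out as a CSV, a few lines are enough to pull out the final values. This is only a sketch: the function name `last_losses` and the column names `train_loss` and `valid_loss` are assumptions of mine, so adjust them to whatever your export actually contains.

```python
import csv
import io


def last_losses(metrics_csv: str) -> dict:
    """Return the last non-empty train and validation loss in a metrics CSV.

    Assumes (hypothetically) columns named `train_loss` and `valid_loss`.
    """
    last = {"train_loss": None, "valid_loss": None}
    for row in csv.DictReader(io.StringIO(metrics_csv)):
        for key in last:
            value = (row.get(key) or "").strip()
            if value:  # validation loss may be logged only on some steps
                last[key] = float(value)
    return last
```

This only reads back the numbers that were logged. It does not recompute the loss, so it cannot score a dataset that was not part of the finetuning job.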

If you mean using the fine-tuning interface to upload a new dataset and get the loss on that dataset, that would be an interesting feature, but I'm afraid it is not possible with the interface we have today.