To get ready for possibly using RFT to improve my application, I’m going back and really working on my evals, and possibly SFT, before moving on to the new thing.
Kind of confused by the new Evals UI, and maybe it’s super obvious… but if I just did this:
Where I took a couple hundred samples that were stored through the Chat Completions API and ran them through the default auto-grader, which says it’s using o3-mini to score everything…
Does that mean all 180 samples were re-run using each of the different models instead of the gpt-4.1-2025-04-14 that was used to generate and store the completions?
If so, can I take it from this that switching to gpt-4.1-mini would actually get me 100% accuracy, compared to the 98% I received with the model each completion originally used? It even looks like I could fine-tune nano, since it’s almost close enough, right?
When creating a new eval, all I could do was select “new models” to test against, and it ran all of them… even though the stored completions had already been generated with what is basically gpt-4.1 (at least for now).
Also, the auto-grader is super detailed, and these are much higher scores than I would previously get when I wrote my own bespoke grading evals. But I wonder: is the comprehensive generality of this default grader better than the very specific criteria I check against for my own domain’s use cases?
Lastly, when setting up graders, if I have perhaps 5 or 6 different things I want to check for, do I wrap them all together into one master grader, or is it best practice to split them out across several simpler graders?
Just a note, without actually going into your question: quick evals is a button for quick bills.
Check your data controls alongside the “complimentary tokens” setting to see what you might be opted into that hides the huge bill you’d get after a few free small experiments.
Food for thought: how can one expect to develop an eval entirely within the service when both the subject model and the judge model can be silently altered by unpublished changes?
You are clever. You explored all the resources that OpenAI has offered. You are still at a loss.
There is a lack of documentation.
This is a community of fellow users. The only way to answer better than you could yourself is through a personal expenditure of time: doing a thorough exploration of the product, observing the results, and inferring the purposes and actions of the evals judging (which, in the UI, is a massive leap from the API functions, themselves also scant), then composing new information fit for consumption.
So a “how to use” topic like the one you have written needs not just a personal answer, but documentation that answers every concern, without astonishment at what the product offers.
In this case: authored by an employee, not a freelancer embracing “free” work for the benefit of a corporation.
One will notice that this reply is simply an explanation of why I only offer a helpful warning: one wrong button press in evals can cost you, and what it delivers may be detection of changes in the judge model rather than the target model. It is not material for the “hateful, abusive” flag button.
It is also why one is unlikely to see many community responses in this topic.
Dude. We get it. You complain about the documentation all the time. If I have to hear one more time about how key information is hidden behind clicks…I’m sure it won’t be the last.
Stop making this forum about you, and if you can’t stay on topic please don’t fill my inbox with irrelevant data.
Does that mean all 180 samples were re-run using each of the different models instead of the gpt-4.1-2025-04-14 that was used to generate and store the completions?
Correct!
(q1) If so, can I take it from this that switching to gpt-4.1-mini would actually get me 100% accuracy, compared to the 98% I received with the model each completion originally used? It even looks like I could fine-tune nano, since it’s almost close enough, right?
(q2) Also, the auto-grader is super detailed, and these are much higher scores than I would previously get when I wrote my own bespoke grading evals. But I wonder: is the comprehensive generality of this default grader better than the very specific criteria I check against for my own domain’s use cases?
Basically yes; however, there is a caveat (it’s part of a question you asked later on, so I’m grouping them together).
The “auto-grader” is intentionally generic so it can apply to a wide range of completions, which makes it a good starting point for iteration. We would suggest you use it, add graders for the specific use cases in your domain, and iterate on them until you have a set you’re comfortable with that shows a good mix of passes and fails. That set will then go directly into your RFT job (evals and RFT share the same graders); there’s a rough sketch of that reuse just below.
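For reference, here’s what that shared-grader setup can look like with the Python SDK. This is a sketch only: the grader field names, the `{{sample.output_text}}` template, the `data_source_config` shape, and the RFT `method` payload are my best reading of the current Evals/Graders/RFT guides, and the model snapshot and file ID are placeholders, so verify everything against the docs before running it.

```python
from openai import OpenAI

client = OpenAI()

# A generic model-based grader, similar in spirit to the default auto-grader.
# Field names and the {{sample.output_text}} template are assumptions from the
# Graders guide; verify them before relying on this.
quality_grader = {
    "type": "score_model",
    "name": "general_quality",
    "model": "o3-mini",
    "input": [
        {"role": "system",
         "content": "Score the assistant reply from 1 to 5 for accuracy and helpfulness."},
        {"role": "user", "content": "Reply to grade:\n{{sample.output_text}}"},
    ],
    "range": [1, 5],
    "pass_threshold": 4,
}

# The same grader definition can be used as an eval testing criterion...
evaluation = client.evals.create(
    name="support-replies",
    data_source_config={
        "type": "custom",  # placeholder item schema; swap in your own fields
        "item_schema": {"type": "object", "properties": {"input": {"type": "string"}}},
        "include_sample_schema": True,
    },
    testing_criteria=[quality_grader],
)

# ...and, once you trust it, as the grader for a reinforcement fine-tuning job.
# The `method` payload shape and the base model below are assumptions; check the RFT guide.
rft_job = client.fine_tuning.jobs.create(
    model="o4-mini-2025-04-16",   # placeholder: a model that supports RFT
    training_file="file-abc123",  # placeholder training file ID
    method={"type": "reinforcement", "reinforcement": {"grader": quality_grader}},
)
```

The point being: you iterate on the grader once in evals, then hand the same definition to the RFT job.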
Lastly, when setting up graders, if I have perhaps 5 or 6 different things I want to check for, do I wrap them all together into one master grader, or is it best practice to split them out across several simpler graders?
This is a bit of a preference call for you to make! If you want fairly granular passes or fails for different parts of your prompt, like:
Grader 1: Check that it never says the word XXXX (string check contains grader)
Grader 2: Check that the response follows your style guidelines for how to talk to customers (score model grader, 1-5 scale)
Then it makes sense to split them out! A sketch of what that split could look like is just below.
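As a concrete illustration of that split, something like the following. Again, this is a sketch: the grader field names, the “like” substring-style operation, and the `{{sample.output_text}}` template are assumptions from the Graders docs.

```python
# Grader 1: string check that flags any response containing the banned word.
# Note: a match here should be read as a *fail* for this criterion; the exact
# semantics of "like" (substring-style match) is an assumption to verify.
banned_word_grader = {
    "type": "string_check",
    "name": "never_says_xxxx",
    "input": "{{sample.output_text}}",
    "reference": "XXXX",
    "operation": "like",
}

# Grader 2: model-scored check against your customer style guidelines, 1-5 scale.
style_grader = {
    "type": "score_model",
    "name": "customer_style",
    "model": "o3-mini",
    "input": [
        {"role": "system",
         "content": "Score the reply from 1 to 5 against our customer style guidelines."},
        {"role": "user", "content": "Reply to grade:\n{{sample.output_text}}"},
    ],
    "range": [1, 5],
    "pass_threshold": 4,
}

# Both go into the same eval as separate testing criteria, so each criterion
# reports its own pass/fail instead of being averaged into one number.
testing_criteria = [banned_word_grader, style_grader]
```

The main upside of splitting is that a regression in one criterion doesn’t get hidden by strong results on the others.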
If you’re interested in RFT or any fine-tuning, I would suggest you first get a really good eval, with a good selection of examples that “pass” and “fail”, and then kick off a fine-tuning job.
Once you get a successful model out of the fine-tuning, you can just add it to your existing eval and (hopefully) see the number go up!
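If it helps, here’s a minimal sketch of that loop with the Python SDK, assuming a supervised fine-tune of gpt-4.1-nano on an already-uploaded JSONL training file (the model snapshot and file ID are placeholders):

```python
from openai import OpenAI

client = OpenAI()

# Kick off a fine-tuning job on your prepared, chat-formatted training examples.
job = client.fine_tuning.jobs.create(
    model="gpt-4.1-nano-2025-04-14",  # placeholder base model; pick the one you're targeting
    training_file="file-abc123",      # placeholder ID of an uploaded JSONL training file
)

# Check on the job later (in practice, poll with a sleep/backoff loop or use webhooks).
job = client.fine_tuning.jobs.retrieve(job.id)
if job.status == "succeeded":
    print("Fine-tuned model:", job.fine_tuned_model)
    # Add this model name as a new run on your existing eval (via the UI or the
    # evals runs API) and compare its scores against the original gpt-4.1 runs.
```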