Evaluations Beta custom eval prompt

I’m trying to evaluate my stored prompts with the new Evaluations feature. I want to use custom evaluations via a prompt, but most of the time all evaluations fail. Is there a way to see the evaluation response so I can check what’s going on? I suspect the issue is that the eval prompt isn’t outputting the exact label.

Also, the docs state that you can use {{}} for placeholders, but which placeholders are available? (input & output?)

The current docs about this are pretty limited…

Thanks a ton :muscle:t3:

It seems that the current state of the Evaluations feature is mega-beta. I can’t even create one (I cannot add testing criteria).

Hi @fede1 and @Konsti, thanks for trying evals out early!
We released a ton of bug fixes and features during November. Would love it if you tried it again; both of those bugs should be fixed.

Top, thank you for the update! There is one feature that would be pretty amazing:
After running the Evaluations, I would like to create a new dataset containing only the completions that passed the evals.

The use case is the following:
We are currently storing all our completions, but not all of them are good quality. We would like to use the Evaluations feature as a first step to select the best ones and use only those for fine-tuning.

The only feature missing to make this workflow work entirely within the OpenAI playground is the ability to generate a new dataset from the completions that passed the evals. Even better would be the ability to add custom metadata tags to those completions.
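
To make it concrete, here is roughly what I have in mind for that filter step, done client-side for now. This is just a sketch: it assumes completions can be exported as JSONL with some pass/fail flag and the original chat messages, and the passed and messages field names are made up for illustration, so they would need to match whatever the export actually contains.

import json

def build_finetune_file(completions_path, output_path):
    # Keep only completions that passed the evals and write them in
    # chat fine-tuning format: one {"messages": [...]} object per line.
    with open(completions_path, "r", encoding="utf-8") as src, \
         open(output_path, "w", encoding="utf-8") as dst:
        for line in src:
            line = line.strip()
            if not line:
                continue
            row = json.loads(line)
            # 'passed' and 'messages' are hypothetical field names.
            if row.get("passed") and "messages" in row:
                dst.write(json.dumps({"messages": row["messages"]}, ensure_ascii=False) + "\n")

if __name__ == "__main__":
    build_finetune_file("completions.jsonl", "finetune.jsonl")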

What do you think @defjosiah ?

Thanks :muscle:t3::rocket:

@Konsti this is fully on our radar :slight_smile: We’re working on export options.

Hi @Konsti:
We have two more export options available now!
You can now export data from an eval (Export button at the top).
You can also re-download your dataset.

Here’s a little script that can extract just the passed rows.

import json

def all_passes_true(passes_dict):
    # Check if all values in the passes dictionary are True
    return all(value is True for value in passes_dict.values())

def extract_items_from_jsonl(file_path):
    seen_data_source_idxs = set()
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            data = json.loads(line)
            
            # Check if 'passes' and 'item' and 'data_source_idx' exist in the line
            if 'passes' in data and 'item' in data and 'data_source_idx' in data:
                # Check if all the passes are True and the data_source_idx hasn't been seen
                if all_passes_true(data['passes']) and data['data_source_idx'] not in seen_data_source_idxs:
                    # Print or otherwise handle the 'item'
                    print(json.dumps(data['item'], ensure_ascii=False))
                    seen_data_source_idxs.add(data['data_source_idx'])

if __name__ == "__main__":
    input_file = "input.jsonl"
    extract_items_from_jsonl(input_file)

(yes o1 did write this script, but I tested it!)
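
If you want the filter-then-fine-tune workflow from above, one simple way to turn that output into a new dataset is to redirect the printed items into a file. This snippet assumes it lives in the same file as the script above (so extract_items_from_jsonl is in scope); the file names are just examples.

import contextlib

# Write each passed item as one line of a new JSONL file instead of
# printing it to the terminal.
with open("passed_items.jsonl", "w", encoding="utf-8") as out:
    with contextlib.redirect_stdout(out):
        extract_items_from_jsonl("input.jsonl")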
