Evaluations Beta custom eval prompt

I’m trying to evaluate my stored prompts with the new Evaluations feature. I want to use custom evaluations via a prompt, but most of the time all evaluations fail. Is there a way to see the evaluation response so I can check what’s going on? I suspect the issue is that the eval prompt isn’t outputting the exact label.

Also, the docs state that you can use {{}} for placeholders, but which placeholders are available? (input & output?)

The current docs about this are pretty limited…

Thanks a ton :muscle:t3:

It seems that the current state of the Evaluations feature is mega-beta. I can’t even create one (I cannot add testing criteria).

Hi @fede1 and @Konsti, thanks for trying evals out early!
We released a ton of bug fixes and features during November. Would love it if you tried it again; both of those bugs should be fixed.

Top, thank you for the update! There is one feature that would be pretty amazing:
After running the Evaluations, I would like to create a new dataset containing only the completions that passed the evals.

The use case is the following:
We are currently storing all our completions, but not all of them are good quality. We would like to use the Evaluations feature as a first step to select the best ones and use only those for fine-tuning.

The only feature missing to make this workflow work entirely within the OpenAI playground is the ability to generate a new dataset from the completions that passed the evals. Even better would be the ability to add custom metadata tags to those completions.
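
To make it concrete, here is roughly what I have in mind for that filter step, done client-side for now. This is just a sketch: it assumes completions can be exported as JSONL with some pass/fail flag and the original chat messages, and the passed and messages field names are made up for illustration, so they would need to match whatever the export actually contains.

import json

def build_finetune_file(completions_path, output_path):
    # Keep only completions that passed the evals and write them in
    # chat fine-tuning format: one {"messages": [...]} object per line.
    with open(completions_path, "r", encoding="utf-8") as src, \
         open(output_path, "w", encoding="utf-8") as dst:
        for line in src:
            line = line.strip()
            if not line:
                continue
            row = json.loads(line)
            # 'passed' and 'messages' are hypothetical field names.
            if row.get("passed") and "messages" in row:
                dst.write(json.dumps({"messages": row["messages"]}, ensure_ascii=False) + "\n")

if __name__ == "__main__":
    build_finetune_file("completions.jsonl", "finetune.jsonl")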

What do you think @defjosiah ?

Thanks :muscle:t3::rocket:

@Konsti this is fully on our radar :slight_smile: We’re working on export options.

Hi @Konsti:
We have two more export options available now!
You can now export data from an eval (Export button at the top).
You can also re-download your dataset.

Here’s a little script that can extract just the passed rows.

import json

def all_passes_true(passes_dict):
    # Check if all values in the passes dictionary are True
    return all(value is True for value in passes_dict.values())

def extract_items_from_jsonl(file_path):
    seen_data_source_idxs = set()
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            data = json.loads(line)
            
            # Check if 'passes' and 'item' and 'data_source_idx' exist in the line
            if 'passes' in data and 'item' in data and 'data_source_idx' in data:
                # Check if all the passes are True and the data_source_idx hasn't been seen
                if all_passes_true(data['passes']) and data['data_source_idx'] not in seen_data_source_idxs:
                    # Print or otherwise handle the 'item'
                    print(json.dumps(data['item'], ensure_ascii=False))
                    seen_data_source_idxs.add(data['data_source_idx'])

if __name__ == "__main__":
    input_file = "input.jsonl"
    extract_items_from_jsonl(input_file)

(yes o1 did write this script, but I tested it!)
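
If you want the filter-then-fine-tune workflow from above, one simple way to turn that output into a new dataset is to redirect the printed items into a file. This snippet assumes it lives in the same file as the script above (so extract_items_from_jsonl is in scope); the file names are just examples.

import contextlib

# Write each passed item as one line of a new JSONL file instead of
# printing it to the terminal.
with open("passed_items.jsonl", "w", encoding="utf-8") as out:
    with contextlib.redirect_stdout(out):
        extract_items_from_jsonl("input.jsonl")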
