Prompt to review openai/evals PRs against the acceptance criteria

Some repos get hit by a deluge of PRs, and maintainers have a hard time keeping up.

The openai/evals repo is definitely one of them, with 462 open PRs and only a handful getting any sort of feedback.

So, in that vein, I've been trying to craft a prompt that folks can run to get some feedback on their PR. Ideally the maintainers would drive/help with this prompt, but they are clearly busy, so I'd love to get feedback from folks here as well.

You can see the acceptance criteria for PRs in the build-eval.md doc in the openai/evals repo on GitHub.

I'm initially focusing on the most common case, 'match' evals, where the model's response must exactly match the 'ideal' answer. We can extend the prompt to the other eval templates once we have a good version for this case.
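For clarity, this is roughly what 'match' scoring means. A minimal sketch only, not the actual implementation in openai/evals:

```python
# Minimal sketch of exact-match scoring, NOT the actual openai/evals
# Match implementation -- just to illustrate what these evals check.
def is_match(completion: str, ideal: str) -> bool:
    # The model's response has to reproduce the ideal answer exactly
    # (whitespace-stripped here for illustration).
    return completion.strip() == ideal.strip()

print(is_match("Paris", "Paris"))                # True
print(is_match("The answer is Paris.", "Paris")) # False: extra text breaks an exact match
```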

The prompt takes as input the sampled eval JSON, usually around five samples, that contributors provide in the PR template. I keep it to that many so as not to overwhelm GPT-4 and cause its attention to wander, though perhaps that's not optimal?
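If you need to pull a subset of samples yourself rather than copying from the PR template, here is a minimal sketch; the file path and sample count are just made-up examples:

```python
import json
import random

# Hypothetical path -- substitute the samples file from the PR under review.
SAMPLES_PATH = "evals/registry/data/my_eval/samples.jsonl"

with open(SAMPLES_PATH, encoding="utf-8") as f:
    samples = [json.loads(line) for line in f if line.strip()]

# Take ~5 samples so the critique prompt stays short and focused.
subset = random.sample(samples, k=min(5, len(samples)))
print(json.dumps(subset, ensure_ascii=False, indent=2))
```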

The prompt (with an example PR) so far:

Please critique this model eval based on just the sample of questions and responses given.
Assume that the system content is the system prompt used to guide GPT-4.
The user content is the question being asked, and 'ideal' is the answer that the response must match exactly.
Do not critique such things as diversity, sample size, metrics, or asking to include more information. Keep the critique specific to this rubric:

  1. The eval should be thematically consistent. We’d like to see a number of prompts all revolving around the same use case, subject domain, failure mode, etc.
  2. The eval should be challenging. If GPT-4 or GPT-3.5-Turbo do well on all of the prompts, this is not as interesting. Of course, the eval should also be possible given the models’ limitations and constraints. Oftentimes, a good rule of thumb is whether a human (potentially a subject expert) could do well on the prompts.
  3. The eval should be directionally clear. The data should include good signal around what is the right behavior. This means, for example, high-quality reference answers or an exhaustive rubric for evaluating answers.
  4. The eval should be carefully crafted. Before you submit, you should think through whether you have engineered your prompts for good performance, whether you are using the best eval template, whether you have spot checked your results to ensure accuracy, etc.

Here is the eval data to be reviewed:

[{'input': [{'role': 'system', 'content': 'You will be presented with one sentence of the Chinese Tang poetry to read. Answer who is the author, and strictly only contain a single word and in the same language'}, {'role': 'user', 'content': '慈母手中线,游子身上衣。'}], 'ideal': '孟郊'},
 {'input': [{'role': 'system', 'content': 'You will be presented with one sentence of the Chinese Tang poetry to read. Answer who is the author, and strictly only contain a single word and in the same language'}, {'role': 'user', 'content': '烟笼寒水月笼沙,夜泊秦淮近酒家。'}], 'ideal': '杜牧'},
 {'input': [{'role': 'system', 'content': 'You will be presented with one sentence of the Chinese Tang poetry to read. Answer who is the author, and strictly only contain a single word and in the same language'}, {'role': 'user', 'content': '白日依山尽,黄河入海流。'}], 'ideal': '王之涣'},
 {'input': [{'role': 'system', 'content': 'You will be presented with one sentence of the Chinese Tang poetry to read. Answer who is the author, and strictly only contain a single word and in the same language'}, {'role': 'user', 'content': '寂寞天宝后,园庐但蒿藜。'}], 'ideal': '杜甫'},
 {'input': [{'role': 'system', 'content': 'You will be presented with one sentence of the Chinese Tang poetry to read. Answer who is the author, and strictly only contain a single word and in the same language'}, {'role': 'user', 'content': '春城无处不飞花,寒食东风御柳斜。'}], 'ideal': '韩翊'}]
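To make the workflow concrete, here is a rough sketch of how the critique prompt could be run with the openai Python client. The file name sampled_eval.json is a placeholder for wherever you keep the handful of samples from the PR template, and the model choice is just a suggestion:

```python
import json

from openai import OpenAI  # assumes the openai>=1.0 Python client

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Abridged here; paste the full four-point rubric from above.
RUBRIC = """1. The eval should be thematically consistent. ...
2. The eval should be challenging. ...
3. The eval should be directionally clear. ...
4. The eval should be carefully crafted. ..."""

CRITIQUE_PROMPT = (
    "Please critique this model eval based on just the sample of questions and responses given.\n"
    "Assume that the system content is the system prompt used to guide GPT-4.\n"
    "The user content is the question being asked, and 'ideal' is the answer that the response "
    "must match exactly.\n"
    "Do not critique such things as diversity, sample size, metrics, or asking to include more "
    "information. Keep the critique specific to this rubric:\n\n"
    f"{RUBRIC}\n\n"
    "Here is the eval data to be reviewed:\n"
)

# Hypothetical file holding the ~5 sampled eval entries from the PR template.
with open("sampled_eval.json", encoding="utf-8") as f:
    eval_samples = json.load(f)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {
            "role": "user",
            "content": CRITIQUE_PROMPT + json.dumps(eval_samples, ensure_ascii=False, indent=2),
        }
    ],
)
print(response.choices[0].message.content)
```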