Please do not respond unless you have actual facts; no "I think" answers, please.
OpenAI accepts evals.
Evals is a framework for evaluating OpenAI models and an open-source registry of benchmarks.
First question
Due to a lack of public details, this is not one specific question; instead, here are several questions that aim to clarify the information being sought about evals:
- Which models will evals be used for? (e.g. GPT-4, GPT-5, future model(s))
- Will evals be used to enhance existing models? (e.g. GPT-3, GPT-3.5)
- Are evals only used for training new models?
- When will the use of evals impact the public models? For example, if evals are accepted and used for training GPT-5, the results would presumably be available when GPT-5 is released. But if evals are accepted and used for improving GPT-4, which is already released, when would the results be available?
Answers to these should give a clearer picture of the information being sought.
Second question
As a programmer, evals are obviously reminiscent of test cases, and yet they are named evals, which raises the question: instead of just listing a subset of results, could an eval instead be set up to call an expression in a programming language that generates the full set of results, so that as many of them as needed can be sampled?
For example, if an eval is created to test the case of a symbol such as a letter, then `a` would result in `lower` and `A` would result in `upper`. Now if one thinks beyond ASCII to Unicode, the set of upper- and lowercase letters grows immensely. So instead of listing all of the evals, it would make more sense to use Unicode categories, e.g. `Ll` and `Lu`, as the sketch below illustrates.
I know this really does not fit the Discourse category General API discussion; in truth, it does not really fit any Discourse category, so this one was chosen.