A few questions about evals? (GitHub OpenAI/evals)

Please do not respond unless you have actual facts, please no I think answers.

OpenAI accepts evals

Evals is a framework for evaluating OpenAI models and an open-source registry of benchmarks.

First question

This is not a specific question because of a lack of details, here are some questions that aim to clarify the information being sought for the first question about evals:

  • Which models will evals be used for? (e.g. GPT-4, GPT-5, future model(s))
  • Will evals be used to enhance existing models? (e.g. GPT-3, GPT-3.5)
  • Are evals only used for training new models?
  • When will the use of evals impact the public models? For example, if evals are accepted and used for training GPT-5, the results would be available when GPT-5 is released. If evals are accepted and used for training GPT-4, which is already released, when would the results be available?

This should give a clearer picture of the information being sought for the first question about evals.

Second question

As a programmer evals are obviously reminiscent of test cases and yet the are named evals which begs the question, instead of just listing a subset of results, could the eval instead be setup to call an expression in a programming language that can generate the full set of results and sample as many of the results as needed?

For example if an eval is created to test the case of a symbol such as a letter, then a would result in lower and A would result in upper. Now if one thinks beyond ASCII such as Unicode then the set of upper an lower cases grows immensely. So instead of listing all of the evals, it would make more sense to use Unicode categories, e.g. Ll and Lu.

I know this really does not fit the Discourse category General API discussion, it does not really fit any Discourse category, so this one was chosen.

All we get on the GitHub is inside the disclaimer:

“OpenAI reserves the right to use this data in future service improvements to our product.”

Meaning OpenAI can use evals you submit to improve their models as long as they see fit.

As for when the use of evals will impact public models, it depends on the specific model and development timeline. If evals are used to train GPT-5, the results would be available when GPT-5 is released. However, if evals are used to improve GPT-4, which is already released, the results would be integrated in future updates or iterations of the model. The exact timeline for these updates may vary and is subject to OpenAI’s development schedule.


On your second question, LLMs rate answers based on semantic similarity. So things like an upper- or lowercase A are not relevant. The tokens would be semantically very similar if not identical. Each token or set of tokens is a vector and semantically similar vectors are the same whether they are spelled or worded differently, or in completely different languages. There isn’t really an exact parallel to programming in that sense. I hope i’m answering your question. I pretty sure that the right answer, someone correct me if I’m wrong.

Good enough for the details (or lack thereof) in the question.

My take on why you note that is that the tokenization doesn’t care about upper or lower case because LLMs are typically created to understanding written English, (not biased against other languages it is just the norm), so creating separate tokens for upper and lower case is not efficient.

The reason for specifically noting Ll and Lu is that I am curious to see if LLMs can understand processing models for programming languages, in particular can an LLM understand recursion? (That is a rhetorical question, I know trying to compare an LLM with a programming language is like trying to compare and iPhone with a fish.) In that light I plan to see if any of the models can do untyped lambda calculus, typed lambda calculus and also syntactic unification. It was the syntactic unification as used with Prolog that needed Ll and Lu to differentiate between atoms and variables.


1 Like

I think the question of whether LLMs can understand programming concepts has been, at least partially, answered in products like Github Copilot and one I hope I get access to soon, OpenAI’s Code Interpreter. As far as the calculus goes, LLMs have a weakness in the area of math. But aided by plugins and tools like Langchain, that access APIs like Wolfram Alpha, they’ll probably excel in time.

I don’t understand why you would ask speculative questions but restrict any speculative answers when we have the same level of access to information as you. (actually, I believe ( :face_with_open_eyes_and_hand_over_mouth: ) you have more than us) Why not allow for speculation and community discussion?

Agreed. But they would be more similar to fine-tuning. They were originally released with GPT-4, with failing tests (I believe it was <70%) allowing access to GPT-4 via API - which was brutal because at the time we couldn’t even test it yet with GPT-4. You can use evals with any of the API models you have access to.

For your other questions, you can communicate directly with the developers on the Github for your answers.

You can create/download evals and run them privately if you wanted to benchmark GPT’s capability with your prompts. As they are, they would definitely (because I’m a sith that only deals in absolutes apparently) be used to identify weak points and improve GPT.

Because most of what is on this site, billing questions, is not what this Discourse forum was intended. Also there are just to many “me to” replies or asking the same question that has been asked. It is not uncommon for such to happen on other forums, someone starts a topic and then others go off-topic and there is nothing that can be done. If you want to start a similar topic with an open discussion please feel welcome, just don’t reference this topic.

I considered that but I don’t really see the OpenAI staff responding to issues and such for the eval repol

One of the things I am trying to understand is if LLMs can do evaluations of recursion and then if that is true would it be worth the trouble to create all of the evals for doing Lambda Calculus and Prolog. It would have to start from the ground up with properly parsing the input understanding the different processing model, etc. Not something I want to invest time in if it will not be used to update the production LLMs. I know I can do my own LLM but I would rather see it be made public with a widely used LLM.


That’s very fair. Great questions & I hope someone with appropriate expertise can answer them. I apologize for my snarky comment. What you’re suggesting would demand an overhaul to their current eval system, and wouldn’t fit with the ideology. I can’t see how they could even use it in their training.

1 Like

I find this very interesting:

But it’s also a more complex topic and that involves specialized knowledge about advanced math to answer.
If we go back to the example from your second question, it might be a bit easier to understand:

Here’s some python spaghetti I made according to your example:

import openai
import unicodedata
import random

openai.api_key = 'your-api-key'

def unicode_case(symbol):
    category = unicodedata.category(symbol)
    if category == 'Ll':
        return 'lower'
    elif category == 'Lu':
        return 'upper'
        return 'unknown'

def unicode_case_openai(symbol):
    response = openai.Completion.create(
      prompt=f"What is the Unicode category of the symbol '{symbol}'?",

    response_text = response['choices'][0]['text'].strip()
    if 'lowercase letter' in response_text:
        return 'lower'
    elif 'uppercase letter' in response_text:
        return 'upper'
        return 'unknown'

def generate_symbols(n):
    return [chr(random.randint(0, 10000)) for _ in range(n)]

# Generate a test set of 100 symbols
symbols = generate_symbols(100)

total = len(symbols)
correct = 0
for symbol in symbols:
    expected = unicode_case(symbol)
    actual = unicode_case_openai(symbol)
    if expected == actual:
        correct += 1

print(f'Correct: {correct}/{total} ({correct/total*100:.2f}%)')

This script generates a list of 100 random Unicode characters and compares the case (upper, lower, or unknown) of each symbol as determined by the Python unicodedata library with the case determined by an OpenAI API call. The percentage of matches between these two methods is then calculated and printed. (note: this example is not an eval, just a script to capture the essence of your request to make it more verbose)

It looks fine on the surface, but there one issue here:

def generate_symbols(n):
    return [chr(random.randint(0, 10000)) for _ in range(n)]

# Generate a test set of 100 symbols
symbols = generate_symbols(100)

The issue is that we won’t know if the tests are comparable when we use the generate_symbols() function multiple times for multiple tests. We can make the test repeatable by moving the function outside the script. We can then use it to generate the symbols list:

# Test set of symbols
symbols = ['a', 'A', 'b', 'B', 'α', 'Α', '1', '!', ' ', 'ζ', 'Δ', and so on]

The tests will now be the same when this is inserted in our original script instead of the generate_symbols() function.

I’m leaning towards yes, but you have to be very careful about repeatability.

I hope this helps answer your question :laughing:

1 Like

Thanks for the example. It is nice to see.

It would be nice to hear back from the OpenAI staff on this so that I can decide if it is worth the effort. After posting the question I never pursued it with more effort because it would just be wasted effort if the evals were not used to train/adjust the production LLMs and then only if the LLMs gave consistent answers, think temperature 0.