I’d like to raise some concerns and report possible bugs I’ve encountered with the current evals system.
According to the stated policy, users are supposed to receive 7 free weekly evals (excluding tool-use models). However, I was billed for some eval runs even though I hadn’t exceeded this weekly limit. Since I use a wide range of models, including GPT-4.5, these runs have sometimes incurred unexpectedly high costs. I’ve also observed that billing sometimes begins partway through a run; I’m not sure whether there is a token limit for a single eval run, or whether this is simply a delay in the billing process.
Overall, the program feels quite opaque. As an academic researcher working on evals, I was genuinely excited about this program. However, not being able to see how many free evals remain, or what each run costs, makes it very difficult to manage my usage and avoid unforeseen charges. Unfortunately, this lack of transparency has already resulted in fees of about $1,000, which is a significant amount for a PhD student.
I believe greater clarity here would benefit both users and OpenAI. I hope these issues can be addressed to make the system more accessible and user-friendly for the research community.