How do you assess the capabilities of new models?

I’m curious how other developers and users evaluate what a new model can and cannot do when it first launches.

One of my own tests is asking each new model to generate a small pygame-ce script that uses the experimental SDL2 video module, pygame._sdl2.video (which is fantastic, by the way). The docs exist and there are a few public examples, yet many models still fall back to outdated patterns or confabulate the whole reply outright. It is only a single data point, but it shows how reliably a model handles niche, sparsely documented scenarios in a zero-shot setting.
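
For reference, here is a minimal sketch of the kind of script I'm testing for, assuming pygame-ce with the experimental pygame._sdl2.video module available; the window title, size, and colors are arbitrary:

```python
# Minimal pygame-ce sketch using the experimental SDL2 video module.
import pygame
from pygame._sdl2.video import Window, Renderer, Texture

pygame.init()
window = Window("SDL2 video test", size=(640, 480))
renderer = Renderer(window)

# Draw onto a regular Surface, then upload it to the GPU as a Texture.
surface = pygame.Surface((640, 480))
surface.fill((30, 30, 60))
pygame.draw.circle(surface, (200, 120, 40), (320, 240), 100)
texture = Texture.from_surface(renderer, surface)

running = True
while running:
    for event in pygame.event.get():
        if event.type == pygame.QUIT:
            running = False
    renderer.clear()
    texture.draw()  # no dstrect: blit the texture across the whole target
    renderer.present()

pygame.quit()
```

Models that haven't absorbed this module tend to answer with the classic pygame.display.set_mode() pattern instead, which is exactly what the test is designed to catch.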

What’s your approach? Do you use domain-specific tasks, known benchmarks, edge-case prompts, or something else?


You mean like how I would ask it to write me a cipher and see if it's solvable, to check how well its memory and logic are running?
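
Something like this hypothetical round trip is what I mean by "solvable" (the Caesar shift is just a stand-in for whatever cipher the model invents):

```python
# Hypothetical round-trip check: the cipher counts as "solvable" if
# decrypt(encrypt(x)) == x recovers the original text.
def encrypt(text: str, shift: int = 7) -> str:
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord("A") if ch.isupper() else ord("a")
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)  # leave punctuation and spaces untouched
    return "".join(out)

def decrypt(text: str, shift: int = 7) -> str:
    return encrypt(text, -shift)

assert decrypt(encrypt("Attack at dawn!")) == "Attack at dawn!"
```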


Yes, that is one approach.
I can also run my existing services on another model and compare latency, cost, and quality as overall metrics.
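
As a rough sketch of that side-by-side run; call_model, the model names, and the per-1K-token rates below are placeholders, not real figures:

```python
# Rough sketch of a side-by-side service comparison. `call_model`, the
# model names, and the per-1K-token rates are placeholders.
import time

PRICE_PER_1K_TOKENS = {"model-a": 0.010, "model-b": 0.002}  # placeholder rates

def call_model(model: str, prompt: str) -> str:
    # Replace with a real API call from your provider's SDK.
    return f"[{model}] echo: {prompt}"

def compare(prompt: str) -> None:
    for model in ("model-a", "model-b"):
        start = time.perf_counter()
        answer = call_model(model, prompt)
        latency = time.perf_counter() - start
        tokens = len(answer.split())  # crude token estimate
        cost = tokens / 1000 * PRICE_PER_1K_TOKENS[model]
        print(f"{model}: {latency:.3f}s, ~${cost:.5f}, {tokens} tokens")
        # Quality still needs human review or an LLM-as-judge pass.

compare("Summarize SDL2's renderer API in two sentences.")
```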

But I am always looking for use cases that were previously out of reach or only possible with extra model calls or scaffolding.

I’ll keep some weekend momentum going.

Vibes.

[screenshots: side-by-side replies from AI 1 (ChatGPT) and AI 2; ChatGPT is not amused]

Quality

[more screenshots: AI 2, then AI 1]

That is, quality without corrections.

In my practical experience, high-percentage isopropyl alcohol (IPA) makes tree sap, even sap hardened into amber, just melt away. ChatGPT was run first, so I was not being informed by Claude beforehand.

Plus, strange talking sticks…

Good/bad reinforcement learning feedback time … when there's a good response B produced.

Comprehension

This is a pervasive regression in GPT-5: the prompt simply not being understood, the intention behind it missed, or the intent of submitted code overlooked entirely.
