I’m curious how other developers and users evaluate what a new model can and cannot do when it first launches.
One of my own tests is to ask each new model for a small Pygame-CE script that uses the experimental SDL2 video module (which is fantastic, by the way); roughly something like the sketch below. The docs exist and there are a few public examples, yet many models still fall back to outdated patterns or outright confabulate the whole reply. It's only a single data point, but it shows how reliably a model handles niche, sparsely documented scenarios in a zero-shot setting.
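For reference, this is roughly the shape of answer I'm hoping to see: a minimal sketch using `Window`, `Renderer`, and `Texture` from `pygame._sdl2.video`. The module is experimental and its exact names and signatures have shifted between pygame-ce releases, so treat this as illustrative rather than canonical.

```python
import pygame
from pygame._sdl2.video import Window, Renderer, Texture

pygame.init()

# Hardware-accelerated window and renderer via the experimental SDL2 API.
window = Window("sdl2 video test", size=(640, 480))
renderer = Renderer(window, vsync=True)

# Draw onto a plain Surface, then upload it to the GPU as a Texture.
surface = pygame.Surface((128, 128))
surface.fill((200, 60, 60))
texture = Texture.from_surface(renderer, surface)

clock = pygame.time.Clock()
running = True
while running:
    for event in pygame.event.get():
        if event.type == pygame.QUIT:
            running = False

    renderer.draw_color = pygame.Color(20, 20, 30, 255)
    renderer.clear()
    texture.draw(dstrect=pygame.Rect(256, 176, 128, 128))
    renderer.present()
    clock.tick(60)

pygame.quit()
```

Models that get this right tend to use the Renderer/Texture path; the ones that don't usually fall back to `pygame.display.set_mode` plus Surface blits, which is exactly the outdated pattern the test is meant to catch.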
What’s your approach? Do you use domain-specific tasks, known benchmarks, edge-case prompts, or something else?
In my practical experience, high-percentage IPA makes tree sap, even when it has hardened into amber, simply melt away. I ran ChatGPT first, so my judgment wasn't informed by Claude's answer.
Plus, strange talking sticks…
Good/Bad reinforcement learning time … when there’s a good B produced.
Comprehension
This is a pervasive regression in GPT-5: prompts simply not being understood, the intent behind them being missed, or even the intent of code being overlooked completely.