While I am definitely a proponent of evals I think it’s challenging for somebody who is using mainly if not only ChatGPT to recreate the failure mode and the eval accordingly.
It would be great to have some guidance about how to set temperature and top-p, or even more fancy, to export single examples directly from the UI.
Currently the best option for a ChatGPT users is to hit the thumbs down button.
Imagine with 100 million users a month, what if 1% of users actually found something worth looking into, even if it’s just a potential candidate for the Journal of Negative Results.