Indeed,
I’m wondering if the “girls checked the freezer” in your example is a temperature-related hallucination. Have you tried a lower temperature?
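For anyone wondering why temperature would matter here, this is a rough sketch of the usual temperature-scaled softmax used in sampling. The tokens and scores are made up for illustration and are not taken from any real model; the function name is just my own.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical next-token scores (assumed values, for illustration only)
logits = {"fridge": 2.0, "pantry": 1.5, "freezer": 0.5}

for t in (1.0, 0.5, 0.2):
    probs = softmax_with_temperature(list(logits.values()), t)
    print(t, {tok: round(p, 3) for tok, p in zip(logits, probs)})
```

As the temperature drops, the probability mass concentrates on the highest-scoring token, so an unlikely continuation such as "freezer" gets sampled far less often. That is why rerunning the same prompt at a lower temperature is a cheap way to check whether the odd output was a sampling fluke.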
By rephrasing the question you’re also adding more context for the model to work with. I think it’s a good idea, but it sorta raises the question:
Are we actually testing the model’s performance, or are we testing the researcher’s ability to use it?
This question has been nagging me as well. The methodology I’ve ended up using for tests is based on instructions provided by one human to another, in an attempt to remove this variable.