The code I just reposted again is producing results like:
and
It’s now even throwing in misaligned answers at times even when steered with the You are gpt-4-turbo-2024-04-09 system prompt. Which, again, proves the point I’ve been trying to make here on just how wonky the model’s functionality is, and how the problem is accumulating – like the previous commenter mentioned, this is “nothing new”, but in the sense of becoming worse and more prominent. the factual errors are just piling up… again, if you don’t care about anything factual, then it might not be as big of a problem. That doesn’t negate the fact the model is off kilter somehow.
A lot of things can change over months as you can see from the example. If you’re completely unconcerned about the model’s factual accuracy (or any sort of accuracy) at a base level, then I suppose it’s not a problem, “who cares?”
What I’m worried about on top of the factual misrepresentation is that it’ll get more skewed in anchoring itself on false axioms etc. and that it’s not just getting the cutoff date wrong, and hence facts within the time frame wrong, hence this can show up in other benchmarks and performance metrics of the model, like already stated in these initial benchmarks like I have already mentioned in this thread:
The mentioned problems the model is exhibiting in the test metrics is something that can and likely will show itself in other capabilities of the model; again, see link above.

