“Default ChatGPT”, “ChatGPT-4”, and “Legacy ChatGPT” are the ChatGPT versions available to Plus users. The ChatGPT API models I tested are gpt-3.5-turbo and gpt-3.5-turbo-0301. The API and Legacy ChatGPT give similarly wrong answers.
Thanks for the clarification. It makes sense that the gpt-4 model is better on average, as it’s the newest one. As for the others: they are probably similar versions of the same base model, but you may still find differences between them.
Especially because gpt-3.5-turbo in the API depends on a system message, which might not be the same one used by the ChatGPT interface with the “Default” model.
In general, I discourage experimenting via the ChatGPT interface when you are developing and need reproducible results. The Playground and the API exist for that purpose.
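To illustrate, here is a minimal sketch of what "reproducible via the API" can mean in practice: pinning a model snapshot, setting your own system message, and using temperature 0. The system prompt text and the helper function below are my own placeholders, not anything ChatGPT actually uses (its real system message is not public); the commented-out call shows the 2023-era `openai` SDK method.

```python
# Sketch: build an explicit chat-completion request so every knob that can
# change the answer (model snapshot, system message, temperature) is pinned.
# The system prompt below is a placeholder assumption, NOT ChatGPT's real one.

MODEL = "gpt-3.5-turbo-0301"  # pinned snapshot, not the floating "gpt-3.5-turbo"

def build_request(question: str) -> dict:
    """Assemble a reproducible request payload (hypothetical helper)."""
    return {
        "model": MODEL,
        "temperature": 0,  # greedy-ish decoding: minimizes run-to-run variation
        "messages": [
            # In the API *you* control the system message; the ChatGPT UI
            # injects its own, which is one possible source of differing answers.
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is the capital of Australia?"
             if not question else question},
        ],
    }

payload = build_request("What is the capital of Australia?")
# With the official 0.x Python SDK you would then call something like:
#   openai.ChatCompletion.create(**payload)
```

Even with temperature 0 the API is not guaranteed to be bit-for-bit deterministic, but it removes the hidden variables the ChatGPT interface adds on top.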
The weirdest part is that Legacy and Default ChatGPT do not agree with each other on this specific question. They should be similar, but they give completely contrary answers. I thought it might be a bug, not just a model difference.
I see your point. However, these models have not been trained to be factual, but to produce sequences of tokens that sound reliable, given the existing context.
Slight differences in models or parameters can result in huge differences in output. The accuracy of the generated response is simply not a priority for these models (although this behavior can be corrected after pre-training via RLHF and other techniques).