GPT-3.5-turbo-1106 is worse than 0613 version

aiwl · April 25, 2024, 6:37pm

0613 is ranked 25th on the LMSYS leaderboard, whereas 1106 is ranked 46th. Even 0125 and 0314 are ranked higher.

The sample size is over 50,000 votes. At that size, it’s hardly subjective. LMSYS works on an Elo system, where users pick the better of two LLM responses to their own queries. The results can be interpreted as 0613 winning more often against other models than 1106 has, implying that 0613 is superior.
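For anyone unfamiliar with how pairwise votes turn into a ranking, here is a minimal sketch of a textbook Elo update. It is only an illustration: the function names `expected_score` and `update_elo`, the K-factor of 32, the starting rating of 1000, and the model names are all my own assumptions, and the leaderboard's actual computation may differ.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(ratings: dict, winner: str, loser: str,
               k: float = 32.0, initial: float = 1000.0) -> dict:
    """Apply one pairwise vote. k and initial are assumed values."""
    r_win = ratings.get(winner, initial)
    r_lose = ratings.get(loser, initial)
    # The winner gains more when the win was less expected.
    delta = k * (1.0 - expected_score(r_win, r_lose))
    ratings[winner] = r_win + delta
    ratings[loser] = r_lose - delta
    return ratings


# Hypothetical votes as (winner, loser) pairs; not real leaderboard data.
votes = [("model-a", "model-b"), ("model-a", "model-b"), ("model-b", "model-a")]
ratings = {}
for winner, loser in votes:
    update_elo(ratings, winner, loser)

# Sort from highest to lowest rating to get a ranking.
ranking = sorted(ratings, key=ratings.get, reverse=True)
```

With enough votes, a model that wins more of its head-to-head comparisons ends up with a higher rating, which is the sense in which the ranking gap between 0613 and 1106 reflects win rates.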

Also, 1106 ranks lower than models that I can literally run on my laptop, whereas this is not the case for 0613.

