Difference between old and new model

20euai044 · January 10, 2024, 7:19am

Are there any API code or model differences between the old and new versions of the gpt-3.5-turbo 16k model?

All of them are confusing in terms of pricing and their different versions:

gpt-3.5-turbo-0301
gpt-3.5-turbo-0613
gpt-3.5-turbo-1106
gpt-3.5-turbo-16k
gpt-3.5-turbo-16k-0613

Foxalabs · January 10, 2024, 11:45am

Hi,

Yes, the models differ in fine-tuning and in context size (4k, 8k, 16k, 32k, 128k, for example). Turbo models are the latest iteration and have the best price-to-performance ratio, but are limited to a 4k output size. The non-Turbo models are typically symmetric in output context size but are trained to be concise with answers.

Do you have any specific question about which model to use?

20euai044 · January 11, 2024, 5:51am

Thanks @Foxalabs. I also need to know the difference between the gpt-3.5-turbo 4k and 16k versions.

For the 16k version, old vs. new model: is it only a cost difference, or is performance reduced because of the lower cost?

Foxalabs · January 11, 2024, 11:40am

The difference between the 3.5 Turbo 4k and 16k models is context size: the 4k model can take 4k tokens spread over input and output, while the 16k model can handle 16k, or 4 times as much. The model performance is dictated by your account's tier, i.e. 1-5. Tiers 3 and up should have similar performance, while Tiers 1 and 2 may be located on servers with a larger latency.
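As a rough sketch of what "spread over input and output" means in practice: whatever the prompt uses is no longer available for the reply. The window sizes below are nominal round values assumed for illustration, the helper name is hypothetical, and the sketch ignores any separate cap on output size.

```python
# Nominal context windows in tokens (assumed round values, not official figures).
CONTEXT_WINDOW = {"4k": 4096, "16k": 16384}


def max_output_tokens(variant: str, prompt_tokens: int) -> int:
    """Tokens left for the reply once the prompt has used its share of the window."""
    window = CONTEXT_WINDOW[variant]
    if prompt_tokens > window:
        raise ValueError("prompt does not fit in the context window")
    return window - prompt_tokens


# The same 3000-token prompt leaves very different room for the answer.
for variant in CONTEXT_WINDOW:
    print(variant, max_output_tokens(variant, 3000))
```

With a 3000-token prompt, the 4k variant has roughly a thousand tokens left for the answer, while the 16k variant still has most of its window free.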

Diet · January 11, 2024, 1:07pm

By performance, you mean non-functional aspects:

Time to First Token
Tokens Per Second Generated

For actual functional performance we have to look at leaderboards as an empirical indication (if that’s your question, @20euai044).

Unfortunately, there’s not enough evidence to compare the 4k and 16k versions of the same model, but if we assume functional performance to be the same, we get the following results:

Rank	Model	Bootstrap Median of MLE Elo	Bootstrap Median of Online Elo	Input CPKT (cost per kilotoken)	Output CPKT (cost per kilotoken)	Rank 1 win rate	Input relative CPKT	Output relative CPKT
1	gpt-4-turbo	1249	1249	0.01	0.03	1	0.01	0.03
2	gpt-4-0314	1191	1189	0.03	0.06	0.7	0.042857143	0.085714286
3	gpt-4-0613	1159	1164	0.03	0.06	0.68	0.044117647	0.088235294
4	claude-1	1149	1148
5	mistral-medium	1149	1146
6	claude-2.0	1131	1131	0.008	0.024	0.71	0.011267606	0.033802817
7	mixtral-8x7b-instruct-v0.1	1123	1125
8	gemini-pro-dev-api	1122	1120	0.00025	0.0005	0.78	0.000320513	0.000641026
9	claude-2.1	1118	1119	0.008	0.024	0.76	0.010526316	0.031578947
*10	gpt-3.5-turbo-0613	1116	1116	0.0015	0.002	0.77	0.001948052	0.002597403
*10	gpt-3.5-turbo-0613 16k	1116	1116	0.003	0.004	0.77	0.003896104	0.005194805
11	gemini-pro	1115	1117
12	yi-34b-chat	1110	1110
13	claude-instant-1	1109	1109	0.0008	0.0024	0.74	0.001081081	0.003243243
14	tulu-2-dpo-70b	1106	1105
*15	gpt-3.5-turbo-0314	1105	1106	0.0015	0.002	0.77	0.001948052	0.002597403
16	wizardlm-70b	1104	1110
17	vicuna-33b	1093	1095
18	starling-lm-7b-alpha	1091	1092
19	llama-2-70b-chat	1080	1079
20	openhermes-2.5-mistral-7b	1078	1079
21	openchat-3.5	1077	1078
22	llama2-70b-steerlm-chat	1075	1075
*23	gpt-3.5-turbo-1106	1073	1073	0.001	0.002	0.83	0.001204819	0.002409639
24	pplx-70b-online	1072	1070
25	dolphin-2.2.1-mistral-7b	1065	1071
26	solar-10.7b-instruct-v1.0	1064	1064
27	wizardlm-13b	1057	1055
28	zephyr-7b-beta	1049	1047

Relative scores are computed as cost per kilotoken (CPKT) divided by the rank 1 win rate, although it could be argued that this formula is way too generous.
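The relative columns can be reproduced with a few lines of Python. This is only a sketch of the formula as described, using the gpt-3.5 rows copied from the table; the function and variable names are made up for illustration.

```python
# (input CPKT, output CPKT, rank 1 win rate), copied from the table above.
ROWS = {
    "gpt-3.5-turbo-0613": (0.0015, 0.002, 0.77),
    "gpt-3.5-turbo-0613 16k": (0.003, 0.004, 0.77),
    "gpt-3.5-turbo-0314": (0.0015, 0.002, 0.77),
    "gpt-3.5-turbo-1106": (0.001, 0.002, 0.83),
}


def relative_cpkt(cpkt: float, rank1_win_rate: float) -> float:
    """Cost per kilotoken divided by the rank 1 model's win rate against this model."""
    return cpkt / rank1_win_rate


for model, (cpkt_in, cpkt_out, win_rate) in ROWS.items():
    rel_in = relative_cpkt(cpkt_in, win_rate)
    rel_out = relative_cpkt(cpkt_out, win_rate)
    print(f"{model}: input {rel_in:.9f}, output {rel_out:.9f}")
```

The printed values should match the last two columns of the table for those rows.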

In conclusion, it seems like gpt-3.5-turbo-0613 may offer the best bang for the buck among the 3.5 series, but it obviously depends on your application.

This should be considered a back of the envelope type of situation, and not authoritative by any means.

Foxalabs · January 11, 2024, 1:13pm

That metric is saying that GPT-4 is some 50 points ahead of gpt-3.5 on a scale of 1000… meaning it’s 5% better… anyone who has used GPT-4 knows that it is considerably more than 5% better than 3.5. I question the point of this number.

Diet · January 11, 2024, 1:16pm

It’s an ELO scale. GPT-4 turbo beats gpt-3.5-turbo-0613 0.77 of the time, you could ‘say’ it’s ‘better’ by a factor of ~4.
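For anyone who wants to relate the rating gap to a win rate, here is a minimal sketch using the standard Elo expected-score formula (base 10, 400-point scale). Whether the leaderboard uses exactly this scale is an assumption, and the function names are hypothetical. The "factor" is read here as the odds of winning versus losing.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of A against B (a draw counts as half a point)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def win_odds(win_rate: float) -> float:
    """Convert a win rate into odds, i.e. wins per loss."""
    return win_rate / (1.0 - win_rate)


# Ratings taken from the table: gpt-4-turbo (1249) vs gpt-3.5-turbo-0613 (1116).
expected = elo_expected_score(1249, 1116)
print(f"expected score from the Elo gap: {expected:.3f}")
print(f"odds at a 0.77 win rate: {win_odds(0.77):.2f}")
```

Under these assumptions the 133-point gap predicts a somewhat lower expected score than the 0.77 in the table, so the two numbers are related but not interchangeable.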

Foxalabs · January 11, 2024, 1:16pm

Right, but that’s not the number used for the ranking, is it?

Diet · January 11, 2024, 1:21pm

The whole thing is based on playing them against each other and computing their relative ELOs.

So there’s no number per se, just which model wins against the others most reliably.

Foxalabs · January 11, 2024, 1:26pm

The upshot is, anyone from outside of the discipline sees the “Bootstrap Median of MLE Elo” and comes to the conclusion that GPT-4 is only a few points ahead of anything else. Sure, if you drill down into what it all means you see the bigger picture, but you can’t tell me that is not a confusing metric for those choosing an LLM to use who are not experts. It’s the kind of number that gets tweets/X’s and I’m still not sure what it’s telling me from the INSIDE!

Diet · January 11, 2024, 1:38pm

Mmh, fair sentiment

ELO (actually Elo) is used in a lot of gaming matchmaking environments to see who is closest to whom.

In chess, for example, Magnus Carlsen has a rating of 2830, while Hikaru Nakamura has a rating of 2788. While percentage-wise that’s a tiny difference, I imagine Carlsen will wipe the floor with Nakamura 99% of the time in standard chess (I know very little about chess).
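As a quick check on those two ratings, the standard Elo expected-score formula (base 10, 400-point scale) can be worked through by hand. The expected score counts a draw as half a point, so it is not strictly a win probability.

```latex
E_{\text{Carlsen}}
  = \frac{1}{1 + 10^{(R_{\text{Nakamura}} - R_{\text{Carlsen}})/400}}
  = \frac{1}{1 + 10^{(2788 - 2830)/400}}
  = \frac{1}{1 + 10^{-0.105}}
  \approx 0.56
```

So under the formula itself, a 42-point gap corresponds to an expected score of roughly 0.56, a much narrower edge than a 99% guess would suggest.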

That said, most kids should conceptually understand ELO these days. Whether they played Starcraft, Counterstrike, Fortnite or whatever, they’ll just call it MMR.

All that said and done, do you know a straightforward, reliable and meaningful score that isn’t getting gamed (all you need is validation data style)?
