Difference between old and new model

Is there any difference in API code or model behavior between the old and new versions of the gpt-3.5-turbo 16k model?

All of them are confusing in terms of pricing and their different versions:

gpt-3.5-turbo-0301
gpt-3.5-turbo-0613
gpt-3.5-turbo-1106
gpt-3.5-turbo-16k
gpt-3.5-turbo-16k-0613

Hi,

Yes, the models differ in fine-tuning and in context size (4k, 8k, 16k, 32k, or 128k tokens, for example). Turbo models are the latest iteration and have the best price-to-performance ratio, but are limited to a 4k output size; the non-Turbo models are typically symmetric in output context size but are trained to be concise with their answers.

Do you have any specific question about which model to use?

Thanks @foxabilo, also:

- I need to know the difference between the gpt-3.5-turbo 4k and 16k versions.

- 16k version, old vs. new model: is it only a cost difference, or is performance reduced because of the lower cost?

The difference between the 3.5 Turbo 4k and 16k models is context size: the 4k model can take 4k tokens spread over input and output, while the 16k model can handle 16k, or four times as much. Model performance is dictated by your account's tier (1 to 5): tiers 3 and up should see similar performance, while tiers 1 and 2 may be located on servers with higher latency.
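To make the "spread over input and output" point concrete, here is a minimal sketch of the token budget arithmetic. The window sizes (4096 and 16384) are assumed nominal values for illustration, and the function names are hypothetical; check the current model documentation for exact limits.

```python
# Assumed nominal context windows in tokens (illustrative, not authoritative).
CONTEXT_WINDOW = {"4k": 4096, "16k": 16384}


def max_output_budget(prompt_tokens: int, context_window: int) -> int:
    """Tokens left for the reply once the prompt has been counted."""
    return max(context_window - prompt_tokens, 0)


def fits(prompt_tokens: int, requested_output_tokens: int, context_window: int) -> bool:
    """True if prompt plus requested output fit in one shared window."""
    return prompt_tokens + requested_output_tokens <= context_window


if __name__ == "__main__":
    prompt = 3000  # hypothetical prompt length in tokens
    for name, window in CONTEXT_WINDOW.items():
        print(name, "room for reply:", max_output_budget(prompt, window))
```

With a 3000-token prompt, a 2000-token reply would not fit in the 4k window but would fit comfortably in the 16k one.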

With performance you mean non functional aspects:

  • Time to First Token
  • Tokens Per Second Generated
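A minimal sketch of how these two numbers could be measured for any streamed response. `fake_stream` is a hypothetical stand-in for a real streaming API call, and tokens per second is taken here as the tokens after the first one divided by the time elapsed since the first token, which is only one possible convention.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_stream(tokens: Iterable[str]) -> Tuple[float, float, int]:
    """Return (time_to_first_token_s, tokens_per_second, token_count)."""
    start = time.perf_counter()
    first_token_at = None
    count = 0
    for _ in tokens:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        count += 1
    end = time.perf_counter()

    if first_token_at is None:
        # Nothing was produced, so neither metric is defined.
        return float("nan"), 0.0, 0

    ttft = first_token_at - start
    generation_time = end - first_token_at
    # Rate over the tokens that arrived after the first one.
    tps = (count - 1) / generation_time if count > 1 and generation_time > 0 else 0.0
    return ttft, tps, count


def fake_stream(n: int, first_delay: float, per_token_delay: float) -> Iterator[str]:
    """Hypothetical token source: waits, then yields n tokens at a fixed pace."""
    time.sleep(first_delay)
    for i in range(n):
        if i:
            time.sleep(per_token_delay)
        yield f"tok{i}"


if __name__ == "__main__":
    ttft, tps, n = measure_stream(fake_stream(20, 0.2, 0.02))
    print(f"tokens={n} ttft={ttft:.3f}s tps={tps:.1f}")
```

Running the same measurement against two models (or two account tiers) would give a like-for-like latency comparison.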

For actual functional performance (if that’s your question, @20euai044), we have to look at leaderboards as an empirical indication.

Unfortunately, there’s not enough evidence to compare the 4k and 16k versions of the same model, but if we assume functional performance to be the same, we get the following results:

Rank | Model | Bootstrap Median of MLE Elo | Bootstrap Median of Online Elo | Input CPKT | Output CPKT | Rank 1 win rate | Input relative CPKT | Output relative CPKT
1 | gpt-4-turbo | 1249 | 1249 | 0.01 | 0.03 | 1 | 0.01 | 0.03
2 | gpt-4-0314 | 1191 | 1189 | 0.03 | 0.06 | 0.7 | 0.042857143 | 0.085714286
3 | gpt-4-0613 | 1159 | 1164 | 0.03 | 0.06 | 0.68 | 0.044117647 | 0.088235294
4 | claude-1 | 1149 | 1148 | - | - | - | - | -
5 | mistral-medium | 1149 | 1146 | - | - | - | - | -
6 | claude-2.0 | 1131 | 1131 | 0.008 | 0.024 | 0.71 | 0.011267606 | 0.033802817
7 | mixtral-8x7b-instruct-v0.1 | 1123 | 1125 | - | - | - | - | -
8 | gemini-pro-dev-api | 1122 | 1120 | 0.00025 | 0.0005 | 0.78 | 0.000320513 | 0.000641026
9 | claude-2.1 | 1118 | 1119 | 0.008 | 0.024 | 0.76 | 0.010526316 | 0.031578947
*10 | gpt-3.5-turbo-0613 | 1116 | 1116 | 0.0015 | 0.002 | 0.77 | 0.001948052 | 0.002597403
*10 | gpt-3.5-turbo-0613 16k | 1116 | 1116 | 0.003 | 0.004 | 0.77 | 0.003896104 | 0.005194805
11 | gemini-pro | 1115 | 1117 | - | - | - | - | -
12 | yi-34b-chat | 1110 | 1110 | - | - | - | - | -
13 | claude-instant-1 | 1109 | 1109 | 0.0008 | 0.0024 | 0.74 | 0.001081081 | 0.003243243
14 | tulu-2-dpo-70b | 1106 | 1105 | - | - | - | - | -
*15 | gpt-3.5-turbo-0314 | 1105 | 1106 | 0.0015 | 0.002 | 0.77 | 0.001948052 | 0.002597403
16 | wizardlm-70b | 1104 | 1110 | - | - | - | - | -
17 | vicuna-33b | 1093 | 1095 | - | - | - | - | -
18 | starling-lm-7b-alpha | 1091 | 1092 | - | - | - | - | -
19 | llama-2-70b-chat | 1080 | 1079 | - | - | - | - | -
20 | openhermes-2.5-mistral-7b | 1078 | 1079 | - | - | - | - | -
21 | openchat-3.5 | 1077 | 1078 | - | - | - | - | -
22 | llama2-70b-steerlm-chat | 1075 | 1075 | - | - | - | - | -
*23 | gpt-3.5-turbo-1106 | 1073 | 1073 | 0.001 | 0.002 | 0.83 | 0.001204819 | 0.002409639
24 | pplx-70b-online | 1072 | 1070 | - | - | - | - | -
25 | dolphin-2.2.1-mistral-7b | 1065 | 1071 | - | - | - | - | -
26 | solar-10.7b-instruct-v1.0 | 1064 | 1064 | - | - | - | - | -
27 | wizardlm-13b | 1057 | 1055 | - | - | - | - | -
28 | zephyr-7b-beta | 1049 | 1047 | - | - | - | - | -

Relative scores are computed as cost per kilotoken (CPKT) divided by the rank 1 win rate, although it could be argued that this formula is way too generous.
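As a quick check of that formula, the sketch below recomputes the relative columns for the gpt-3.5 rows from the CPKT and win-rate figures in the table. The function name `relative_cpkt` is just an illustrative label.

```python
def relative_cpkt(cpkt: float, rank1_win_rate: float) -> float:
    """Cost per kilotoken divided by the rank 1 win rate against this model."""
    return cpkt / rank1_win_rate


# (input CPKT, output CPKT, rank 1 win rate), copied from the table above.
ROWS = {
    "gpt-3.5-turbo-0613": (0.0015, 0.002, 0.77),
    "gpt-3.5-turbo-0613 16k": (0.003, 0.004, 0.77),
    "gpt-3.5-turbo-1106": (0.001, 0.002, 0.83),
}

if __name__ == "__main__":
    for model, (cpkt_in, cpkt_out, win_rate) in ROWS.items():
        rel_in = relative_cpkt(cpkt_in, win_rate)
        rel_out = relative_cpkt(cpkt_out, win_rate)
        print(f"{model}: input {rel_in:.9f}, output {rel_out:.9f}")
```

Note that a higher rank 1 win rate (a weaker model) lowers the relative cost under this formula, which is one reason to treat it as a rough heuristic only.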

In conclusion, it seems like gpt-3.5-turbo-0613 may offer the best bang for the buck among the 3.5 series, but it obviously depends on your application.

This should be considered a back of the envelope type of situation, and not authoritative by any means.

That metric is saying that GPT-4 is some 50 points ahead of gpt-3.5 on a scale of 1000… meaning it’s 5% better… anyone who has used GPT-4 knows that it is considerably more than 5% better than 3.5. I question the point of this number.

It’s an ELO scale. GPT-4 Turbo beats gpt-3.5-turbo-0613 77% of the time, so you could ‘say’ it’s ‘better’ by a factor of ~3 (0.77 wins to 0.23 losses).

Right, but that’s not the number used for the ranking, is it?

The whole thing is based on playing them against each other and computing their relative ELOs.

So there’s no number per se, just which model wins against the others most reliably.

The upshot is, anyone from outside the discipline sees the “Bootstrap Median of MLE Elo” and comes to the conclusion that GPT-4 is only a few points ahead of anything else. Sure, if you drill down into what it all means you see the bigger picture, but you can’t tell me that is not a confusing metric for non-experts choosing an LLM to use. It’s the kind of number that gets tweets/X’s, and I’m still not sure what it’s telling me from the INSIDE!

Mmh, fair sentiment

ELO (actually Elo) is used in a lot of gaming matchmaking environments to see who is closest to whom

In chess, for example, Magnus Carlsen has a rating of 2830, while Hikaru Nakamura has a rating of 2788. Percentage-wise that’s a tiny difference, but Elo works on point gaps rather than ratios: under the standard formula a 42-point gap gives Carlsen an expected score of about 56% per game, and a gap of a few hundred points is what makes a matchup one-sided (I know very little about chess).
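For reference, here is a minimal sketch of the standard Elo expected-score formula, which turns a rating gap into an expected score (draws count as half a win). The function name is illustrative, and the leaderboard's own fitting procedure may differ in detail, so its reported head-to-head win rates need not match this formula exactly.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B on the standard 400-point logistic scale."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


if __name__ == "__main__":
    # Chess ratings quoted in the post: a 42-point gap.
    print(f"{elo_expected_score(2830, 2788):.2f}")  # about 0.56
    # Leaderboard ratings from the table: gpt-4-turbo vs gpt-3.5-turbo-0613.
    print(f"{elo_expected_score(1249, 1116):.2f}")  # about 0.68
    # A 400-point gap is what pushes the expected score past 0.9.
    print(f"{elo_expected_score(1400, 1000):.2f}")  # about 0.91
```

Only the difference between two ratings matters, so comparing them as percentages of each other (1249 vs. 1116 as "about 12% more") says nothing about how often one side wins.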

That said, most kids should conceptually understand ELO these days. Whether they played StarCraft, Counter-Strike, Fortnite or whatever, they’ll just call it MMR.

All that said and done, do you know of a straightforward, reliable and meaningful score that isn’t getting gamed (“all you need is validation data” style)?

don't look