Yes, the models differ in fine-tuning and in context size (4k, 8k, 16k, 32k, 128k, for example). Turbo models are the latest iteration and have the best price-to-performance ratio, but they are limited to a 4k output size; the non-Turbo models are typically symmetric in input and output context size, but are trained to be concise with their answers.
Do you have any specific question about which model to use?
The difference between the 3.5 Turbo 4k and 16k variants is context size: the first can take 4k tokens spread over input and output, while the 16k model can handle 16k, or four times as much. Model performance also depends on your account's Tier (1–5): Tiers 3 and up should perform similarly, while Tiers 1 and 2 may be served from servers with higher latency.
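To make the shared input/output budget concrete, here is a minimal sketch of routing between the two variants based on prompt length. It assumes the tiktoken tokenizer and an arbitrary 500-token reply budget; adjust both to your own use case:

```python
import tiktoken

# Arbitrary headroom reserved for the model's reply (assumption for illustration).
REPLY_BUDGET = 500

def pick_model(prompt: str) -> str:
    """Pick the 4k or 16k variant based on how many tokens the prompt uses."""
    enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
    prompt_tokens = len(enc.encode(prompt))
    # The 4k context must fit prompt + reply; otherwise fall back to the 16k variant.
    if prompt_tokens + REPLY_BUDGET <= 4096:
        return "gpt-3.5-turbo"
    return "gpt-3.5-turbo-16k"

print(pick_model("Summarize the following meeting notes: ..."))
```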
For actual functional performance we have to look at leaderboards as an empirical indication (if that's your question, @20euai044).
Unfortunately, there isn't enough evidence to compare the 4k vs. 16k versions of the same model, but if we assume their functional performance to be the same, we get the following results:
| Rank | Model | Bootstrap Median of MLE Elo | Bootstrap Median of Online Elo | Input CPKT ($) | Output CPKT ($) | Rank-1 win rate | Input relative CPKT | Output relative CPKT |
|---|---|---|---|---|---|---|---|---|
| 1 | gpt-4-turbo | 1249 | 1249 | 0.01 | 0.03 | 1 | 0.01 | 0.03 |
| 2 | gpt-4-0314 | 1191 | 1189 | 0.03 | 0.06 | 0.7 | 0.042857143 | 0.085714286 |
| 3 | gpt-4-0613 | 1159 | 1164 | 0.03 | 0.06 | 0.68 | 0.044117647 | 0.088235294 |
| 4 | claude-1 | 1149 | 1148 | | | | | |
| 5 | mistral-medium | 1149 | 1146 | | | | | |
| 6 | claude-2.0 | 1131 | 1131 | 0.008 | 0.024 | 0.71 | 0.011267606 | 0.033802817 |
| 7 | mixtral-8x7b-instruct-v0.1 | 1123 | 1125 | | | | | |
| 8 | gemini-pro-dev-api | 1122 | 1120 | 0.00025 | 0.0005 | 0.78 | 0.000320513 | 0.000641026 |
| 9 | claude-2.1 | 1118 | 1119 | 0.008 | 0.024 | 0.76 | 0.010526316 | 0.031578947 |
| *10 | gpt-3.5-turbo-0613 | 1116 | 1116 | 0.0015 | 0.002 | 0.77 | 0.001948052 | 0.002597403 |
| *10 | gpt-3.5-turbo-0613 16k | 1116 | 1116 | 0.003 | 0.004 | 0.77 | 0.003896104 | 0.005194805 |
| 11 | gemini-pro | 1115 | 1117 | | | | | |
| 12 | yi-34b-chat | 1110 | 1110 | | | | | |
| 13 | claude-instant-1 | 1109 | 1109 | 0.0008 | 0.0024 | 0.74 | 0.001081081 | 0.003243243 |
| 14 | tulu-2-dpo-70b | 1106 | 1105 | | | | | |
| *15 | gpt-3.5-turbo-0314 | 1105 | 1106 | 0.0015 | 0.002 | 0.77 | 0.001948052 | 0.002597403 |
| 16 | wizardlm-70b | 1104 | 1110 | | | | | |
| 17 | vicuna-33b | 1093 | 1095 | | | | | |
| 18 | starling-lm-7b-alpha | 1091 | 1092 | | | | | |
| 19 | llama-2-70b-chat | 1080 | 1079 | | | | | |
| 20 | openhermes-2.5-mistral-7b | 1078 | 1079 | | | | | |
| 21 | openchat-3.5 | 1077 | 1078 | | | | | |
| 22 | llama2-70b-steerlm-chat | 1075 | 1075 | | | | | |
| *23 | gpt-3.5-turbo-1106 | 1073 | 1073 | 0.001 | 0.002 | 0.83 | 0.001204819 | 0.002409639 |
| 24 | pplx-70b-online | 1072 | 1070 | | | | | |
| 25 | dolphin-2.2.1-mistral-7b | 1065 | 1071 | | | | | |
| 26 | solar-10.7b-instruct-v1.0 | 1064 | 1064 | | | | | |
| 27 | wizardlm-13b | 1057 | 1055 | | | | | |
| 28 | zephyr-7b-beta | 1049 | 1047 | | | | | |
Relative scores are computed as cost per kilotoken divided by rank-1 win rate (a quick sketch of the computation follows below), although it could be argued that this formula is way too generous.
In conclusion, it seems like gpt-3.5-turbo-0613 may offer the best bang for the buck among the 3.5 series, but it obviously depends on your application.
This should be considered a back-of-the-envelope calculation, not authoritative by any means.
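For transparency, here is a small sketch reproducing the relative-CPKT columns from the table above (prices and win rates are taken straight from the table; the dictionary layout is just for illustration):

```python
# Relative score = cost per kilotoken / rank-1 win rate.
models = {
    # model: (input CPKT, output CPKT, rank-1 win rate)
    "gpt-4-turbo":        (0.01,   0.03,  1.00),
    "gpt-4-0314":         (0.03,   0.06,  0.70),
    "gpt-3.5-turbo-0613": (0.0015, 0.002, 0.77),
    "gpt-3.5-turbo-1106": (0.001,  0.002, 0.83),
}

for name, (in_cpkt, out_cpkt, win_rate) in models.items():
    rel_in = in_cpkt / win_rate
    rel_out = out_cpkt / win_rate
    print(f"{name:22s} in={rel_in:.9f} out={rel_out:.9f}")
```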
That metric says GPT-4 is some 50 points ahead of GPT-3.5 on a scale of 1000… meaning it's 5% better… Anyone who has used GPT-4 knows it is considerably more than 5% better than 3.5. I question the point of this number.
The upshot is, anyone from outside the discipline sees the "Bootstrap Median of MLE Elo" and concludes that GPT-4 is only a few points ahead of anything else. Sure, if you drill down into what it all means you see the bigger picture, but you can't tell me that's not a confusing metric for non-experts choosing an LLM. It's the kind of number that gets tweeted/X'd, and I'm still not sure what it's telling me from the INSIDE!
ELO (actually "Elo") is used in a lot of gaming matchmaking environments to gauge who is closest in skill to whom.
In chess, for example, Magnus Carlsen has a rating of 2830, while Hikaru Nakamura has a rating of 2788. While percentage-wise that's a tiny difference, I imagine Carlsen will wipe the floor with Nakamura 99% of the time in standard chess (I know very little of chess).
That said, most kids should conceptually understand Elo these days. Whether they played Starcraft, Counterstrike, Fortnite or whatever, they'll just call it MMR.
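For what it's worth, the textbook Elo expected-score formula turns a rating gap into a head-to-head win probability. A quick sketch using the numbers from the table above (this is the generic formula, not necessarily how the Arena's bootstrap medians are computed):

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

# ~50-point gap (roughly gpt-4-0613 vs gpt-3.5-turbo-0613 in the table)
print(elo_expected_score(1159, 1116))  # ~0.56
# ~130-point gap (gpt-4-turbo vs gpt-3.5-turbo-0613)
print(elo_expected_score(1249, 1116))  # ~0.68
```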
All that said and done, do you know of a straightforward, reliable, and meaningful score that isn't getting gamed ("all you need is validation data" style)?