The first thing to note is that this model uses a different token encoder. If you were counting tokens, you'll have to grab the tiktoken update released 15 minutes ago.
MODEL_PREFIX_TO_ENCODING: dict[str, str] = {
    # chat
    "gpt-4o-": "o200k_base",  # e.g., gpt-4o-2024-05-13
    "gpt-4-": "cl100k_base",  # e.g., gpt-4-0314, etc., plus gpt-4-32k
I found that a response capped at max_tokens=256 (and reported as 256 in usage) is counted at 258-268 tokens when my script re-tokenizes it with cl100k_base, so the new encoder is a hair more efficient on English.
However, 512 gpt-4o tokens of Japanese come out to 702 tokens under cl100k (GPT-4).
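If you want to reproduce that kind of comparison yourself, here's a rough sketch (not my exact script; the sample text is a placeholder you'd swap for real model output):

```python
import tiktoken

o200k = tiktoken.get_encoding("o200k_base")    # gpt-4o
cl100k = tiktoken.get_encoding("cl100k_base")  # gpt-4 / gpt-3.5-turbo

sample = "Replace this with any English or Japanese model output."
print("o200k_base :", len(o200k.encode(sample)))
print("cl100k_base:", len(cl100k.encode(sample)))
```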
Day 0 speed (110 tps is about where gpt-3.5-turbo-instruct maxed out hours after its release):
For 3 trials of gpt-4-0125-preview @ 2024-05-13 11:03AM:
Stat | Minimum | Maximum | Average
---|---|---|---
stream rate (tok/s) | 24.4 | 31.8 | 27.267
latency (s) | 0.78 | 1.103 | 0.900
total response (s) | 8.848 | 11.2203 | 10.373
total rate (tok/s) | 22.816 | 28.933 | 24.972
response tokens | 256 | 256 | 256.000
For 3 trials of gpt-4o @ 2024-05-13 11:03AM:
Stat | Minimum | Maximum | Average
---|---|---|---
stream rate (tok/s) | 107.6 | 112.9 | 110.200
latency (s) | 0.3701 | 0.525 | 0.434
total response (s) | 2.6818 | 2.8598 | 2.773
total rate (tok/s) | 90.216 | 96.204 | 93.342
response tokens | 258 | 260 | 258.667
The response also arrives in about 5% fewer streaming chunks.
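For anyone who wants to run the same kind of timing themselves, here's a rough sketch using the openai v1 Python SDK; the prompt, the time_stream helper name, and the exact timing points are placeholders, not the harness that produced the tables above.

```python
import time
import tiktoken
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def time_stream(model: str, prompt: str) -> dict:
    enc = tiktoken.encoding_for_model(model)   # cl100k_base or o200k_base
    start = time.time()
    first = None
    chunks = 0
    parts = []
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
        stream=True,
    )
    for chunk in stream:
        chunks += 1
        if chunk.choices and chunk.choices[0].delta.content:
            if first is None:
                first = time.time()            # latency: time to first content
            parts.append(chunk.choices[0].delta.content)
    end = time.time()
    tokens = len(enc.encode("".join(parts)))   # count received text client-side
    return {
        "latency (s)": round(first - start, 3),
        "total response (s)": round(end - start, 3),
        "stream rate (tok/s)": round(tokens / (end - first), 1),
        "total rate (tok/s)": round(tokens / (end - start), 1),
        "response tokens": tokens,
        "chunks": chunks,
    }

for model in ("gpt-4-0125-preview", "gpt-4o"):
    for _ in range(3):
        print(model, time_stream(model, "Write a short essay about rivers."))
```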