Fantastic GPT-4o, but... where are the docs?

The first thing is that this model uses a different token encoder. If you count tokens yourself, you’ll have to grab the tiktoken release from 15 minutes ago.

MODEL_PREFIX_TO_ENCODING: dict[str, str] = {
    # chat
    "gpt-4o-": "o200k_base",  # e.g., gpt-4o-2024-05-13
    "gpt-4-": "cl100k_base",  # e.g., gpt-4-0314, etc., plus gpt-4-32k
    # ... (remaining entries omitted from this excerpt)
}
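
To make the excerpt concrete, here is a minimal sketch of how a prefix table like this can be used to pick an encoding name. The helper `encoding_name_for` is hypothetical and written only for illustration; it is not tiktoken's own function, and the table holds just the two entries quoted above.

```python
from typing import Optional

# Only the two entries quoted in this post; the real table has more.
MODEL_PREFIX_TO_ENCODING: dict[str, str] = {
    "gpt-4o-": "o200k_base",
    "gpt-4-": "cl100k_base",
}


def encoding_name_for(model: str) -> Optional[str]:
    """Hypothetical helper: return the encoding of the first matching prefix."""
    for prefix, encoding in MODEL_PREFIX_TO_ENCODING.items():
        if model.startswith(prefix):
            return encoding
    return None  # no prefix matched


print(encoding_name_for("gpt-4o-2024-05-13"))  # o200k_base
print(encoding_name_for("gpt-4-0314"))         # cl100k_base
```

Note that a bare name such as "gpt-4o" matches neither prefix in this sketch, because both prefixes end in a hyphen. As far as I can tell, tiktoken handles bare names through a separate exact-name table.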

I found that a response with max_tokens and reported usage of 256 is counted as 258-268 tokens by cl100k_base in my script, so the new encoder is a hair more efficient on English.

However, 512 gpt-4o tokens of Japanese output come to 702 tokens when counted with cl100k_base (GPT-4).
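
The arithmetic behind those two comparisons, as a quick sketch. The token counts are the ones reported in this post; the variable and function names are made up for illustration.

```python
# Figures reported in this post; the cl100k_base counts come from my own script.
english_o200k = 256
english_cl100k_range = (258, 268)
japanese_o200k = 512
japanese_cl100k = 702


def extra_tokens_pct(cl100k: int, o200k: int) -> float:
    """Percent more tokens cl100k_base needs than o200k_base for the same text."""
    return (cl100k / o200k - 1) * 100


low, high = (extra_tokens_pct(n, english_o200k) for n in english_cl100k_range)
japanese = extra_tokens_pct(japanese_cl100k, japanese_o200k)

print(f"English:  cl100k_base needs {low:.1f}% to {high:.1f}% more tokens")
print(f"Japanese: cl100k_base needs {japanese:.1f}% more tokens")
# English:  cl100k_base needs 0.8% to 4.7% more tokens
# Japanese: cl100k_base needs 37.1% more tokens
```

So the saving is marginal for English but substantial for Japanese, at least on this one sample.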

Day 0 speed (110 tps is about where gpt-3.5-turbo-instruct maxed out hours after release):

For 3 trials of gpt-4-0125-preview @ 2024-05-13 11:03AM:

Stat                    Minimum   Maximum   Average
stream rate (tokens/s)  24.4      31.8      27.267
latency (s)             0.78      1.103     0.900
total response (s)      8.848     11.2203   10.373
total rate (tokens/s)   22.816    28.933    24.972
response tokens         256       256       256.000

For 3 trials of gpt-4o @ 2024-05-13 11:03AM:

Stat                    Minimum   Maximum   Average
stream rate (tokens/s)  107.6     112.9     110.200
latency (s)             0.3701    0.525     0.434
total response (s)      2.6818    2.8598    2.773
total rate (tokens/s)   90.216    96.204    93.342
response tokens         258       260       258.667
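
For a side-by-side view, here is a small sketch that turns the two sets of averages into ratios. The numbers are the averages from the tables in this post; the dictionary keys are just labels chosen for this example, and with only 3 trials per model the ratios are rough.

```python
# Averages copied from the two benchmark tables in this post (3 trials each).
gpt4_0125 = {"stream_rate": 27.267, "latency_s": 0.900, "total_rate": 24.972}
gpt4o = {"stream_rate": 110.200, "latency_s": 0.434, "total_rate": 93.342}

# Higher is better for rates, lower is better for latency,
# so each ratio is arranged so that a value above 1 favors gpt-4o.
stream_speedup = gpt4o["stream_rate"] / gpt4_0125["stream_rate"]
total_speedup = gpt4o["total_rate"] / gpt4_0125["total_rate"]
latency_improvement = gpt4_0125["latency_s"] / gpt4o["latency_s"]

print(f"stream rate: {stream_speedup:.1f}x faster")
print(f"total rate:  {total_speedup:.1f}x faster")
print(f"latency:     {latency_improvement:.1f}x lower")
# stream rate: 4.0x faster
# total rate:  3.7x faster
# latency:     2.1x lower
```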

The response arrives in about 5% fewer streaming chunks.