I expect that I am receiving output tokens at the rate the model actually produces them, at a similar output rate to other language calls, but after an initial delay that is longer: the “function” delay.
This is due to functions and the API parser needing the new structured function schema to be available. The concern has spawned several forum topics, as structured outputs underperform due to the additional computation.
From its announcement:
Note: the first request you make with any schema will have additional latency as our API processes the schema, but subsequent requests with the same schema will not have additional latency.
It would seem there is an additional precomputation burden on ANY call, one that is not explicitly mentioned but has persisted since introduction.
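One way to see this for yourself is to time the first streamed token with and without a strict schema attached. A minimal sketch using the OpenAI Python SDK follows; the model name, prompt, and schema are placeholders, not a recommendation:

```python
# Minimal sketch: compare time-to-first-token (TTFT) with and without a
# strict JSON schema attached. Model name and schema are placeholders.
import time
from openai import OpenAI

client = OpenAI()

schema = {
    "name": "weather_report",
    "strict": True,
    "schema": {
        "type": "object",
        "properties": {"city": {"type": "string"}, "temp_c": {"type": "number"}},
        "required": ["city", "temp_c"],
        "additionalProperties": False,
    },
}

def time_to_first_token(response_format=None) -> float | None:
    kwargs = {"response_format": response_format} if response_format else {}
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[{"role": "user", "content": "Report the weather in Paris as JSON."}],
        stream=True,
        **kwargs,
    )
    for chunk in stream:
        # Stop at the first chunk that carries actual content tokens.
        if chunk.choices and chunk.choices[0].delta.content:
            return time.perf_counter() - start
    return None

print("plain:        ", time_to_first_token())
print("strict schema:", time_to_first_token(
    {"type": "json_schema", "json_schema": schema}))
```

Running each variant several times should separate the one-time schema-compilation hit from any fixed per-call overhead.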
One can imagine that even for repeat requests, the “hash input function object tokens” → “validate against model and schema to see if strict” → “search artifact database” → “return cache hit results” → “load tokenizer grammar” pipeline has overhead that is more dramatic on smaller requests.
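To make that speculation concrete, here is a toy sketch of what such a cache path might look like. Every name and step here is invented for illustration and is not confirmed OpenAI internals:

```python
# Toy illustration of the hypothesized schema-artifact cache path.
# All names and steps are speculative, not confirmed OpenAI internals.
import hashlib
import json

artifact_cache: dict[str, bytes] = {}  # schema hash -> compiled grammar

def compile_grammar(schema: dict) -> bytes:
    # Stand-in for the expensive step: building a token-level
    # constrained-decoding grammar from the JSON schema.
    return json.dumps(schema).encode()

def load_grammar(function_schema: dict, strict: bool) -> bytes:
    # 1. Hash the input function object tokens.
    key = hashlib.sha256(
        json.dumps(function_schema, sort_keys=True).encode()
    ).hexdigest()
    # 2. Validate against model and schema to see if strict mode applies.
    if strict and function_schema.get("additionalProperties") is not False:
        raise ValueError("schema is not strict-compatible")
    # 3. Search the artifact database for a prior compile.
    cached = artifact_cache.get(key)
    if cached is not None:
        # 4. A cache hit still pays for the lookup and deserialization...
        return cached
    # ...while a miss pays the full compilation, i.e. the first-request delay.
    artifact = compile_grammar(function_schema)
    artifact_cache[key] = artifact
    return artifact

# 5. The server then loads this tokenizer grammar before the first token can
#    stream, which is where a fixed per-request overhead would land, and why
#    it is proportionally worse on small, fast requests.
```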
Hopefully there are big brains on the task of shaving off another second or two.