I expect that I am receiving output tokens at the rate the model actually produces them, at a similar output rate to other language calls, but after an initial delay that is longer: the “function” delay.
This is due to functions and the API parser needing the new structured function schema to be available. The concern has spawned several forum topics, as structured outputs underperform due to the additional computation.
From its announcement:
Note: the first request you make with any schema will have additional latency as our API processes the schema, but subsequent requests with the same schema will not have additional latency.
It would seem there is an additional precomputation burden on ANY call, one that is not explicitly mentioned but has persisted since introduction.
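One way to see this for yourself is to time the first streamed token with and without a strict schema attached. A minimal sketch using the OpenAI Python SDK follows; the model name, prompt, and schema are placeholders, not a recommendation:

```python
# Minimal sketch: compare time-to-first-token (TTFT) with and without a
# strict JSON schema attached. Model name and schema are placeholders.
import time
from openai import OpenAI

client = OpenAI()

schema = {
    "name": "weather_report",
    "strict": True,
    "schema": {
        "type": "object",
        "properties": {"city": {"type": "string"}, "temp_c": {"type": "number"}},
        "required": ["city", "temp_c"],
        "additionalProperties": False,
    },
}

def time_to_first_token(response_format=None) -> float | None:
    kwargs = {"response_format": response_format} if response_format else {}
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[{"role": "user", "content": "Report the weather in Paris as JSON."}],
        stream=True,
        **kwargs,
    )
    for chunk in stream:
        # Stop at the first chunk that carries actual content tokens.
        if chunk.choices and chunk.choices[0].delta.content:
            return time.perf_counter() - start
    return None

print("plain:        ", time_to_first_token())
print("strict schema:", time_to_first_token(
    {"type": "json_schema", "json_schema": schema}))
```

Running each variant several times should separate the one-time schema-compilation hit from any fixed per-call overhead.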
One can imagine that even for repeat requests, the “hash input function object tokens” → “validate against model and schema to see if strict” → “search artifact database” → “return cache hit results” → “load tokenizer grammar” pipeline has overhead that is more dramatic on smaller requests.
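To make that speculation concrete, here is a toy sketch of what such a cache path might look like. Every name and step here is invented for illustration and is not confirmed OpenAI internals:

```python
# Toy illustration of the hypothesized schema-artifact cache path.
# All names and steps are speculative, not confirmed OpenAI internals.
import hashlib
import json

artifact_cache: dict[str, bytes] = {}  # schema hash -> compiled grammar

def compile_grammar(schema: dict) -> bytes:
    # Stand-in for the expensive step: building a token-level
    # constrained-decoding grammar from the JSON schema.
    return json.dumps(schema).encode()

def load_grammar(function_schema: dict, strict: bool) -> bytes:
    # 1. Hash the input function object tokens.
    key = hashlib.sha256(
        json.dumps(function_schema, sort_keys=True).encode()
    ).hexdigest()
    # 2. Validate against model and schema to see if strict mode applies.
    if strict and function_schema.get("additionalProperties") is not False:
        raise ValueError("schema is not strict-compatible")
    # 3. Search the artifact database for a prior compile.
    cached = artifact_cache.get(key)
    if cached is not None:
        # 4. A cache hit still pays for the lookup and deserialization...
        return cached
    # ...while a miss pays the full compilation, i.e. the first-request delay.
    artifact = compile_grammar(function_schema)
    artifact_cache[key] = artifact
    return artifact

# 5. The server then loads this tokenizer grammar before the first token can
#    stream, which is where a fixed per-request overhead would land, and why
#    it is proportionally worse on small, fast requests.
```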
Hopefully there are big brains on the task of shaving off another second or two.