Mixing streaming chat completions with tool_calls?

Streaming responses and tool calls are both neat features on their own, but I’m noticing some weirdness when I try to use both at the same time. For one thing, it appears tool_calls are generated token by token, just like regular message content. I suppose that shouldn’t be a surprise, but it does complicate reassembling a well-formed tool_calls object for invocation by my shell script.
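For what it’s worth, here’s a minimal sketch of how the streamed deltas can be reassembled. This is based on the Chat Completions streaming format, where tool call fragments are keyed by an `index` field and the function arguments arrive split across chunks; the chunks below are simplified dicts simulating what the API sends, not the SDK’s actual objects:

```python
# Each streamed chunk carries a delta; tool calls arrive as fragments keyed
# by an "index" field, with the JSON arguments string split across chunks.
# This accumulator merges the fragments back into complete tool calls.

def accumulate_tool_calls(chunks):
    """Merge streamed tool_call deltas into a list of complete tool calls."""
    calls = {}  # index -> partially assembled tool call
    for chunk in chunks:
        for delta in chunk.get("tool_calls", []):
            idx = delta["index"]
            call = calls.setdefault(
                idx, {"id": None, "name": "", "arguments": ""}
            )
            if delta.get("id"):
                call["id"] = delta["id"]
            fn = delta.get("function", {})
            if fn.get("name"):
                call["name"] += fn["name"]
            if fn.get("arguments"):
                call["arguments"] += fn["arguments"]
    return [calls[i] for i in sorted(calls)]

# Simulated deltas, roughly what streaming emits for a single tool call:
chunks = [
    {"tool_calls": [{"index": 0, "id": "call_abc",
                     "function": {"name": "get_weather", "arguments": ""}}]},
    {"tool_calls": [{"index": 0, "function": {"arguments": '{"city": '}}]},
    {"tool_calls": [{"index": 0, "function": {"arguments": '"Paris"}'}}]},
]

calls = accumulate_tool_calls(chunks)
print(calls[0]["name"], calls[0]["arguments"])
```

Once the stream finishes (finish_reason of tool_calls), the accumulated arguments string can be JSON-parsed and handed off to whatever does the actual invocation.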

It also seems to me like tool_calls might be more inclined to produce hallucinations than the old function_call, especially while streaming. For example, I have repeatedly observed the model calling a presumably hallucinated “python” tool, then writing little code blocks to try to test out my tools or leave comments to itself about what it thinks my intentions were.

Does anyone have any tips for getting these features to play nicely together? I’ll post a follow up if I figure out any good tricks.


To be honest, I think tool calls and function calls and all that are mostly gimmicks.

That said, they can make your job a little easier. Even so, I would still recommend doing one thing at a time. If you ask for both a function call and a text response, what are you really asking of the model? “Asynchronously launch a function, and keep chatting with the user with no new information”?

If you make your tasks more straightforward, you’ll likely run into fewer instances of confusion/hallucination. It might be a good idea to split your task into a function call and a chat response and handle them separately, rather than in one big prompt. Obviously your mileage will vary depending on what you’re trying to do.