Guide on converting ChatCompletionChunk to content parts?

So we can do this easily with the Responses API, but I’m also using models from other LLM providers and accessing them through the OpenAI SDK, so I need to use chat.completions as well. Is there any guide on this?

I do have a working version, but I’d like to see how other people are doing it.


Chat Completions does not emit typed application “events” on top of SSE the way Responses does.

It simply emits a stream of deltas: text to be concatenated (or displayed as it arrives) into message.content, and tool calls to be reassembled from the partial objects. So what you write is more of a “collector” than an event handler. It is not a “handle any stream from the SDK” abstraction, as that would have its own challenges. A minimal sketch of such a collector follows.
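A bare-bones sketch, assuming the official openai Python SDK (v1.x) pointed at any Chat Completions-compatible endpoint; `collect_stream` and the model name are placeholders, and it only reads `choices[0]`:

```python
from openai import OpenAI

client = OpenAI()  # base_url / api_key can target any compatible provider

def collect_stream(stream):
    """Accumulate streamed deltas back into full text + tool calls."""
    content_parts = []
    tool_calls = {}  # tool-call index -> {"id", "name", "arguments"}

    for chunk in stream:
        if not chunk.choices:
            continue  # e.g. a trailing usage-only chunk
        delta = chunk.choices[0].delta
        if delta.content:
            content_parts.append(delta.content)
        # content and tool_calls are NOT mutually exclusive: check both
        for tc in delta.tool_calls or []:
            slot = tool_calls.setdefault(
                tc.index, {"id": None, "name": None, "arguments": ""}
            )
            if tc.id:
                slot["id"] = tc.id
            if tc.function and tc.function.name:
                slot["name"] = tc.function.name
            if tc.function and tc.function.arguments:
                slot["arguments"] += tc.function.arguments  # JSON fragments

    return "".join(content_parts), [tool_calls[i] for i in sorted(tool_calls)]

stream = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
)
text, calls = collect_stream(stream)
```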

I don’t have a fuller, non-proprietary example to readily share. For feature resilience, you should target Gemini’s Chat Completions compatibility layer, as they “solved” encrypted reasoning and reasoning in tool calls, along with passing thinking summaries and incremental token usage reports. OpenAI’s lock-in attempt of gating model features behind Responses will ultimately be their lock-out if they don’t follow.

Example shape of SSE chunk 0:

data: {"choices":[{"delta":{"content":"Sure!","extra_content":{"google":{"thought_signature":"CtkeAdHtim9..."}},"role":"assistant"},"index":0}],"created":1764000000,"id":"1315156","model":"gemini-x","object":"chat.completion.chunk","usage":{"completion_tokens":64,"prompt_tokens":180,"total_tokens":1180}}

Tip: “content” and “tool_calls” are not mutually exclusive outputs; the AI may produce both, and both need to be acted on.

Yeah haha, it’s very annoying that content and tool calls can be in the same chunk, though some LLM providers don’t do it.

Rather, what I meant to imply is that an unaware programmer might assume, “This has text content, so I don’t need to parse for tool_calls,” whereas the AI has been able to emit both (the leading text is now called a preamble) for over two years.
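To make that concrete, here is the shape in miniature (plain dicts standing in for the SDK objects, with invented values):

```python
# One streamed delta can carry a text preamble AND the start of a tool call.
delta = {
    "role": "assistant",
    "content": "Let me check the weather for you.",  # the preamble
    "tool_calls": [{
        "index": 0,
        "id": "call_123",
        "type": "function",
        "function": {"name": "get_weather", "arguments": ""},
    }],
}
# Wrong:  if delta["content"]: handle_text(delta["content"]); return
#         (silently drops the tool call)
# Right:  inspect content and tool_calls independently, as the collector does.
```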

What differs is granularity: some providers stream delta chunks that are at most a single token or character each, while others don’t deliberately emit a flood of tiny packets with maximal bandwidth overhead, and instead stream whole sentences or whole tool calls, sized to the HTTP streaming chunk or maximum packet size. (Worse: Responses has figured out how to send the same content multiple times and even echo back inputs and instruction messages.)
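A correct collector is invariant to that granularity, since it only appends text and concatenates argument fragments in index order. A trivial property check:

```python
# Token-sized fragments vs. an entire tool call in one chunk: both reduce to
# the same arguments string when the collector simply concatenates in order.
fine_grained = ['{"ci', 'ty": ', '"Par', 'is"}']  # one fragment per chunk
coarse       = ['{"city": "Paris"}']              # whole call in one chunk
assert "".join(fine_grained) == "".join(coarse)
```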

Yep, makes sense. I think I have something that works; I need to do more testing. You’re right… I’ve noticed Gemini and Mistral sometimes send the entire content along with the entire tool info in a single chunk.

I didn’t know Responses does that :smiling_face_with_tear:. For Claude, when you add thinking before tool_use, it sometimes echoes the reasoning back as a text part.

It’s also strange that thought_signature exists even when no reasoning effort was applied, but maybe Gemini uses a default setting if it’s a reasoning model. From what I’m seeing, it looks like the 2.5 models are the only ones where reasoning can be turned off.
