Usage stats now available when using streaming with the Chat Completions API or Completions API

When streaming with the Chat Completions or Completions APIs, you can now request an additional chunk to be streamed at the end that contains the usage stats, such as the number of tokens generated across the entire completion. Previously, this usage data was not available when streaming.

Just set stream_options: {"include_usage": true} (API reference) in your request, and you will receive an additional final response chunk containing the usage data for your entire request/response.

Important:

  • This usage-specific chunk will have choices: [], so if you turn this feature on, you may need to update any code that accesses choices[0] to first check whether it is the usage chunk.
  • All of the normal chunks that appear earlier in the response will contain usage: null.
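
The two points above can be sketched with a small consumer loop. The chunk dicts below are hand-written stand-ins for real streamed chunks, so the shapes are illustrative rather than exhaustive:

```python
# Minimal sketch of handling the extra usage chunk safely.
def consume_stream(chunks):
    content, usage = "", None
    for chunk in chunks:
        if chunk["choices"]:  # normal content chunk: choices is non-empty
            delta = chunk["choices"][0]["delta"]
            content += delta.get("content", "") or ""
        if chunk.get("usage"):  # final usage-only chunk: choices is []
            usage = chunk["usage"]
    return content, usage

# Mock chunks mimicking the stream shown below.
mock_stream = [
    {"choices": [{"delta": {"content": "Hello"}}], "usage": None},
    {"choices": [{"delta": {}}], "usage": None},
    {"choices": [], "usage": {"prompt_tokens": 6,
                              "completion_tokens": 10,
                              "total_tokens": 16}},
]
text, usage = consume_stream(mock_stream)
```

Checking `chunk["choices"]` before indexing into it is what keeps the loop safe once the empty-choices usage chunk arrives.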

You can see an example of using this in our cookbook, and here is an illustrative example:

// Request: POST v1/chat/completions
{
  "model": "gpt-4-turbo",
  "messages": [
    {"role": "user", "content": "Hi! How are you?"}
  ],
  "stream": true,
  // NEW: stream_options param enables the usage field and chunk.
  "stream_options": {
    "include_usage": true
  }
}

// Streamed chunks in the response.
// Note that since streaming uses Server-sent Events, we recommend parsing these events with our SDK or with a library designed for Server-sent Events in your language.

{
  "id": "chatcmpl-2EHCQqsRzdOlFskNehCMu2oOMTXhSjey",
  "object": "chat.completion.chunk",
  "created": 1693600000,
  "model": "gpt-4-turbo",
  "choices": [{
    "index": 0,
    "delta": {
      "role": "assistant"
    },
    "finish_reason": null
  }],
  // NEW: since the initial request included `stream_options: {"include_usage": true}`, all streamed chunks will include `usage: null` (except the final one which will include the usage data for the entire completion)
  "usage": null
}

{
  "id": "chatcmpl-Z5gqKcSESta3pKtJz4tO8VZxw3yv9bmI",
  "object": "chat.completion.chunk",
  "created": 1693600020,
  "model": "gpt-4-turbo",
  "choices": [{
    "index": 0,
    "delta": {
      "content": "I'm doing well thanks, how are you?"
    },
    "finish_reason": null
  }],
  "usage": null
}

{
  "id": "chatcmpl-Qh9PxX34rkVI6eFgK2oL5yMuXaFb6yLJ",
  "object": "chat.completion.chunk",
  "created": 1693600040,
  "model": "gpt-4-turbo",
  "choices": [{
    "index": 0,
    "delta": {},
    "finish_reason": "stop"
  }],
  "usage": null
}

// NEW: you will now receive a chunk with the usage data as the final chunk if you set `stream_options: {"include_usage": true}` in the original request
{
  "id": "chatcmpl-3LFz2VTgjsVxv5kPI3K3e2MwJOFr6V2c",
  "object": "chat.completion.chunk",
  "created": 1693600060,
  "model": "gpt-4-turbo",
  // NEW: empty choices list in the last chunk
  "choices": [],
  "usage": {
    "prompt_tokens": 6,
    "completion_tokens": 10,
    "total_tokens": 16
  }
}

Available in the Python openai library, version >= 1.26.0

Worth noting:

  • the finish reason now arrives in the second-to-last chunk
  • the final usage chunk has no "choices" contents that you might have been parsing
  • when using library parameters, write stream_options={"include_usage": True} (capital True in Python)

Presentation:

Finish reason chunk

{
  "id": "chatcmpl-9M3...",
  "choices": [
    {
      "delta": {
        "content": null,
        "function_call": null,
        "role": null,
        "tool_calls": null
      },
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null
    }
  ],
  "created": 1715044805,
  "model": "gpt-3.5-turbo-0125",
  "object": "chat.completion.chunk",
  "system_fingerprint": null,
  "usage": null
}

Final chunk

{
  "id": "chatcmpl-9M3...",
  "choices": [],
  "created": 1715044805,
  "model": "gpt-3.5-turbo-0125",
  "object": "chat.completion.chunk",
  "system_fingerprint": null,
  "usage": {
    "completion_tokens": 11,
    "prompt_tokens": 29,
    "total_tokens": 40
  }
}
Some response-scraping logic
import json

response = client.chat.completions.with_raw_response.create(
    **your_parameter_dict)

content = ""
function_calls = []   # collected function_call delta fragments
tool_calls = []       # collected tool_calls delta fragments
finish_reason = None
usage_dict = None

for chunk in response.parse():
    print(json.dumps(chunk.model_dump(), indent=2))
    if chunk.choices:
        if not chunk.choices[0].finish_reason:
            word = chunk.choices[0].delta.content or ""
            content += word
            print(word, end="")  # your method
            if chunk.choices[0].delta.function_call:
                function_calls.append(chunk.choices[0].delta.function_call)
            if chunk.choices[0].delta.tool_calls:
                tool_calls.extend(chunk.choices[0].delta.tool_calls)
        else:
            finish_reason = chunk.choices[0].finish_reason
    if chunk.usage:
        usage_dict = chunk.usage

(the gathered chunks of tools and functions will need to be reassembled)
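
To illustrate that reassembly, here is a minimal sketch that merges streamed tool-call fragments by index. The dicts stand in for the SDK's delta objects, and the call id and argument fragments are made up for the example:

```python
# Reassemble streamed tool-call deltas: each fragment carries an index,
# optionally an id, and partial function name/arguments strings.
def merge_tool_calls(delta_lists):
    calls = {}
    for delta_list in delta_lists:
        for tc in delta_list:
            entry = calls.setdefault(
                tc["index"], {"id": "", "name": "", "arguments": ""})
            entry["id"] = tc.get("id") or entry["id"]
            fn = tc.get("function", {})
            entry["name"] += fn.get("name", "") or ""
            entry["arguments"] += fn.get("arguments", "") or ""
    return [calls[i] for i in sorted(calls)]

# Mock fragments as they might arrive across chunks.
fragments = [
    [{"index": 0, "id": "call_abc",
      "function": {"name": "get_weather", "arguments": ""}}],
    [{"index": 0, "function": {"arguments": '{"city": '}}],
    [{"index": 0, "function": {"arguments": '"Paris"}'}}],
]
calls = merge_tool_calls(fragments)
```

The key point is that the `arguments` string arrives as concatenable fragments, so only after the stream ends do you have valid JSON to parse.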


Cool stuff.

Question:

Is there a reason you did it like this, as opposed to including prompt tokens in (or before) the first chunk and completion tokens with each chunk?

The way you did it can’t be used if a generation is canceled by the client (which is quite common when intercepting or preempting model errors) :confused:


Nice feature.

Looks like it’s not working for AsyncAzureOpenAI.
Is this intended, with Azure support to be added later, or is it possibly a bug?

My usage is always None, is there anything wrong here?

response = openai_client.chat.completions.create(
    model="gpt-4-turbo-2024-04-09",
    messages=prompt,
    temperature=1,
    max_tokens=1700,
    tools=tools,
    tool_choice=tool_choice,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
    stream=True,
    stream_options={
        "include_usage": True
    }
)

for chunk in response:
    if chunk.usage:
        usage_dict = chunk.usage
usage_dict = chunk.usage

This will still be a pydantic model. You can use dict(chunk.usage) to get a plain dict, where you can then look up values by key:
output_cost = usage_dict['completion_tokens'] * model_cost / 1000
for example.
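
Putting that together, here is a small sketch of computing request cost from the usage dict. The per-1K prices are illustrative placeholders, not real pricing:

```python
# Hypothetical per-1K-token prices in USD -- substitute your model's real rates.
INPUT_PRICE_PER_1K = 0.01
OUTPUT_PRICE_PER_1K = 0.03

def request_cost(usage):
    """Cost of one request from a usage dict, pricing prompt and completion separately."""
    return (usage["prompt_tokens"] * INPUT_PRICE_PER_1K
            + usage["completion_tokens"] * OUTPUT_PRICE_PER_1K) / 1000

# Using the usage values from the final chunk shown earlier.
cost = request_cost({"prompt_tokens": 29, "completion_tokens": 11, "total_tokens": 40})
```

Pricing prompt and completion tokens separately matters because most models charge more per output token than per input token.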


(you’ll have to trust me that the stream made pokey emoji one-at-a-time…)

Azure typically lags behind a bit on releases. You can see in the latest Azure OpenAI API spec that the stream_options field is not supported yet.

We’ll have to wait for them to make a new release and update the api_version param on our side.


@nimobeeren I suppose it is the same for the Assistants API v2 on Azure OpenAI. Don’t they tell their community when they plan to support the latest OpenAI features?


There doesn’t really seem to be a lot of communication on roadmaps coming from MS. It’s basically take what you get.

If I cancel streaming data halfway through, how are the costs calculated: only for the tokens sent and the 50% of tokens actually received?


If you close a streaming connection on Chat Completions with an actual network client close, the AI generation, and billing of output, should stop shortly thereafter.

You of course will not get a stream chunk that reports the cost. The usage page used to offer per-5-minute-window information for cost discovery, but no longer.


The stream_options param doesn’t seem to be available on the Assistants createRun API (https://platform.openai.com/docs/api-reference/runs/createRun). Am I missing something? If I add that, the API returns 400 bad request.

Could you please add the same option to the Assistants API? Is that on the roadmap?

Discerning when the methods described in this announcement would be useful to you, and on which endpoints, takes a careful reading of the first sentence of the first post:

Assistants has its own usage methods that are not discussed in this topic, but are available for review in the API Reference and Documentation links in your platform account or on the sidebar of the forum.

Thanks, you’re right, I confused the two endpoints.
If anybody makes the same mistake: for the Assistants API you will receive the usage of the Run as part of the thread.run.completed and thread.run.step.completed streaming events.
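
As a rough sketch of what that looks like on the consumer side, the loop below pulls usage out of the run-completed event. The event dicts are hand-written mocks, so treat the exact field layout as an assumption to verify against the Assistants streaming docs:

```python
# Sketch: extract usage from Assistants streaming events (mock event dicts).
def usage_from_events(events):
    for event in events:
        # The completed-run event carries the run object, which includes usage.
        if event["event"] == "thread.run.completed":
            return event["data"]["usage"]
    return None

# Hypothetical event sequence; field names mirror the event types named above.
mock_events = [
    {"event": "thread.message.delta", "data": {}},
    {"event": "thread.run.completed",
     "data": {"usage": {"prompt_tokens": 50,
                        "completion_tokens": 20,
                        "total_tokens": 70}}},
]
usage = usage_from_events(mock_events)
```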