Usage stats now available when using streaming with the Chat Completions API or Completions API

When streaming with the Chat Completions or Completions APIs, you can now request an additional chunk to be streamed at the end that contains the usage stats, such as the number of tokens generated across the entire completion. Previously, this usage data was not available when streaming.

Just set stream_options: {"include_usage": true} (API reference) in your request, and you will receive an additional final response chunk containing the usage data for your entire request/response.

Important:

  • This usage-specific chunk will have choices: [], so if you turn this feature on, you may need to update any code that accesses choices[0] to first check whether it is the usage chunk.
  • All of the normal chunks that appear earlier in the response will contain usage: null.
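
The two points above can be sketched with a small consumer loop. The chunk dicts below are hand-written stand-ins for real streamed chunks, so the shapes are illustrative rather than exhaustive:

```python
# Minimal sketch of handling the extra usage chunk safely.
def consume_stream(chunks):
    content, usage = "", None
    for chunk in chunks:
        if chunk["choices"]:  # normal content chunk: choices is non-empty
            delta = chunk["choices"][0]["delta"]
            content += delta.get("content", "") or ""
        if chunk.get("usage"):  # final usage-only chunk: choices is []
            usage = chunk["usage"]
    return content, usage

# Mock chunks mimicking the stream shown below.
mock_stream = [
    {"choices": [{"delta": {"content": "Hello"}}], "usage": None},
    {"choices": [{"delta": {}}], "usage": None},
    {"choices": [], "usage": {"prompt_tokens": 6,
                              "completion_tokens": 10,
                              "total_tokens": 16}},
]
text, usage = consume_stream(mock_stream)
```

Checking `chunk["choices"]` before indexing into it is what keeps the loop safe once the empty-choices usage chunk arrives.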

You can see an example of using this in our cookbook, and here is an illustrative example:

// Request: POST v1/chat/completions
{
  "model": "gpt-4-turbo",
  "messages": [
    {"role": "user", "content": "Hi! How are you?"}
  ],
  "stream": true,
  // NEW: stream_options param enables the usage field and chunk.
  "stream_options": {
    "include_usage": true
  }
}

// Streamed chunks in the response.
// Note that since streaming uses Server-sent Events, we recommend parsing these events with our SDK or with a library designed for Server-sent Events in your language.

{
  "id": "chatcmpl-2EHCQqsRzdOlFskNehCMu2oOMTXhSjey",
  "object": "chat.completion.chunk",
  "created": 1693600000,
  "model": "gpt-4-turbo",
  "choices": [{
    "index": 0,
    "delta": {
      "role": "assistant"
    },
    "finish_reason": null
  }],
  // NEW: since the initial request included `stream_options: {"include_usage": true}`, all streamed chunks will include `usage: null` (except the final one which will include the usage data for the entire completion)
  "usage": null
}

{
  "id": "chatcmpl-Z5gqKcSESta3pKtJz4tO8VZxw3yv9bmI",
  "object": "chat.completion.chunk",
  "created": 1693600020,
  "model": "gpt-4-turbo",
  "choices": [{
    "index": 0,
    "delta": {
      "content": "I'm doing well thanks, how are you?"
    },
    "finish_reason": null
  }],
  "usage": null
}

{
  "id": "chatcmpl-Qh9PxX34rkVI6eFgK2oL5yMuXaFb6yLJ",
  "object": "chat.completion.chunk",
  "created": 1693600040,
  "model": "gpt-4-turbo",
  "choices": [{
    "index": 0,
    "delta": {},
    "finish_reason": "stop"
  }],
  "usage": null
}

// NEW: you will now receive a chunk with the usage data as the final chunk if you set `stream_options: {"include_usage": true}` in the original request
{
  "id": "chatcmpl-3LFz2VTgjsVxv5kPI3K3e2MwJOFr6V2c",
  "object": "chat.completion.chunk",
  "created": 1693600060,
  "model": "gpt-4-turbo",
  // NEW: empty choices list in the last chunk
  "choices": [],
  "usage": {
    "prompt_tokens": 6,
    "completion_tokens": 10,
    "total_tokens": 16
  }
}

Available in the Python openai library, version >= 1.26.0

Worth noting:

  • the finish reason now arrives in the second-to-last chunk
  • the final usage chunk has no "choices" contents that you might have been parsing
  • when using library parameters, write stream_options={"include_usage": True} (capital True in Python)

Presentation:

Finish reason chunk

{
  "id": "chatcmpl-9M3...",
  "choices": [
    {
      "delta": {
        "content": null,
        "function_call": null,
        "role": null,
        "tool_calls": null
      },
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null
    }
  ],
  "created": 1715044805,
  "model": "gpt-3.5-turbo-0125",
  "object": "chat.completion.chunk",
  "system_fingerprint": null,
  "usage": null
}

Final chunk

{
  "id": "chatcmpl-9M3...",
  "choices": [],
  "created": 1715044805,
  "model": "gpt-3.5-turbo-0125",
  "object": "chat.completion.chunk",
  "system_fingerprint": null,
  "usage": {
    "completion_tokens": 11,
    "prompt_tokens": 29,
    "total_tokens": 40
  }
}
Some response-scraping logic
import json

response = client.chat.completions.with_raw_response.create(
    **your_parameter_dict)

content = ""
function_calls = []   # collected function_call delta fragments
tool_calls = []       # collected tool_calls delta fragments
finish_reason = None
usage_dict = None

for chunk in response.parse():
    print(json.dumps(chunk.model_dump(), indent=2))
    if chunk.choices:
        if not chunk.choices[0].finish_reason:
            word = chunk.choices[0].delta.content or ""
            content += word
            print(word, end="")  # your method
            if chunk.choices[0].delta.function_call:
                function_calls.append(chunk.choices[0].delta.function_call)
            if chunk.choices[0].delta.tool_calls:
                tool_calls.extend(chunk.choices[0].delta.tool_calls)
        else:
            finish_reason = chunk.choices[0].finish_reason
    if chunk.usage:
        usage_dict = chunk.usage

(the gathered chunks of tools and functions will need to be reassembled)
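
To illustrate that reassembly, here is a minimal sketch that merges streamed tool-call fragments by index. The dicts stand in for the SDK's delta objects, and the call id and argument fragments are made up for the example:

```python
# Reassemble streamed tool-call deltas: each fragment carries an index,
# optionally an id, and partial function name/arguments strings.
def merge_tool_calls(delta_lists):
    calls = {}
    for delta_list in delta_lists:
        for tc in delta_list:
            entry = calls.setdefault(
                tc["index"], {"id": "", "name": "", "arguments": ""})
            entry["id"] = tc.get("id") or entry["id"]
            fn = tc.get("function", {})
            entry["name"] += fn.get("name", "") or ""
            entry["arguments"] += fn.get("arguments", "") or ""
    return [calls[i] for i in sorted(calls)]

# Mock fragments as they might arrive across chunks.
fragments = [
    [{"index": 0, "id": "call_abc",
      "function": {"name": "get_weather", "arguments": ""}}],
    [{"index": 0, "function": {"arguments": '{"city": '}}],
    [{"index": 0, "function": {"arguments": '"Paris"}'}}],
]
calls = merge_tool_calls(fragments)
```

The key point is that the `arguments` string arrives as concatenable fragments, so only after the stream ends do you have valid JSON to parse.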


Cool stuff.

Question:

Is there a reason you did it like this, as opposed to including prompt tokens in (or before) the first chunk and completion tokens with each chunk?

The way you did it can’t be used if a generation is canceled by the client (which is quite common when intercepting or preempting model errors) :confused:


Nice feature.

Looks like it’s not working for AsyncAzureOpenAI.
Is this intended, with Azure support to be added later, or is it possibly a bug?

My usage is always None, is there anything wrong here?

response = openai_client.chat.completions.create(
    model="gpt-4-turbo-2024-04-09",
    messages=prompt,
    temperature=1,
    max_tokens=1700,
    tools=tools,
    tool_choice=tool_choice,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
    stream=True,
    stream_options={
        "include_usage": True
    }
)

for chunk in response:
    if chunk.usage:
        usage_dict = chunk.usage
usage_dict = chunk.usage

This will still be a pydantic model. You can use dict(chunk.usage) to get a plain dict, where you can then look up values by key:
output_cost = usage_dict['completion_tokens'] * model_cost / 1000
for example.
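
Putting that together, here is a small sketch of computing request cost from the usage dict. The per-1K prices are illustrative placeholders, not real pricing:

```python
# Hypothetical per-1K-token prices in USD -- substitute your model's real rates.
INPUT_PRICE_PER_1K = 0.01
OUTPUT_PRICE_PER_1K = 0.03

def request_cost(usage):
    """Cost of one request from a usage dict, pricing prompt and completion separately."""
    return (usage["prompt_tokens"] * INPUT_PRICE_PER_1K
            + usage["completion_tokens"] * OUTPUT_PRICE_PER_1K) / 1000

# Using the usage values from the final chunk shown earlier.
cost = request_cost({"prompt_tokens": 29, "completion_tokens": 11, "total_tokens": 40})
```

Pricing prompt and completion tokens separately matters because most models charge more per output token than per input token.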


(you’ll have to trust me that the stream made pokey emoji one-at-a-time…)

Azure typically lags behind a bit on releases. You can see in the latest Azure OpenAI API spec that the stream_options field is not supported yet.

We’ll have to wait for them to make a new release and update the api_version param on our side.


@nimobeeren I suppose it is the same for the Assistants API v2 on Azure OpenAI. Don’t they tell their community when they plan to support the latest OpenAI features?


There doesn’t really seem to be a lot of communication on roadmaps coming from MS. It’s basically take what you get.

If I cancel streaming data halfway through, how are the costs calculated: only for the tokens sent and the 50% of tokens actually received?


If you close a streaming connection on Chat Completions with an actual network client close, the AI generation, and billing of output, should stop shortly thereafter.

You of course will not get a stream chunk that reports the cost. The usage page used to offer per-5-minute-window information for cost discovery, but no longer.


The stream_options param doesn’t seem to be available on the Assistants createRun API (https://platform.openai.com/docs/api-reference/runs/createRun). Am I missing something? If I add that, the API returns 400 bad request.

Could you please add the same option to the Assistants API? Is that on the roadmap?

Discerning when the methods described in this announcement would be useful to you, and on which endpoints, takes a careful reading of the first sentence of the first post:

Assistants has its own usage methods that are not discussed in this topic, but are available for review in the API Reference and Documentation links in your platform account or on the sidebar of the forum.

Thanks, you’re right, I confused the two endpoints.
If anybody makes the same mistake: for the Assistants API you will receive the usage of the Run as part of the thread.run.completed and thread.run.step.completed streaming events.
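
As a rough sketch of what that looks like on the consumer side, the loop below pulls usage out of the run-completed event. The event dicts are hand-written mocks, so treat the exact field layout as an assumption to verify against the Assistants streaming docs:

```python
# Sketch: extract usage from Assistants streaming events (mock event dicts).
def usage_from_events(events):
    for event in events:
        # The completed-run event carries the run object, which includes usage.
        if event["event"] == "thread.run.completed":
            return event["data"]["usage"]
    return None

# Hypothetical event sequence; field names mirror the event types named above.
mock_events = [
    {"event": "thread.message.delta", "data": {}},
    {"event": "thread.run.completed",
     "data": {"usage": {"prompt_tokens": 50,
                        "completion_tokens": 20,
                        "total_tokens": 70}}},
]
usage = usage_from_events(mock_events)
```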