Why is there no usage object returned with a streaming API call?

There is no usage object returned when a chat.completions API request is made with streaming (the response arrives as chat.completion.chunk objects, which carry a delta and no usage field). For comparison, here is the usage object included in a regular, non-streamed response:

{
  "id": "chatcmpl-123",
  "object": "chat.completion",
  "created": 1677652288,
  "model": "gpt-3.5-turbo-0613",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "\n\nHello there, how may I assist you today?",
    },
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 9,
    "completion_tokens": 12,
    "total_tokens": 21
  }
}

What is the reason this is missing from streaming API calls? Or am I missing something, or doing something wrong, that keeps me from receiving it?

Usage stats are not included when streaming, I think mostly because of the difficulty of knowing when a stream might be terminated from the other side: at what point do you send the usage stats? You can use tiktoken to count the tokens in the response deltas.
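A minimal sketch of that, assuming the 1.x openai Python SDK (the model and prompt are just placeholders, and this only counts the completion side, not the prompt):

import tiktoken
from openai import OpenAI

client = OpenAI()
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")   # cl100k_base

stream = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hello there"}],
    stream=True,
)

completion_text = ""
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content:                       # deltas carry fragments of the reply
        completion_text += delta.content

# no usage object arrives with the stream, so count the tokens ourselves
completion_tokens = len(encoding.encode(completion_text))
print(completion_tokens)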

Yes, I am aware of tiktoken and other token counter libs. We are using Node.js, so I wasn't able to use tiktoken since it is not available in Node.js, but I used the gpt-encoder lib instead. Thanks. I still think it would be useful if the API returned usage at the end, when the stream completes with finish_reason = stop.

1 Like

Ahh, you might find this of use

2 Likes

OpenAI really should consider adding a StreamGUID parameter, so we can make a call that asks how much was consumed for any streaming attempt, up to 1 or 2 minutes after the streaming completes. Using tiktoken, even if it works fine, is not ideal.

I also hit this problem in https://cocalc.com's ChatGPT integration. Here are the few lines of Node.js code I used to handle streaming and compute the usage:

I used the gpt3-tokenizer library, which seems to be yet another library in addition to the ones mentioned above, for solving this problem: gpt3-tokenizer - npm

I am aware of some third-party libs as well, but I was hesitant to use some of them even though they are open source. I handle it with the one recommended in the OpenAI cookbook and in the tiktoken readme, which is gpt-3-encoder. My concern is how accurate these libraries are and whether they will be kept up to date with changes on the GPT side. That's why it would be a lot better to use results returned from the API itself, as in the regular completion API.

When we look at this table from tiktoken, we can see that gpt-3-encoder doesn't even support the cl100k_base encoding.

Encoding name         OpenAI models
cl100k_base           gpt-4, gpt-3.5-turbo, text-embedding-ada-002
p50k_base             Codex models, text-davinci-002, text-davinci-003
r50k_base (or gpt2)   GPT-3 models like davinci
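For what it's worth, tiktoken will resolve the encoding from the model name for you, so you don't need to hard-code that table (a quick sketch):

import tiktoken

print(tiktoken.encoding_for_model("gpt-3.5-turbo").name)     # cl100k_base
print(tiktoken.encoding_for_model("gpt-4").name)             # cl100k_base
print(tiktoken.encoding_for_model("text-davinci-003").name)  # p50k_base
print(tiktoken.encoding_for_model("davinci").name)           # r50k_base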

Thanks for pointing that out! Now you're scaring me, and this is my new plan: change our token counter for GPT-3.5/GPT-4 · Issue #6933 · sagemathinc/cocalc · GitHub

1 Like

Just to give a heads up, 8 characters per token as a guesstimate is double what it should be: one token is roughly 0.75 words, or about 4 characters, making that value 4, not 8.

Anyhoo, where possible, try to use the various tiktoken ports that are out there on GitHub.
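As a quick sanity check on that rule of thumb, you can compare the characters-divided-by-4 estimate against an exact tiktoken count (rough illustration only; the ratio varies with language and content):

import tiktoken

encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
text = "Usage stats are not included when streaming, so count them yourself."

estimate = len(text) / 4                    # ~4 characters per token heuristic
exact = len(encoding.encode(text))          # actual BPE token count
print(f"estimate: {estimate:.1f}, exact: {exact}")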

Can you share some that you trust?

I only have experience with the official tiktoken library, here: GitHub - openai/tiktoken: tiktoken is a fast BPE tokeniser for use with OpenAI's models.

So I see that this is now recommended by OpenAI on their token counting page (scroll all the way to the bottom). It is also equivalent to tiktoken in Python, and it handles the cl100k_base encoding, with gpt-3.5 and gpt-4 support.

Any GitHub issue or other update on this thread? (Dec 1)

Token usage that precisely mirrors what the key is charged is critical for any layer-2 billing apps for my clients.

With chat completions, you can absolutely measure the input tokens and the response you receive yourself, by using a library such as tiktoken, and then add token-count metadata to accounts and to chat history messages for utility. You only need to add the fixed per-call/per-message overhead to the inputs sent, and the function specification size can be measured by switching the functions off on a non-streaming call. The only thing you have to allow for is occasional failures that still get billed.
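A sketch of that input-side accounting, following the num_tokens_from_messages recipe from the OpenAI cookbook (the 3-token per-message overhead and 3-token reply priming are approximations for gpt-3.5-turbo/gpt-4-era models and can drift between versions; function specifications are not counted here):

import tiktoken

def num_tokens_from_messages(messages, model="gpt-3.5-turbo"):
    """Approximate the prompt tokens a chat completion request will be billed."""
    encoding = tiktoken.encoding_for_model(model)
    num_tokens = 0
    for message in messages:
        num_tokens += 3                       # per-message overhead (approximation)
        for key, value in message.items():
            num_tokens += len(encoding.encode(value))
            if key == "name":
                num_tokens += 1               # extra token when a name is supplied
    num_tokens += 3                           # reply is primed with the assistant header
    return num_tokens

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello there, how may you assist me today?"},
]
print(num_tokens_from_messages(messages))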

With assistants, you absolutely have no idea what the agent has been up to until the daily charges start showing up on your account.

1 Like

Quick update on this: we have a version working that we are testing but are not yet happy with the overall design (thereā€™s a lot of complexity in streaming and billing together). Hopefully this is something we can land for you all soon. Stay tuned!

12 Likes

Super - thanks Logan.

I'm halfway through a layer-2 build using Hedera Hashgraph's SDK (Java, via a java2py lib) to enable HBAR, a public ledger with the highest, aBFT, consensus security.

The goal is to enable my customer's solar-industry tech-help chat app to bill their customers in HBAR/$ terms, layered on top of my customer's OpenAI pkey account.

In effect, tokenizing usage with a margin.

When this is done, I will have 20 hours a week coming available, in case you know of any AI projects that need help. Ty.

For the last 4 months I've been studying and using LangChain's libs, and LangSmith to tune prompts and chains. Databutton is kinda pretty but not deep enough as a Streamlit-based chat dev tool IDE…

Great. We hesitate to base our billing on any third-party library, so as long as OpenAI is able to bill us for usage, we should have a proper API way to grab the same data.

I've noticed that when you set logprobs to true, even when streaming, the logprobs property on each ChatCompletionChunk.Choice contains entries that each have a property called token, containing a snippet of text from the content. That is, the naming suggests that there is one entry in logprobs for each token.

If this is true, then we can probably total up the response tokens (if not the request tokens) by adding up the number of entries in all the logprobs arrays?

In my own tests, I've also noticed that, when streaming, by the above reasoning, there does appear to be only one token per chunk. That is, every logprobs I've seen so far in a ChatCompletionChunk.Choice had only one entry, with a single "token". If that's true, then you can perhaps, as others have said, just add up the number of chunks (though some chunks contain no content and no logprobs).
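A minimal sketch of that counting approach, summing the logprobs entries across chunks (as the next reply points out, logprobs are missing on tool-call chunks, so treat this as a lower bound; model and prompt are placeholders):

from openai import OpenAI

client = OpenAI()

stream = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Write a haiku about streams."}],
    logprobs=True,
    stream=True,
)

response_tokens = 0
reply = ""
for chunk in stream:
    choice = chunk.choices[0]
    if choice.delta.content:
        reply += choice.delta.content
    if choice.logprobs and choice.logprobs.content:      # absent on tool-call chunks
        response_tokens += len(choice.logprobs.content)  # one entry per token

print(reply)
print("completion tokens (from logprobs):", response_tokens)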

There is mostly one element per chunk. You cannot measure tokens by it, except to estimate a minimum.

However, it does seem to be a feasible concept to look at logprobs for some utility, because we get a list of logprobs per chunk.

Logprobs are blocked, however, when a tool_call is emitted, so it is not a complete solution:

tool chunk_no: 0
Traceback (most recent call last):
    for index, prob in enumerate(chunk.choices[0].logprobs.content):
TypeError: 'NoneType' object is not iterable

gpt-4-vision = no logprobs either

Specifying tools to the gpt-4-turbo models steals extra tokens from content output because of shady tool tricks meant to prevent control.



For an example where "chunks" does not equal tokens, let's just get emojis, and make a clear presentation from the logprobs within:

chunk_no: 0

chunk_no: 1
0: [240, 159, 152]
1: [128]

chunk_no: 2
0: [240, 159, 152]
1: [131]

chunk_no: 3
0: [240, 159, 152]
1: [132]

chunk_no: 4
0: [240, 159, 152]
1: [129]

chunk_no: 5
0: [240, 159, 152]
1: [134]

chunk_no: 6
response content:
😀😃😄😁😆
{'tool_calls': []}

The AI writes a tool call, though? No token count for you! Showing here:

chunk_no: 0
0: ('content', None)

chunk_no: 1

tools content:

{'tool_calls': [{'id': 'call_idnumber', 'function': {'arguments': '', 'name': 'get_random_int'}, 'type': 'function'}]}

(the unseen token overhead of tool_calls, combined with max_tokens, cuts off the arguments)

Parsing code snippet, Python

from openai import OpenAI

client = OpenAI()

# example request; model, prompt and tool_spec are placeholders for your own
tool_spec = [{"type": "function", "function": {
    "name": "get_random_int", "parameters": {"type": "object", "properties": {}}}}]

response = client.chat.completions.with_raw_response.create(   # streaming API call
    model="gpt-4-1106-preview",
    messages=[{"role": "user", "content": "Give me a random int"}],
    tools=tool_spec,
    logprobs=True,
    stream=True,
)
reply = ""
tools = []

for chunk_no, chunk in enumerate(response.parse()):    # .parse() yields the stream
    print(f"\nchunk_no: {chunk_no}")
    delta = chunk.choices[0].delta
    logprobs = chunk.choices[0].logprobs
    if delta.content:                                  # chunks with assistant text
        reply += delta.content                         # gather for chat history
        if logprobs and logprobs.content:
            for index, prob in enumerate(logprobs.content):
                print(index, end=': '); print(prob.bytes, end='\n')
    if delta.tool_calls:                               # chunks with a tool call
        if logprobs is not None:                       # logprobs.content is None here
            for index, prob in enumerate(logprobs):    # prints ('content', None)
                print(index, end=': '); print(prob, end='\n')
        tools += delta.tool_calls                      # gather ChoiceDeltaToolCall list

tools_obj = tool_list_to_tool_obj(tools)     # forum search: "messy tool deltas"
print("\nresponse content:\n" + reply)
print(tools_obj)

Any news on this? Is it closer to being released?