Why is there no usage object returned with a streaming API call?

There is no usage object returned when a chat.completions API request is made with streaming (the response arrives as chat.completion.chunk objects, which carry a delta and no usage field). For comparison, here is the usage object included in a regular, non-streamed response:

{
  "id": "chatcmpl-123",
  "object": "chat.completion",
  "created": 1677652288,
  "model": "gpt-3.5-turbo-0613",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "\n\nHello there, how may I assist you today?",
    },
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 9,
    "completion_tokens": 12,
    "total_tokens": 21
  }
}

What is the reason this is missing from streaming API calls? Or am I missing something, or doing something wrong, that keeps me from receiving it?

Usage stats are not included when streaming, I think mostly because of the difficulty of knowing when a stream might be terminated from the other side: at what point do you send the usage stats? You can use tiktoken to count the tokens in the response deltas.
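A minimal sketch of that, assuming the 1.x openai Python SDK (the model and prompt are just placeholders, and this only counts the completion side, not the prompt):

import tiktoken
from openai import OpenAI

client = OpenAI()
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")   # cl100k_base

stream = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hello there"}],
    stream=True,
)

completion_text = ""
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content:                       # deltas carry fragments of the reply
        completion_text += delta.content

# no usage object arrives with the stream, so count the tokens ourselves
completion_tokens = len(encoding.encode(completion_text))
print(completion_tokens)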

Yes, I am aware of tiktoken and other token counter libs. We are using Node.js, so I wasn't able to use tiktoken since it is not available in Node.js, but I used the gpt-encoder lib instead. Thanks. I still think it would be useful if the API returned usage at the end, when the stream completes with finish_reason = stop.

1 Like

Ahh, you might find this of use

2 Likes

OpenAI really should consider adding a StreamGUID parameter, so we can make a call that asks how much was consumed for any streaming attempt, up to 1 or 2 minutes after the streaming completes. Using tiktoken, even if it works fine, is not ideal.

I also hit this problem in https://cocalc.com's ChatGPT integration. Here are the few lines of Node.js code I used to handle streaming and compute the usage:

I used the gpt3-tokenizer library, which seems to be yet another library in addition to the ones mentioned above, for solving this problem: gpt3-tokenizer - npm

I am aware of some third-party libs as well, but I was hesitant to use some of them even though they are open source. I handle it with the one recommended in the OpenAI cookbook and in the tiktoken readme, which is gpt-3-encoder. My concern is how accurate these libraries are and whether they will be kept up to date with changes on the GPT side. That's why it would be a lot better to use results returned from the API itself, as in the regular completion API.

When we look at this table from tiktoken, we can see that gpt-3-encoder doesn't even support the cl100k_base encoding.

Encoding name         OpenAI models
cl100k_base           gpt-4, gpt-3.5-turbo, text-embedding-ada-002
p50k_base             Codex models, text-davinci-002, text-davinci-003
r50k_base (or gpt2)   GPT-3 models like davinci
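For what it's worth, tiktoken will resolve the encoding from the model name for you, so you don't need to hard-code that table (a quick sketch):

import tiktoken

print(tiktoken.encoding_for_model("gpt-3.5-turbo").name)     # cl100k_base
print(tiktoken.encoding_for_model("gpt-4").name)             # cl100k_base
print(tiktoken.encoding_for_model("text-davinci-003").name)  # p50k_base
print(tiktoken.encoding_for_model("davinci").name)           # r50k_base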

Thanks for pointing that out! Now you're scaring me, and this is my new plan: change our token counter for GPT-3.5/GPT-4 · Issue #6933 · sagemathinc/cocalc · GitHub

1 Like

Just to give a heads up, 8 characters per token as a guesstimate is double what it should be: one token is roughly 0.75 words, or about 4 characters, making that value 4, not 8.

Anyhoo, where possible, try to use the various tiktoken ports that are out there on GitHub.
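As a quick sanity check on that rule of thumb, you can compare the characters-divided-by-4 estimate against an exact tiktoken count (rough illustration only; the ratio varies with language and content):

import tiktoken

encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
text = "Usage stats are not included when streaming, so count them yourself."

estimate = len(text) / 4                    # ~4 characters per token heuristic
exact = len(encoding.encode(text))          # actual BPE token count
print(f"estimate: {estimate:.1f}, exact: {exact}")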

Can you share some that you trust?

I only have experience with the official tiktoken library, here: GitHub - openai/tiktoken: tiktoken is a fast BPE tokeniser for use with OpenAI's models.

So I see that this is now recommended by OpenAI on their token counting page (scroll all the way to the bottom). It is also equivalent to tiktoken in Python, and it handles the cl100k_base encoding, with gpt-3.5 and gpt-4 support.

Any GitHub issue or other update on this thread? (Dec 1)

Token usage that precisely mirrors what the key is charged is critical for any layer-2 billing apps for my clients.

With chat completions, you can absolutely measure the input tokens and the response you receive yourself, by using a library such as tiktoken, and then add token-count metadata to accounts and to chat history messages for utility. You only need to add the fixed per-call/per-message overhead to the inputs sent, and the function specification size can be measured by switching the functions off on a non-streaming call. The only thing you have to allow for is occasional failures that still get billed.
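A sketch of that input-side accounting, following the num_tokens_from_messages recipe from the OpenAI cookbook (the 3-token per-message overhead and 3-token reply priming are approximations for gpt-3.5-turbo/gpt-4-era models and can drift between versions; function specifications are not counted here):

import tiktoken

def num_tokens_from_messages(messages, model="gpt-3.5-turbo"):
    """Approximate the prompt tokens a chat completion request will be billed."""
    encoding = tiktoken.encoding_for_model(model)
    num_tokens = 0
    for message in messages:
        num_tokens += 3                       # per-message overhead (approximation)
        for key, value in message.items():
            num_tokens += len(encoding.encode(value))
            if key == "name":
                num_tokens += 1               # extra token when a name is supplied
    num_tokens += 3                           # reply is primed with the assistant header
    return num_tokens

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello there, how may you assist me today?"},
]
print(num_tokens_from_messages(messages))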

With assistants, you absolutely have no idea what the agent has been up to until the daily charges start showing up on your account.

1 Like

Quick update on this: we have a version working that we are testing but are not yet happy with the overall design (thereā€™s a lot of complexity in streaming and billing together). Hopefully this is something we can land for you all soon. Stay tuned!

12 Likes

Super - thanks Logan.

I'm halfway through a layer-2 build using Hedera Hashgraph's SDK (Java, via a java2py lib) to enable HBAR, a public ledger with the highest, aBFT, consensus security.

The goal is to enable my customer's solar-industry tech-help chat app to bill their customers in HBAR/$ terms, layered on top of my customer's OpenAI pkey account.

In effect, tokenizing usage with a margin.

When this is done, I will have 20 hours a week coming available, in case you know of any AI projects that need help. Ty.

For the last 4 months I've been studying and using LangChain's libs, and LangSmith to tune prompts and chains. Databutton is kinda pretty but not deep enough as a Streamlit-based chat dev tool IDE…

Great. We hesitate to base our billing on any third-party library, so as long as OpenAI is able to bill us for usage, we should have a proper API way to grab the same data.

I've noticed that when you set logprobs to true, even when streaming, the logprobs property on each ChatCompletionChunk.Choice contains entries that each have a property called token, containing a snippet of text from the content. That is, the naming suggests that there is one entry in logprobs for each token.

If this is true, then we can probably total up the response tokens (if not the request tokens) by adding up the number of entries in all the logprobs arrays?

In my own tests, I've also noticed that, when streaming, by the above reasoning, there does appear to be only one token per chunk. That is, every logprobs I've seen so far in a ChatCompletionChunk.Choice had only one entry, with a single "token". If that's true, then you can perhaps, as others have said, just add up the number of chunks (though some chunks contain no content and no logprobs).
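A minimal sketch of that counting approach, summing the logprobs entries across chunks (as the next reply points out, logprobs are missing on tool-call chunks, so treat this as a lower bound; model and prompt are placeholders):

from openai import OpenAI

client = OpenAI()

stream = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Write a haiku about streams."}],
    logprobs=True,
    stream=True,
)

response_tokens = 0
reply = ""
for chunk in stream:
    choice = chunk.choices[0]
    if choice.delta.content:
        reply += choice.delta.content
    if choice.logprobs and choice.logprobs.content:      # absent on tool-call chunks
        response_tokens += len(choice.logprobs.content)  # one entry per token

print(reply)
print("completion tokens (from logprobs):", response_tokens)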

There is mostly one element per chunk. You cannot measure tokens by it, except to estimate a minimum.

However, it does seem to be a feasible concept to look at logprobs for some utility, because we get a list of logprobs per chunk.

Logprobs are blocked, however, when a tool_call is emitted, so it is not a complete solution:

tool chunk_no: 0
Traceback (most recent call last):
    for index, prob in enumerate(chunk.choices[0].logprobs.content):
TypeError: 'NoneType' object is not iterable

gpt-4-vision = no logprobs either

Specifying tools to the gpt-4-turbo models steals extra tokens from content output because of shady tool tricks meant to prevent control.



For an example where "chunks" does not equal tokens, let's just get emojis, and make a clear presentation from the logprobs within:

chunk_no: 0

chunk_no: 1
0: [240, 159, 152]
1: [128]

chunk_no: 2
0: [240, 159, 152]
1: [131]

chunk_no: 3
0: [240, 159, 152]
1: [132]

chunk_no: 4
0: [240, 159, 152]
1: [129]

chunk_no: 5
0: [240, 159, 152]
1: [134]

chunk_no: 6
response content:
😀😃😄😁😆
{'tool_calls': []}

The AI writes a tool call, though? No token count for you! Showing here:

chunk_no: 0
0: ('content', None)

chunk_no: 1

tools content:

{'tool_calls': [{'id': 'call_idnumber', 'function': {'arguments': '', 'name': 'get_random_int'}, 'type': 'function'}]}

(the unseen token overhead of tool_calls, combined with max_tokens, cuts off the arguments)

Parsing code snippet, Python

from openai import OpenAI

client = OpenAI()

# example request; model, prompt and tool_spec are placeholders for your own
tool_spec = [{"type": "function", "function": {
    "name": "get_random_int", "parameters": {"type": "object", "properties": {}}}}]

response = client.chat.completions.with_raw_response.create(   # streaming API call
    model="gpt-4-1106-preview",
    messages=[{"role": "user", "content": "Give me a random int"}],
    tools=tool_spec,
    logprobs=True,
    stream=True,
)
reply = ""
tools = []

for chunk_no, chunk in enumerate(response.parse()):    # .parse() yields the stream
    print(f"\nchunk_no: {chunk_no}")
    delta = chunk.choices[0].delta
    logprobs = chunk.choices[0].logprobs
    if delta.content:                                  # chunks with assistant text
        reply += delta.content                         # gather for chat history
        if logprobs and logprobs.content:
            for index, prob in enumerate(logprobs.content):
                print(index, end=': '); print(prob.bytes, end='\n')
    if delta.tool_calls:                               # chunks with a tool call
        if logprobs is not None:                       # logprobs.content is None here
            for index, prob in enumerate(logprobs):    # prints ('content', None)
                print(index, end=': '); print(prob, end='\n')
        tools += delta.tool_calls                      # gather ChoiceDeltaToolCall list

tools_obj = tool_list_to_tool_obj(tools)     # forum search: "messy tool deltas"
print("\nresponse content:\n" + reply)
print(tools_obj)

Any news on this? Is it closer to being released?