OpenAI API - get usage tokens in response when stream=True is set

The workaround I used was to get ChatGPT to create a paragraph in each language, then use the online tokenizer to count the tokens in each paragraph and divide that count by the number of characters in the paragraph.

That gives me an average character-to-token ratio for each language that I can store in a table.

I did this on a per language basis because the average number of characters per token can vary dramatically by language.

I’m then able to guesstimate how many tokens would be used for any arbitrary number of characters. It’s not 100% accurate, but it is very fast and it’s close enough (90%-ish).

Of course, this does require you to know the language of the text up front, and it won’t work if the text contains multiple languages.

In my case I’m only using it to roughly guesstimate how much each conversation costs.
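A minimal sketch of the lookup-table approach described above, in case it helps anyone. The ratios in `CHARS_PER_TOKEN` are placeholders, not measurements; you would fill them in from your own sample paragraphs. The function names and the `usd_per_1k_tokens` parameter are made up for illustration.

```python
import math

# Placeholder characters-per-token averages (NOT measured values).
# Replace them with ratios computed from your own sample paragraphs.
CHARS_PER_TOKEN = {
    "en": 4.0,
    "de": 3.5,
    "ja": 1.5,
}


def estimate_tokens(text: str, language: str) -> int:
    """Estimate the token count from the character count alone."""
    ratio = CHARS_PER_TOKEN[language]  # raises KeyError for unknown languages
    return math.ceil(len(text) / ratio)


def estimate_cost(text: str, language: str, usd_per_1k_tokens: float) -> float:
    """Rough cost guess; the price is passed in by the caller."""
    return estimate_tokens(text, language) / 1000 * usd_per_1k_tokens
```

Because it is only a division, this stays fast even on long conversations, at the price of the accuracy and single-language limits already mentioned.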

Guys, being unable to measure token usage is a special feature: we can then focus on the code instead of the bill :rofl:

Yeah, why not just include the “usage” section, as described for the non-streamed response, in the final chunk just before [DONE]?

+1, really unfortunate that we are not getting that information.

+1 to this - it’s not clear whether the number of streaming chunks corresponds to the number of tokens, and ideally I don’t want to estimate how much was used per request. Please include this in the final chunk!

+1 - tiktoken workaround does not work when you include images in the messages. Include usage in the streaming response.

+1. Has anyone found a solution for this? Is the answer just to calculate token usage offline with tiktoken? Seems pretty counter-intuitive to not include the tokens used for the streaming api.

I finally found a solution I’m happy with, after hours of scouring documentation. Unfortunately, they do not give an option to query for usage information by ID, or even just returning usage somehow; that would’ve been the easier solution. Instead, here’s my implementation. It involves:

  • Counting tokens for images with the new gpt-4-turbo/vision models
  • Accounting for the scuffed and varied additional tokens that get added in by OpenAI’s API
  • Wrapping the returned Stream generator, appending any tokens to a list before yielding, and finally processing the list as the output message

Implementation of the CountStreamTokens class (types are slightly scuffed):

Code:


from datetime import datetime

import openai

# Excerpts only: add_token_count is a method (note `self`), and the rest sits
# inside a function that returns the wrapped stream.
def add_token_count(self, prompt_tokens: int, completion_tokens: int, model: str) -> None:
    # Called as a callback once the token calculation has finished.
    # Here I append the counts to a running list; you can do anything you like with the numbers.
    self.detailed_usage.append({
        "model": model,
        "usage": {"prompt_tokens": prompt_tokens, "completion_tokens": completion_tokens},
        "time": datetime.now(),
    })

completion = openai.chat.completions.create(messages=messages, stream=True, **params)

# completion is now a generator, or a 'stream' object.
# CountStreamTokens is a custom class that is initialized with the model you use and the messages you want to query with.
# These are saved as class attributes for use in the .wrap_stream_and_count() function.
# .wrap_stream_and_count() returns another generator, yielding all the same tokens as OpenAI provides,
# but simultaneously collecting the output tokens.
# When the generator detects a None (ending) token in the stream,
# it yields the final token and only then begins counting tokens (so the stream is not held up).

return CountStreamTokens(model, messages).wrap_stream_and_count(completion, add_token_count)
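
The CountStreamTokens class itself is not shown above, so the following is only a rough sketch of the wrapping idea, not the poster's implementation. It assumes chunks shaped like Chat Completions stream chunks (`chunk.choices[0].delta.content`), and `count_tokens` and `on_done` are hypothetical callables supplied by the caller. It counts completion text only; prompt tokens, images, and the per-message overhead are left out.

```python
from typing import Callable, Iterable, Iterator


def wrap_stream_and_count(
    stream: Iterable,
    count_tokens: Callable[[str], int],
    on_done: Callable[[int], None],
) -> Iterator:
    """Yield every chunk unchanged, then report the completion token count."""
    pieces = []
    for chunk in stream:
        # Assumed chunk shape: chunk.choices[0].delta.content (may be None).
        if chunk.choices:
            text = chunk.choices[0].delta.content
            if text:
                pieces.append(text)
        yield chunk
    # Runs once the consumer has exhausted the stream, so counting
    # never delays delivery of a chunk.
    on_done(count_tokens("".join(pieces)))
```

With tiktoken installed, `count_tokens` could be something like `lambda s: len(enc.encode(s))`, where `enc` comes from `tiktoken.encoding_for_model(...)`.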

Please implement this in the final chunk. I don’t want to estimate something and introduce inaccuracies.

PS. (ahem Claude already tells me the usage in the final chunk of a stream ahem)

The OpenAI API compatible endpoint of the llama.cpp server does exactly this:

data: {"choices":[{"finish_reason":"stop","index":0,"delta":{}}],"created":1713786803,"id":"chatcmpl-pmnNZxAQ2B1JYX9AZfJXCe0uwII0Aejx","model":"openchat-3.5-0106.Q5_K_M.gguf","object":"chat.completion.chunk","usage":{"completion_tokens":18,"prompt_tokens":28,"total_tokens":46}}
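
If you read the raw event stream yourself, pulling the usage out of a line like that one could look roughly like this. It is a sketch that assumes each event arrives as a single `data: ...` line with a JSON payload, or the `[DONE]` sentinel; the function name is made up.

```python
import json
from typing import Optional


def usage_from_sse_line(line: str) -> Optional[dict]:
    """Return the "usage" object from one SSE data line, or None if absent."""
    if not line.startswith("data:"):
        return None  # comments, blank keep-alive lines, other fields
    payload = line[len("data:"):].strip()
    if payload == "[DONE]":
        return None  # end-of-stream sentinel carries no JSON
    return json.loads(payload).get("usage")
```

Most chunks return None here; only the chunk that carries a `usage` key yields a dict.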

(OpenAI here!)

We have added support for this, so no more workarounds needed! See “Usage stats now available when using streaming with the Chat Completions API or Completions API”.
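
For anyone landing here: the feature is opt-in via `stream_options={"include_usage": True}` on the request. With it set, one extra chunk arrives at the end of the stream with an empty `choices` list and a populated `usage`, while `usage` is null on the other chunks. A small sketch of reading it follows; `collect_stream` is a made-up helper name, and the request itself is shown only as a comment.

```python
def collect_stream(chunks):
    """Concatenate streamed text and return it with the usage object, if any."""
    parts = []
    usage = None
    for chunk in chunks:
        # The usage-bearing chunk has an empty choices list, so guard the index.
        if chunk.choices and chunk.choices[0].delta.content:
            parts.append(chunk.choices[0].delta.content)
        if getattr(chunk, "usage", None) is not None:
            usage = chunk.usage
    return "".join(parts), usage


# With the official Python client the stream would come from a call like:
#   stream = client.chat.completions.create(
#       model=..., messages=..., stream=True,
#       stream_options={"include_usage": True},
#   )
#   text, usage = collect_stream(stream)
```

If `usage` comes back as None, the option was most likely not set on the request.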

The same is happening to me, but only while streaming.

The final chunk does not contain the usage.