OpenAI API - get usage tokens in the response when stream=True is set

In general, we can get token usage from response.usage.total_tokens, but not when I set the parameter stream to True. For example:

import openai

def performRequestWithStreaming():
    openai.api_key = OPEN_AI_TOKEN
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "What is Python?"}],
        stream=True,
        temperature=0)

    for r in response:
        print(r)

all the responses look like this:

{
  "choices": [
    {
      "delta": {
        "content": "."
      },
      "finish_reason": null,
      "index": 0
    }
  ],
  "created": 1680676704,
  "id": "chatcmpl-71r4iJF8s8R7Uedb4FZO13U5CPdTr",
  "model": "gpt-3.5-turbo-0301",
  "object": "chat.completion.chunk"
}
{
  "choices": [
    {
      "delta": {},
      "finish_reason": "stop",
      "index": 0
    }
  ],
  "created": 1680676704,
  "id": "chatcmpl-71r4iJF8s8R7Uedb4FZO13U5CPdTr",
  "model": "gpt-3.5-turbo-0301",
  "object": "chat.completion.chunk"
}

There is no usage property now, so how can I know how many tokens were used?

14 Likes

I agree; I would like to have the usage in the response to stream requests (at least in one of the events in the stream), similar to the usage for non-stream requests.

However, here is a workaround in the meantime:

  1. The number of prompt tokens can be calculated offline, using tiktoken in Python for example (This is a guide I used).
  2. The number of events in the stream should represent the number of tokens in the response, so just count them while iterating.
  3. Adding 1 and 2 together gives you the total number of tokens for the request.

I want to stress that I would still like to receive the usage in the response to the request and not have to compute it myself.
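The counting in steps 2 and 3 can be sketched like this (step 1, the prompt estimate, would use tiktoken offline; note that nothing guarantees each stream event is exactly one token, so this is an approximation):

```python
# Step 2 from the workaround above: count the completion tokens by counting
# the content-bearing stream events. The final finish_reason chunk has an
# empty delta and is skipped.

def count_completion_chunks(response):
    """Count delta events that actually carry content."""
    n = 0
    for chunk in response:
        if chunk["choices"][0]["delta"].get("content"):
            n += 1
    return n

# Simulated stream, shaped like the chunks shown earlier in the thread:
fake_stream = [
    {"choices": [{"delta": {"content": "Python"}, "finish_reason": None}]},
    {"choices": [{"delta": {"content": " is"}, "finish_reason": None}]},
    {"choices": [{"delta": {}, "finish_reason": "stop"}]},
]
print(count_completion_chunks(fake_stream))  # → 2
```

For step 3, you would add this count to the offline tiktoken estimate of the prompt tokens.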

3 Likes

I noticed that on the usage page you can see the number of requests and the token usage per period, so is there any official API that can query the token usage of a conversation by its “id”? The “id” exists in both stream requests and normal requests. (“id”: “chatcmpl-74pW6*********************Wdi”)

1 Like

However, after extensive testing, I found that the token count produced by offline calculation is far from the actual value used, so Python’s tiktoken is not reliable for this.
That said, here is the method I am currently using to calculate tokens:

  1. Each stream event containing part of the answer is treated as one token, and adding all these events up gives the total tokens in the answer. This is how I calculate the response tokens.
  2. The prompt tokens can be estimated with tiktoken (which, as I said, is not really exact). Ha ha ha
1 Like

Can you clarify that? Question and answer? Are you saying that the chunks returned in the answer (response) account for both the question (request) and the answer (response)? So the entire conversation turn’s token usage (question and answer) is basically the number of chunks returned in the stream?

1 Like

I think that’s what I meant. I don’t quite understand English; I am Chinese and use translation software to translate your language, so there may be some discrepancies in the translation. Sorry.

1 Like

FWIW: an ideal place for it to be picked up would be when we receive the [DONE] message. I would like this too; it would be nice to avoid using an instance of tiktoken to do all of this.
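For reference, in the raw SSE stream the terminator is a literal `data: [DONE]` line; a hypothetical parser (a sketch, not the official client) might watch for it like this:

```python
import json

def parse_sse_events(raw_lines):
    """Yield parsed chunk dicts from raw SSE 'data:' lines, stopping at the
    [DONE] sentinel, which is where a usage object would ideally arrive."""
    for line in raw_lines:
        if not line.startswith("data: "):
            continue
        payload = line[len("data: "):]
        if payload == "[DONE]":
            return  # end of stream
        yield json.loads(payload)

# Simulated raw SSE lines:
raw = [
    'data: {"choices": [{"delta": {"content": "Hi"}}]}',
    'data: [DONE]',
    'data: {"choices": [{"delta": {"content": "ignored"}}]}',
]
chunks = list(parse_sse_events(raw))
print(len(chunks))  # → 1
```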

2 Likes

I’m not using stream=true currently, but isn’t each chunk a single token? So you could count the # of chunks and have the # of tokens?

1 Like

That’s just the reply; the # of tokens consumed includes what you send, plus parts of the data structure that makes up the messages array.

This package has a rubric for figuring it out using a node implementation of tiktoken:

Still, would be nice to hear from the source what the total consumed was for the individual request.
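To illustrate the point about message-array overhead, here is a rough Python sketch. The 4-tokens-per-message and +3 reply-priming figures are the ones OpenAI’s cookbook documented for gpt-3.5-turbo-0301 and are assumptions that vary by model; the word-splitting tokenizer here is only a stand-in for tiktoken:

```python
# Why prompt tokens exceed the visible text: every message in the messages
# array carries framing overhead on top of its role and content strings.

def estimate_prompt_tokens(messages, count_text_tokens,
                           tokens_per_message=4, reply_priming=3):
    """count_text_tokens: any tokenizer callable; in real use you would pass
    something like lambda t: len(enc.encode(t)) with a tiktoken encoding."""
    total = reply_priming                 # the reply is primed with tokens too
    for m in messages:
        total += tokens_per_message       # per-message framing overhead
        for value in m.values():          # role and content both count
            total += count_text_tokens(value)
    return total

# Demo with a stand-in tokenizer (1 token per whitespace-separated word):
naive = lambda text: len(text.split())
msgs = [{"role": "user", "content": "What is Python?"}]
print(estimate_prompt_tokens(msgs, naive))  # 3 (priming) + 4 + 1 + 3 = 11
```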

1 Like

You can only count completion tokens this way, not prompt tokens.

Looking up usage by ID would be good, or a tokenizer endpoint would be good as well.

2 Likes

Uh, has anyone noticed that in the official API docs, they show a usage field in the response chunk:

  "usage": {
    "prompt_tokens": 9,
    "completion_tokens": 12,
    "total_tokens": 21
  }

However, I am quite sure I didn’t get one, even in the final “stop” chunk. Can anyone help?

4 Likes

I just checked all the chunks I get back and you’re right, none of them have the usage field, even though the documentation shows it. Weird.

That’s not in the chunk object or streaming example:

This can’t be right.

So we are purchasing tokens for a price per token, but we’re not allowed to know how many tokens a request has used?

This must be a bug.

5 Likes

Bumping this thread, as this is a major hole in the current API. Specifically, streaming responses should include a usage object, either as a cumulative sum or alternatively alongside the final "finish_reason": "stop" chunk.

Counting the number of chunks returned is not a valid workaround because (a) we have no explicit guarantee that each chunk is exactly one token, and (b) it can’t account for the prompt_tokens used in the completion request, even though we are billed for them.

5 Likes

Well, you can just run tiktoken on each delta chunk and sum the results.

1 Like

Since it’s not even stated that chunks will always fall on token boundaries:

import tiktoken

class Tokenizer:
    def __init__(self, encoder="cl100k_base"):
        self.tokenizer = tiktoken.get_encoding(encoder)

    def tokens(self, text):  
        return len(self.tokenizer.encode(text))

count = Tokenizer()

# assemble AI `reply` as you would need to do to add to chat history
tokens = count.tokens(reply)

A clever person could even calculate a function_call return by putting it back into the language the AI emitted.

2 Likes

Sure, that’s a “workaround”, but not a solution.

It (a) assumes tiktoken perfectly matches the tokenization the API uses, and (b) forces developers to add another dependency to their project, particularly when the only official version is the Python package (JS users have to rely on a third-party fork of their choice, which can be problematic for a few reasons).

I’d link to issues 22 and 97 on the github repo but don’t have the rep to add links…

It works when we use

result = await api.ChatEndpoint.GetCompletionAsync(chatRequest);

Unfortunately, when we stream, Usage is null:

result = await api.ChatEndpoint.StreamCompletionAsync(chatRequest, partialResponse =>
{
    txtinfo = txtinfo + partialResponse.FirstChoice.Delta;
});

I have a similar problem: either show the word “Processing” for at least 20s and keep track of usage, or have a responsive application without any clue about costs. I think the Usage info on the official website has some delay.