Usage Info in API Responses

Hi everyone,

We have started providing token usage information as part of the responses from the completions, edits, and embeddings endpoints. This data is the same as what is shown on your usage dashboard, now made available through the API.

For example, a response from the completions endpoint now looks like:

{
 "id": "cmpl-uqkvlQyYK7bGYrRHQ0eXlWi8",
 "object": "text_completion",
 "created": 1589478378,
 "model": "text-davinci-002",
 "choices": [
  {
   "text": "\n\nThis is a test",
   "index": 0,
   "logprobs": null,
   "finish_reason": "length"
  }
 ],
 "usage": {
  "prompt_tokens": 5,
  "completion_tokens": 5,
  "total_tokens": 10
 }
}

You can find full details in the API Reference.

Note that for the completions endpoint, if the stream argument is enabled, the response stream remains unchanged and the usage information is not included.
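For reference, here is a minimal sketch of reading these fields with the openai Python package (the API key, model, and prompt below are placeholders; this applies to non-streaming requests only):

import openai  # pip install openai

openai.api_key = "YOUR_API_KEY"  # placeholder

# Non-streaming completions request; the usage object is attached to the response.
response = openai.Completion.create(
    model="text-davinci-002",
    prompt="Say this is a test",
    max_tokens=7,
)

usage = response["usage"]
print(usage["prompt_tokens"], usage["completion_tokens"], usage["total_tokens"])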

18 Likes

Thanks, this is useful.

How do we calculate the exact total tokens used in streaming requests?

Thanks

The feature wasn’t enabled in streaming by default because we found that it could break existing integrations. It does exist though! If you would like it turned on, send us a message at help.openai.com

1 Like

Is it possible to include in every choice of the response how many tokens were used?

The scenario I’m facing right now is that I want to make my request with an “n” of more than 1, and I need to catalog how much every completion cost. I could compute completion_tokens / n, but it would not be accurate :smiling_face_with_tear:
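The best workaround I can think of is re-counting each choice myself with tiktoken, something like this rough sketch (the model, prompt, and encoding are just placeholders, and local counts still won’t exactly match billing):

import openai    # pip install openai
import tiktoken  # pip install tiktoken

# A completions request with n > 1 (placeholder model and prompt).
response = openai.Completion.create(
    model="text-davinci-003",
    prompt="Write a tagline for a coffee shop.",
    n=3,
    max_tokens=32,
)

# Re-count each choice locally with the encoding that matches the model.
enc = tiktoken.encoding_for_model("text-davinci-003")
for choice in response["choices"]:
    print(choice["index"], len(enc.encode(choice["text"])))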

1 Like

Please advise on how to enable usage data in completions for streaming mode.

2 Likes

You can’t - and it is very annoying. We have to recreate the tokenizer and calculate the count ourselves from the text that was returned.

We are having accuracy issues, though: the tokenizer and what you are billed do not always match.

If you would like usage data enabled in streaming, please send a message to our support team at help.openai.com and we can enable the feature.

3 Likes

Hi Chris,

Is it possible to set up a channel that high-volume (paying) users and SaaS providers can use to get support from OpenAI staff?

I understand you are super busy right now, but when we need to increase our rate limits or monthly account limits as we roll out a SaaS solution using OpenAI technology, it would be good to have a person or a channel that is not overloaded and that we can get timely responses from.

Like many others, I have asked about rate limit increases and had no reply. I often see messages on the forum about people asking to increase their monthly spend and not getting a response either.

We are rolling out a product and can’t take on 1000 clients a week because we can’t be sure that the service will handle the requests we will need to send. So, for now, we are throttling our onboarding rate. It would be great if we could confidently “turn on the tap”.

Maybe you could have a support channel for people who spend over $x per week or month. Maybe you could automatically put people into this channel when they hit that limit so they can get priority support over the millions who are playing with the AI. This way you could support serious SaaS providers and high-volume users.

3 Likes

Hi Raymond,

I think, alternatively, you could check out the OpenAI services on Azure, but they seem much more costly.

1 Like

I think they have the wrong price for fine-tuned Davinci models: $34 per hour (approx. $24,000 per month).

I suspect this should be $0.34 per hour (approx. $244 per month).

They also have a fine-tuned Codex.

They don’t mention the versions for the base models either. I assume 003 for Davinci, but the examples refer to 002.

It looks like they expect you to fire up an instance, run it for a few hours and then shut it down.

Quoting their site:

“You now fine-tune a Curie model with your data, deploy the model and make 14.5M tokens over a 5-day period. You leave the model deployed for the full five days (120 hours) before you delete the endpoint. Here are the charges you will have incurred:”

2 Likes

Is this something you enable on a per-account level or on a per-API-token level? Is there a way to keep existing integrations using streaming responses without usage info, and then switch over in a controlled way?

Hi @hallacy, we opened a ticket at help.openai.com two weeks ago to enable usage data in stream mode for text and chat completions. Nobody has answered it. Could you please help?

2 Likes

Hello @hallacy, I want the usage data information. I am using streaming, so how can I get the usage data?

1 Like

Hi Chris! I’ve messaged you through help.openai.com “Feature Request” to have this feature enabled. Can I also have this enabled when streaming please?

1 Like

I just came to say that the chat is a disaster. It can’t be switched to Spanish, and I don’t understand anything or even where I’m typing. At the start you can always select the language, but here, where there is “intelligence”, you can’t.

Hi @dschnurr !
I noticed that on the usage page I can see the number of requests and token usage per period, so is there any official API that can query the token usage of a conversation through its “id”? The “id” exists in both stream requests and normal requests. (“id”: “chatcmpl-74pW6*********************Wdi”)
Thanks

Can we see or fetch the data that we have generated using our API key: input and output prompts, costs, usage, etc.?

Hello - I am also going through a similar process. Would you be able to share what the usage info response looks like in streaming mode?

Hi and welcome to the Developer Forum!

There are no usage info messages returned in streaming mode; you would need to concatenate all of your returned message deltas and then use tiktoken to count the tokens used.

And then the usage readout can look however you want when you do your own counting:

Well now, top o’ the mornin’ to ya! I’m Mac o’Paddy, the jolliest leprechaun
ye’ll ever meet. I’m a wee bit mischievous, but always with a heart full o’
gold. I’ve been wanderin’ these green hills of Ireland for centuries, guardin’
me pot o’ gold at the end of the rainbow. So, what brings ye to me humble abode today?
> [Finish reason: stop] 60 words/95 chunks, 95 tokens in 3.8 seconds.
##>Can you give the same introduction, but in Mandarin Chinese for my friend?
Ah, sure and begorrah! I’ll give it a go for your friend. In Mandarin
Chinese, it would go a little somethin’ like this:

早上好!我是麦克·奥帕迪,你会遇到的最快乐的小矮人。我有点淘气,但心里总是装满了金子。我在爱尔兰的这片绿色山丘上漫游了几个世纪,守护着我藏在彩虹尽头的金罐。那么,今天你和你的朋友来我这里有什么事呢?
> [Finish reason: stop] 24 words/133 chunks, 167 tokens in 6.7 seconds.
##>

Here’s a class I wrote for your use:

import re
import tiktoken  # pip install tiktoken first

class Tokenizer:
    """Count tokens in a string with tiktoken.

    Usage:
        tokenz = Tokenizer("cl100k_base")
        token_count = tokenz.count(my_string)
        print(f"The phrase {my_string} has a length {token_count}")
    """
    def __init__(self, model_name):
        # model_name is a tiktoken encoding name, e.g. "cl100k_base"
        self.tokenizer = tiktoken.get_encoding(model_name)
        # matches special chat markup such as <|im_start|> and <|im_end|>
        self.chat_strip_match = re.compile(r'<\|.*?\|>')

    def ucount(self, text):
        # unstripped count: tokenize the text exactly as given
        encoded_text = self.tokenizer.encode(text)
        return len(encoded_text)

    def count(self, text):
        # strip special chat markup first, then tokenize
        text = self.chat_strip_match.sub('', text)
        encoded_text = self.tokenizer.encode(text)
        return len(encoded_text)

Since some special markup is stripped from text sent to a chat endpoint, the normal count method strips it too. Note that neither method counts the special control tokens (which can be forced to appear in the output) as single tokens.
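If you want to use it against a stream, here is a rough usage sketch (assuming the openai Python package and gpt-3.5-turbo; the stream itself carries no usage object, so you count the concatenated deltas yourself):

import openai  # pip install openai

# Streaming chat completion (placeholder prompt); chunks arrive as deltas.
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Introduce yourself as a leprechaun."}],
    stream=True,
)

# Concatenate the content deltas, then count locally with the class above.
reply = ""
for chunk in response:
    delta = chunk["choices"][0].get("delta", {})
    reply += delta.get("content") or ""

tokenz = Tokenizer("cl100k_base")
print(f"Completion tokens (approx.): {tokenz.count(reply)}")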