- OpenAI responses don’t include any x-rate… header for me (why?).
- There’s no API for checking current limits.
- I am ready to calculate it on my end, but it’s unclear how to count tokens and requests for the Assistants API. Does every request count? Is it sufficient to count tokens only for messages added with addMessage?
Any clarification or advice is appreciated!
Hi and welcome to the Developer Forum!
This is all being looked at by the team building the Assistants functionality. Making a thread in API Feedback is the best way to show these are features you would like.
@Foxabilo thanks! No action is required from my side, right?
Also, it is not clear to me: are the usage limits for each model calculated separately, or does reaching the limit for one affect the other? For instance, if I use 50 requests per minute on GPT-3 and have a limit of 50 for GPT-3 and 100 for GPT-4, does this leave me with 50 or 100 remaining requests for GPT-4?
Typically, different models have separate rate limits, but there are also rate limit tiers (1 through 5), which have larger values for those who spend more and have been reliably making payments for longer periods. Details here:
Assistants can make multiple model calls autonomously and iteratively before giving back a single response. Each of these internal AI calls counts against a rate limit in the API. One run may use an indeterminate number of tokens. The number of calls can be determined indirectly by reading the number of run “steps” through the API.
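To make the step-counting idea concrete, here is a minimal sketch in Python. It assumes you have already fetched the steps of one run and converted each to a plain dict, and that a step may carry an optional `usage` entry with `prompt_tokens` and `completion_tokens`. That shape, and the helper name `summarize_run_steps`, are assumptions for illustration; steps without usage data are counted but contribute no tokens.

```python
from typing import Any, Dict, Iterable, Mapping


def summarize_run_steps(steps: Iterable[Mapping[str, Any]]) -> Dict[str, int]:
    """Tally the steps of one run and any token usage they report.

    Each step is assumed (hypothetically) to be a dict that may contain
    a "usage" mapping with "prompt_tokens" and "completion_tokens".
    """
    summary = {
        "steps": 0,
        "steps_without_usage": 0,
        "prompt_tokens": 0,
        "completion_tokens": 0,
    }
    for step in steps:
        summary["steps"] += 1
        usage = step.get("usage")
        if not usage:
            # No usage reported for this step: count it, add no tokens.
            summary["steps_without_usage"] += 1
            continue
        summary["prompt_tokens"] += usage.get("prompt_tokens", 0)
        summary["completion_tokens"] += usage.get("completion_tokens", 0)
    summary["total_tokens"] = (
        summary["prompt_tokens"] + summary["completion_tokens"]
    )
    return summary
```

The step count is only a rough proxy for the number of model calls, so treat the result as an estimate rather than an exact figure.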
The models are pooled by type, although now there are many types, where a preview model has a different rate from gpt-4. If you reach a “no more” point, this can affect multiple discrete model names. Someone really curious could make two model calls in quick succession and see whether the requests-per-minute count for the second includes the first.
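Since limits appear to apply per pool rather than per model name, a client-side tally can be keyed by pool. The sketch below is one way to do it with a sliding 60-second window. The class name `SlidingWindowTracker` and the pool labels are hypothetical, and the code does not know which model names share a pool; you have to decide that mapping yourself.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict, Tuple


class SlidingWindowTracker:
    """Track requests and tokens per pool over a sliding time window."""

    def __init__(
        self,
        window_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.window = window_seconds
        self.clock = clock  # injectable so the tracker can be tested
        # pool label -> queue of (timestamp, tokens) events, oldest first
        self._events: Dict[str, Deque[Tuple[float, int]]] = defaultdict(deque)

    def record(self, pool: str, tokens: int = 0) -> None:
        """Record one request against a pool, with its token count."""
        self._events[pool].append((self.clock(), tokens))

    def usage(self, pool: str) -> Tuple[int, int]:
        """Return (requests, tokens) seen in the current window."""
        events = self._events[pool]
        cutoff = self.clock() - self.window
        # Drop events that have aged out of the window.
        while events and events[0][0] <= cutoff:
            events.popleft()
        return len(events), sum(tokens for _, tokens in events)
```

For example, calling `tracker.record("gpt-4", tokens=500)` after each call and checking `tracker.usage("gpt-4")` before the next one gives a local estimate to compare against your own limits. It will not match the server's accounting exactly.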
One can imagine that useful rate headers on assistants responses could be seen as another way to get the token counts and calls consumed by assistants, something that OpenAI has shown they don’t want revealed at their immediately objectionable face value.
If not doing async parallel calls to muddy the statistics, a clever person could make small calls to the same model being employed before and after the assistant run to deduce the rate impact and token consumption.
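The before-and-after probe can be sketched as a comparison of the rate-limit headers from two small ordinary model calls, one made just before the run and one just after. The sketch assumes those calls return `x-ratelimit-remaining-requests` and `x-ratelimit-remaining-tokens` headers and that you have captured them as plain dicts; the function names are hypothetical.

```python
from typing import Dict, Mapping, Optional


def remaining(headers: Mapping[str, str], name: str) -> Optional[int]:
    """Read one integer header value, matching names case-insensitively."""
    lowered = {key.lower(): value for key, value in headers.items()}
    value = lowered.get(name)
    if value is None:
        return None
    try:
        return int(value)
    except ValueError:
        return None


def estimate_run_cost(
    before: Mapping[str, str], after: Mapping[str, str]
) -> Dict[str, Optional[int]]:
    """Estimate what was consumed between two probe calls.

    Returns the drop in remaining requests and tokens, or None for a
    value when either probe did not report the header.
    """
    result: Dict[str, Optional[int]] = {}
    for kind in ("requests", "tokens"):
        name = f"x-ratelimit-remaining-{kind}"
        b, a = remaining(before, name), remaining(after, name)
        result[kind] = None if b is None or a is None else b - a
    return result
```

The difference includes the second probe call itself and ignores any allowance that replenished while the run was in progress, so it is only a rough estimate, and other traffic on the same account will distort it.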