How does the usage limit work

Like many people I’m running into the issue where I can’t utilize the 128k context limit with the API because my usage tier is too low. But what doesn’t make sense is how it works. I have 10,000 TPM, or tokens per minute. That should mean that if I have a 100,000-token document, I should be able to upload it in chunks of 10,000 tokens, waiting a minute before I can send the next chunk. But for some reason I simply can’t chat with the API anymore after I’ve sent around 10,000 tokens, even after I waited a minute. Can someone please explain why this is the case? Is the TPM a rate or a hard limit… and if it’s a hard limit, why is it tokens *per minute* instead of just a “token limit”?

There is no “uploading” to an AI model through the API. A chat completions endpoint has no memory.

If you say to a chatbot “here’s the first document I want to talk about”, then “here’s the second document I want to talk about”, you are just getting yourself billed for telling the AI about content that will only come later, content you could assemble yourself into a single API call.
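For instance, here is a minimal sketch of assembling the pieces yourself and sending them as one request; the chunk contents and prompts are just placeholders:

```python
# Instead of "uploading" pieces turn by turn, concatenate the document yourself
# and send it as a single message in a single chat completions request.
document_chunks = [
    "First part of the document...",
    "Second part of the document...",
]

messages = [
    {"role": "system", "content": "You answer questions about the provided document."},
    {"role": "user", "content": "\n".join(document_chunks) + "\n\nSummarize this document."},
]
# `messages` can now be sent in one API call instead of many partial ones.
```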

Yes, you will hit the rate limit and be blocked if you send a request that appears larger than your tokens-per-minute limit, counting both the input and the maximum response size you specify, because it would take your remaining quota below zero.
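A rough pre-flight check makes this concrete. This is a minimal sketch, assuming the tiktoken tokenizer; the 10,000 TPM figure and max response size are just the example values from this thread:

```python
# Estimate a request's token footprint (input + reserved response) before sending it.
import tiktoken

TPM_LIMIT = 10_000            # example tokens-per-minute limit for a low tier
MAX_RESPONSE_TOKENS = 1_000   # the max_tokens you plan to request back

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by recent chat models

def request_footprint(messages: list[dict]) -> int:
    """Rough count: input tokens plus the reserved response size."""
    input_tokens = sum(len(enc.encode(m["content"])) for m in messages)
    return input_tokens + MAX_RESPONSE_TOKENS

messages = [{"role": "user", "content": "First chunk of the document..."}]
if request_footprint(messages) > TPM_LIMIT:
    print("This single request already exceeds the per-minute quota and will be rejected.")
```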

Assistants may be a way around this, in that they actually let you upload files and maintain a chat thread on OpenAI’s server. However, the Assistants framework’s own model calls that use those documents would still face your rate limit; the difference, just like something you could program yourself, is that the knowledge retrieved from documents can come back in small chunks.


Thanks for responding, it really helps! Anyway, the document was just an example. I know you can’t submit documents; rather, I meant that I copy and paste portions of the text from the document into the chat. Eventually I would get the entire document into the chat history. The problem is that even after I wait a minute, I still can’t continue to feed it more words from the document, no matter how small I make the chunks. It just tells me that I’m past my 10,000 TPM and that I should try again in 6ms. (And yes, I even waited six minutes and I still couldn’t send more information.)

That’s what I’m saying: every time you “paste portions of the text from the document into the chat”, you are just wasting money if you then send that to the AI.

The chat history must be held in your own software. You must resend the growing chat history every time you “paste more”, which means spending 8,000 tokens just on context without getting a useful answer yet.

A responsible chat-management strategy, like the one ChatGPT uses, is then to count the tokens of those chat history items and discard whatever cannot be sent, whether the limit is 6,000 input tokens so you can still get a 2,000-token response from GPT-4-8k, or the maximum “rate” you can send without being blocked.
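A minimal sketch of that trimming idea, assuming the tiktoken tokenizer; the 6,000-token budget mirrors the GPT-4-8k example above and would change with your model and tier:

```python
# Keep the newest messages that fit under an input budget, dropping the oldest first.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def trim_history(messages: list[dict], input_budget: int = 6_000) -> list[dict]:
    kept: list[dict] = []
    total = 0
    for msg in reversed(messages):            # walk newest -> oldest
        cost = len(enc.encode(msg["content"]))
        if total + cost > input_budget:
            break                             # everything older is discarded
        kept.append(msg)
        total += cost
    return list(reversed(kept))               # restore chronological order
```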

You can increase your usage tier and rate limit by pre-paying more into your account. Check at the bottom of your API account’s “limits” page under settings.

If you put 10,000 tokens in and expect 1 token out, you have just broken your tier limit. “Tokens”, for the TPM quota, means input + output.

So at best: send in X tokens, get out Y tokens, and keep X + Y < 10,000 at all times until you get to a higher tier.

So you have to erase any previous history (just drop it from the next request, and make sure X + Y < TPM) to maintain this.
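One way to keep score of X + Y, sketched here under the assumption of the openai v1 Python client, is to read the usage block the API returns with each response; the model name and limit are placeholders:

```python
# Accumulate input + output tokens (as billed) against the per-minute budget.
from openai import OpenAI

client = OpenAI()
TPM_LIMIT = 10_000       # example tier limit from this thread
spent_this_minute = 0    # in real code, reset this when a new minute window starts

def ask(messages: list[dict], max_tokens: int = 500) -> str:
    global spent_this_minute
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",   # placeholder model
        messages=messages,
        max_tokens=max_tokens,
    )
    spent_this_minute += response.usage.total_tokens  # X + Y for this call
    if spent_this_minute >= TPM_LIMIT:
        print("Per-minute budget used up; hold further requests until it clears.")
    return response.choices[0].message.content
```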

Oh, I see what you’re saying! So for example, if I have a 10,000 TPM limit and I send 5,000 tokens and the AI responds with 1,000 tokens, that means I’m at 6,000 tokens. But, within the same minute, even if my next input and output totalled as little as 30 tokens, I would still be past my rate limit, because the AI will process not just the new tokens I send but also the history tokens: 6,000 + 6,030 > 10,000. That means the reason I can’t send tokens even after a minute is that any request I send thereafter will surpass my token limit, because it has to process all of the tokens from the history, which by then is again well over 10,000.

So you’re saying the only solution to this is either to make an application that handles the memory better or to increase my rate limit.

Yep, or you can do a FIFO buffer with timestamps that figures out when your held-back query can be submitted, with a little whirlygig animation showing that the question is waiting for the next time slot.

What you receive is counted from when you receive it, though.
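A minimal sketch of that FIFO idea, keeping timestamps and token costs in a rolling 60-second window; all names and numbers here are illustrative:

```python
# Decide how long a held-back request must wait before it fits under the TPM limit.
import time
from collections import deque

TPM_LIMIT = 10_000  # example tier limit

class TokenWindow:
    def __init__(self) -> None:
        self.events: deque[tuple[float, int]] = deque()  # (timestamp, tokens used)

    def _prune(self, now: float) -> None:
        # Drop events older than 60 seconds; they no longer count against the window.
        while self.events and now - self.events[0][0] >= 60:
            self.events.popleft()

    def wait_time(self, tokens_needed: int) -> float:
        """Seconds to wait before tokens_needed fits in the rolling minute."""
        now = time.monotonic()
        self._prune(now)
        used = sum(tokens for _, tokens in self.events)
        wait = 0.0
        for ts, tokens in self.events:        # oldest first
            if used + tokens_needed <= TPM_LIMIT:
                break
            wait = 60 - (now - ts)            # time until this event ages out
            used -= tokens
        return max(0.0, wait)

    def record(self, tokens_used: int) -> None:
        self.events.append((time.monotonic(), tokens_used))
```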

Yes, that is what @_j is basically saying.

The API is stateless and you have to stay within your quotas (all of us do).

For example, I am now at Tier 5, and if I send more than 300,000 TPM to gpt-4-1106-preview, which can easily happen if I have 3 consecutive 120k inputs within a minute, I will get a 429 error back, even though my RPM limit is 40.
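If a burst like that does trip the limit, a simple backoff-and-retry loop handles the 429. This is a sketch assuming the openai v1 Python client; the retry count and delays are arbitrary:

```python
# Retry a chat completion after a 429, waiting for the minute window to clear.
import time
from openai import OpenAI, RateLimitError

client = OpenAI()

def create_with_retry(messages: list[dict], retries: int = 5):
    delay = 5.0
    for _ in range(retries):
        try:
            return client.chat.completions.create(
                model="gpt-4-1106-preview",
                messages=messages,
            )
        except RateLimitError:
            time.sleep(delay)   # back off while the TPM window empties
            delay *= 2          # simple exponential backoff
    raise RuntimeError("Still rate limited after retries")
```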