Just as a friendly suggestion: usage caps can unintentionally encourage consumers to use more in an attempt to max out their quota. Instead of hard caps, I propose progressively increasing delays on service requests, a bit like the 10-second timers next to free online downloads. Requests would eventually take prohibitively long, and the load times could be dynamically tuned to produce the same overall usage as the current caps. This would bother customers less than a hard limit, because it still allows important requests to complete after the limitation has kicked in. A progressive increase in response time would also serve as a feedback mechanism for usage, one that already influences behavior at low limitation levels: when things start to feel slow, you hold back a bit to keep your wait times down. That is the opposite of the incentive we have now, where not having used much of one's quota yet encourages making more requests so as not to waste it.
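To make the idea concrete, here is a minimal sketch of what such a delay schedule could look like. Everything here (the function name, the `base` and `exponent` parameters) is my own illustration, not anything the provider actually implements: the delay stays near zero at low usage and blows up as usage approaches the quota, instead of cutting the user off.

```python
def dynamic_delay(used: float, quota: float,
                  base: float = 0.5, exponent: float = 3.0) -> float:
    """Hypothetical response delay (seconds) that grows with quota usage.

    Near zero at low usage; rises steeply as `used` approaches `quota`,
    so requests become prohibitively slow rather than failing outright.
    """
    fraction = min(used / quota, 1.0)
    # Polynomial ramp-up, with a pole near 100% usage so the delay
    # becomes effectively unbounded instead of a hard cutoff.
    return base * (fraction ** exponent) / (1.0 - fraction + 1e-6)
```

The shape of the curve is the tuning knob: a gentle ramp gives early feedback, while the steep tail near the quota replaces the hard limit.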
Interesting take, but more tasks require bursty usage than not: many applications make a relatively small number of grouped calls to the API to process data, and very few real-world applications are vanilla “chat bots” with human-scale time between events. Rate limiting is standard across most commercial endpoints, and it can be implemented at large scale with very little computing overhead.
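For reference, the usual low-overhead scheme behind commercial endpoints is a token bucket, which accommodates exactly that bursty pattern. This is a generic sketch of the technique, not any particular provider's implementation; the class and parameter names are mine:

```python
import time

class TokenBucket:
    """Minimal token-bucket rate limiter.

    Allows bursts of up to `capacity` requests, then refills permission
    at `rate` tokens per second -- a few integers and one timestamp
    per client, hence the negligible overhead at scale.
    """

    def __init__(self, rate: float, capacity: float):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity      # start with a full bucket
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        """Consume `cost` tokens if available; reject otherwise."""
        now = time.monotonic()
        # Refill based on elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A burst of `capacity` grouped calls goes through immediately; sustained traffic above `rate` gets rejected (or, in a delay-based variant, queued).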
Thanks for the reply.
The API version is a different platform that does not have the same usage caps; there you pay per token and set your own limits. I was talking about the consumer version, the “civilian vanilla chat bots” one.