Not 100 in parallel?
Consider the limit as continuous rather than discrete. I would derive a per-millisecond rate from your organization's requests-per-minute and tokens-per-minute limits, then keep a record of the "expense" of each API call so you can queue calls and adjust the depth of a parallel handler.
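Here's a minimal sketch of that idea, assuming example limit values and that you can estimate the token expense of a call up front; it's a continuously-refilling budget rather than any official mechanism:

```python
# Pace calls by a continuous per-millisecond rate derived from per-minute limits,
# instead of a discrete once-per-minute window. Limit values below are examples.
import asyncio
import time

RPM_LIMIT = 300        # requests per minute (example tier value)
TPM_LIMIT = 150_000    # tokens per minute (example tier value)

REQS_PER_MS = RPM_LIMIT / 60_000    # requests that "refill" each millisecond
TOKENS_PER_MS = TPM_LIMIT / 60_000  # tokens that "refill" each millisecond

class ContinuousLimiter:
    """Token-bucket style limiter, refilled continuously at the per-ms rate."""
    def __init__(self):
        self.last = time.monotonic()
        self.req_budget = RPM_LIMIT / 60.0   # start with ~1 second of headroom
        self.tok_budget = TPM_LIMIT / 60.0
        self.lock = asyncio.Lock()

    def _refill(self):
        now = time.monotonic()
        elapsed_ms = (now - self.last) * 1000
        self.last = now
        # Budgets grow continuously, capped at one full minute of allowance.
        self.req_budget = min(RPM_LIMIT, self.req_budget + elapsed_ms * REQS_PER_MS)
        self.tok_budget = min(TPM_LIMIT, self.tok_budget + elapsed_ms * TOKENS_PER_MS)

    async def acquire(self, expected_tokens: int):
        """Block until this call's expense (1 request + expected tokens) is affordable."""
        while True:
            async with self.lock:
                self._refill()
                if self.req_budget >= 1 and self.tok_budget >= expected_tokens:
                    self.req_budget -= 1
                    self.tok_budget -= expected_tokens
                    return
            await asyncio.sleep(0.05)

limiter = ContinuousLimiter()

async def guarded_call(expected_tokens: int):
    await limiter.acquire(expected_tokens)
    # ... make the actual API call here; adjust the spent tokens with real usage if you like ...
```

The queue depth of your parallel workers can then be whatever you want, since each task waits on the shared limiter before firing.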
The difficulty is compounded because your own limiter has to work from the input you are about to send, while the server's current limit reflects past completed calls, so there is a delay before your usage shows up. You may also have async parallel calls still in flight from a batch job or on-demand use, and some tiers' per-minute limits are close to allowing only one maximum-context call per minute.
The response headers return various rate-limit stats. For a particular model (or model class that shares a limit), you could start backing off further when the "remaining" values are low, instead of strictly controlling by measuring your input; see the sketch after the headers below.
x-ratelimit-limit-requests: 300
x-ratelimit-limit-tokens: 150000
x-ratelimit-remaining-requests: 299
x-ratelimit-remaining-tokens: 149605
x-ratelimit-reset-requests: 200ms
x-ratelimit-reset-tokens: 157ms
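As a rough sketch of header-driven backoff, assuming the headers arrive as a dict and with threshold fractions that are just guesses to tune per tier:

```python
# Adaptive backoff from the x-ratelimit-* response headers.
# Thresholds and the parse_reset() format handling are assumptions; adjust per tier.
def parse_reset(value: str) -> float:
    """Convert a reset string like '200ms', '157ms', '1s', or '6m0s' to seconds."""
    total, num, i = 0.0, "", 0
    while i < len(value):
        ch = value[i]
        if ch.isdigit() or ch == ".":
            num += ch
            i += 1
        else:
            unit = "ms" if value[i:i + 2] == "ms" else ch
            factor = {"ms": 0.001, "s": 1, "m": 60, "h": 3600}.get(unit, 1)
            total += float(num or 0) * factor
            num = ""
            i += len(unit)
    return total

def backoff_delay(headers: dict) -> float:
    """Return how long to sleep before the next call, based on remaining headroom."""
    limit_req = int(headers.get("x-ratelimit-limit-requests", 1))
    limit_tok = int(headers.get("x-ratelimit-limit-tokens", 1))
    remaining_req = int(headers.get("x-ratelimit-remaining-requests", limit_req))
    remaining_tok = int(headers.get("x-ratelimit-remaining-tokens", limit_tok))
    reset = max(parse_reset(headers.get("x-ratelimit-reset-requests", "0s")),
                parse_reset(headers.get("x-ratelimit-reset-tokens", "0s")))

    # Back off progressively as the remaining fraction shrinks (thresholds are guesses).
    frac = min(remaining_req / max(limit_req, 1), remaining_tok / max(limit_tok, 1))
    if frac < 0.05:
        return reset          # nearly exhausted: wait out the reset
    if frac < 0.25:
        return reset * 0.5    # getting low: slow the queue down
    return 0.0                # plenty of headroom: full speed
```

Feed the headers from each completed response into something like this and use the returned delay to throttle the queue of pending calls.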
Just a simple loop of affordable API calls couldn't make a dent in the "remaining" values for me, because the reset is fast and generation is slow. (I did hit several "500 server error" responses across 3.5 and 4 though, about 15% of calls.)