Given the tremendous load on GPT 3.5 / 4 the community sees quite a few timeouts.
I am building a tool that integrates with the API and wanted to know what reasonable SLA assumptions my program can make:
Time to timeout → How long should I wait for the first character of a streaming response before timing out the request?
Number of concurrent requests → When generating a large summary I tend to need to chunk the input. How many concurrent requests should I allow?
Cooldown → What is the correct strategy for cooldown? If the API returns a "too busy" error, should I wait 1 second? 5 seconds? Longer?
Ideally, some guidance like this would help:
- Wait up to 5 seconds for first char
- Feel free to run 4 concurrent from 1 IP
- Cooldown of 5 seconds is fine
Is this documented somewhere? Should the API document itself and report concurrency constraints (via rate limits) or the expected cooldown in the HTTP error response?
Added bonus… if I tear down a connection prior to getting a response, do I still pay? If so how much?
Have you checked the docs for Rate Limits?
Cooldown: Exponential backoff. There are three examples in the docs on how to do this. But in general, try 5 seconds later, then 30 seconds, then 2 minutes, etc.
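Here is a minimal sketch of that retry schedule, not the version from the docs. `TooBusyError` and the `call` argument are hypothetical stand-ins for whatever "too busy" error and request function your client library gives you; the 5 s / 30 s / 2 min delays are just the rough numbers above.

```python
import random
import time


class TooBusyError(Exception):
    """Hypothetical stand-in for a rate-limit or overloaded error."""


def call_with_backoff(call, delays=(5, 30, 120), jitter=0.0, sleep=time.sleep):
    """Call `call()`; on TooBusyError, wait and retry following `delays`.

    Once every delay has been used, one last attempt is made and any
    error from it propagates to the caller.
    """
    for delay in delays:
        try:
            return call()
        except TooBusyError:
            # Optional random jitter keeps many clients from retrying in lockstep.
            sleep(delay + random.uniform(0, jitter))
    # Final attempt: no more retries after this one.
    return call()
```

Passing `sleep` as a parameter is only there so the schedule can be tested without actually waiting.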
Concurrency: I haven’t had issues with multiple simultaneous calls from the same IP (I usually make 3 at a time from the same IP). I would just try to stay within your TPM (tokens per minute) and RPM (requests per minute) limits, per model, as these quotas are enforced against your Org/Key.
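One simple way to cap concurrency when chunking is a small thread pool. This is only a sketch: `summarize` is a hypothetical function that sends one chunk to the API, and the default of 3 is just the number I happen to use, not a documented limit. It does not track TPM or RPM for you.

```python
from concurrent.futures import ThreadPoolExecutor


def summarize_chunks(chunks, summarize, max_concurrent=3):
    """Run `summarize` over `chunks` with at most `max_concurrent` in flight.

    Results come back in the same order as `chunks`.
    """
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        return list(pool.map(summarize, chunks))
```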
As for streaming: My understanding is that the latency is very much regional. So no one-size-fits-all answer here. But you could measure it, and use this as your guideline. For example, suppose these errors occur less than 5% of the time. Then gather the set of first token delays from streaming. Take the 95th percentile of this data, and for margin, multiply by something like 1.2. Then if anything exceeds this number, consider it a failed API call.
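The percentile-plus-margin calculation could look roughly like this. It assumes you have already logged first-token delays (in seconds) from your own region; the function name and the nearest-rank percentile method are my choices for the sketch.

```python
import math


def first_token_timeout(delays, percentile=95, margin=1.2):
    """Return a timeout: the given percentile of `delays`, times `margin`.

    Uses the nearest-rank method: the smallest sample such that at least
    `percentile` percent of the samples are less than or equal to it.
    """
    if not delays:
        raise ValueError("need at least one measured delay")
    ordered = sorted(delays)
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return ordered[rank - 1] * margin
```

Any streaming request whose first token takes longer than the returned value would then be treated as a failed call.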
Another thing you should consider is model redundancy. I sometimes make 2 redundant model calls. For example, send your request to GPT-4-8k and GPT-4-32k at the same time. So no retries here, just model redundancy and most of the time one will work.
You can also have model fallback. So if there are issues, downgrade to DaVinci or GPT-3.5-Turbo (or both at the same time for redundancy).
It costs more to do it this way, but there is no latency in waiting for something to fail and then retry when only doing model redundancy.
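A rough sketch of the redundancy idea: fire all the calls at once and take whichever succeeds first. Each entry in `calls` is a hypothetical zero-argument function wrapping one model request (for example, one for GPT-4-8k and one for GPT-4-32k). Note that this sketch does not cancel the slower call; it simply ignores its result.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def first_success(calls):
    """Run every callable in `calls` at once; return the first good result.

    If every call raises, the last error seen is re-raised.
    """
    if not calls:
        raise ValueError("need at least one call")
    pool = ThreadPoolExecutor(max_workers=len(calls))
    futures = [pool.submit(call) for call in calls]
    last_error = None
    try:
        for future in as_completed(futures):
            try:
                return future.result()
            except Exception as exc:
                # This model failed; keep waiting for the others.
                last_error = exc
        raise last_error
    finally:
        # Do not block on the slower calls; they finish in the background.
        pool.shutdown(wait=False)
```

For fallback instead of redundancy, the same helper can be called a second time with the cheaper models only after the first group has failed.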