Thanks for the help @_j. At this point, >99% of my input tokens in each call are hitting the cache, which is awesome because it’s 90% cheaper and quite a bit faster than non-cached input. I’m sticking with 5 concurrent calls at runtime because this program will run 1-2 times per minute on average, and once you get up near 15 requests per minute, you risk getting routed to a different server that doesn’t have the cache stored.
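In case it's useful to anyone else doing the same tuning: the cap on in-flight requests can be enforced with an `asyncio.Semaphore`. This is just a sketch, not my production code; the model name, shared prefix, and prompts below are placeholders.

```python
import asyncio

from openai import AsyncOpenAI

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment
MAX_CONCURRENT = 5      # stay well under the rate that risks a cache-miss reroute
semaphore = asyncio.Semaphore(MAX_CONCURRENT)

# Long static prefix shared by every request so it can hit the prompt cache (placeholder).
SHARED_PREFIX = "<long static system prompt>"


async def cached_call(user_message: str) -> str:
    # The semaphore guarantees at most MAX_CONCURRENT requests are in flight at once.
    async with semaphore:
        resp = await client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model
            messages=[
                {"role": "system", "content": SHARED_PREFIX},
                {"role": "user", "content": user_message},
            ],
        )
        return resp.choices[0].message.content


async def main(batch: list[str]) -> list[str]:
    return await asyncio.gather(*(cached_call(m) for m in batch))


if __name__ == "__main__":
    print(asyncio.run(main(["example question 1", "example question 2"])))
```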
The only other thing that would make a meaningful impact for me is getting data on server strain at different times of the week, so that I can simply avoid doing research/engineering when the servers are overloaded. While OpenAI doesn't make this data public, I did find a post you made last year benchmarking performance throughout the week. Are you open to sharing how you gathered that data? I'd like to collect some more recent data (perhaps a real-time dashboard?) and make it publicly available so that others don't face the same issue. I imagine this would help a lot of people, since the official API status dashboard reports outages but not latency. If you're interested, we could collaborate!
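To make the ask concrete, what I have in mind is a simple probe that fires a small streamed request every few minutes, logs time-to-first-token and total latency with a UTC timestamp, and appends the rows to a file for later binning by hour-of-week. A rough sketch, with the model, interval, and output path as placeholders (I'd adjust to match whatever methodology you used):

```python
import csv
import time
from datetime import datetime, timezone

from openai import OpenAI

client = OpenAI()
PROBE_MODEL = "gpt-4o-mini"   # placeholder model
INTERVAL_SECONDS = 300        # one probe every 5 minutes
LOG_PATH = "latency_log.csv"  # placeholder output path


def probe_once() -> tuple[float, float]:
    """Return (seconds to first token, total seconds) for one small streamed request."""
    start = time.perf_counter()
    first_token = None
    stream = client.chat.completions.create(
        model=PROBE_MODEL,
        messages=[{"role": "user", "content": "Reply with the single word: ok"}],
        stream=True,
    )
    for chunk in stream:
        # Record the moment the first content chunk arrives.
        if first_token is None and chunk.choices and chunk.choices[0].delta.content:
            first_token = time.perf_counter() - start
    total = time.perf_counter() - start
    return first_token or total, total


if __name__ == "__main__":
    while True:
        ttft, total = probe_once()
        with open(LOG_PATH, "a", newline="") as f:
            csv.writer(f).writerow(
                [datetime.now(timezone.utc).isoformat(), round(ttft, 3), round(total, 3)]
            )
        time.sleep(INTERVAL_SECONDS)
```

Binning those rows by hour-of-week should reproduce the kind of weekly profile you posted, and the same log could feed a live dashboard. I'd be curious whether your approach differed.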