API rate limits / API speed

Need to do some planning for an upcoming project. Let’s say I have 1.5 million documents I want analyzed.

They don’t really share a common context.
So what’s the best way to do so if I don’t need any direct output?

I was thinking about a 9-to-1 ratio of GPT-3.5 to GPT-4 (8k) requests. Does it make sense to plan a large environment with multiple runners yet, or would a single server with multiple instances run into rate limits anyway?

How long does it approximately take to get a rate limit raised, if that is possible at all?
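As a rough sketch of that 9-to-1 split, a per-document routing rule could look like this (the model names and modulo rule are just an illustrative assumption, not a recommendation from the thread):

```python
def pick_model(doc_index, gpt4_share=10):
    """Route roughly 1 in `gpt4_share` documents to GPT-4, the rest to GPT-3.5."""
    return "gpt-4" if doc_index % gpt4_share == 0 else "gpt-3.5-turbo"

# Over 100 documents this yields a 10/90 split.
models = [pick_model(i) for i in range(100)]
```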

Define what you mean by “analyzed”.

Rate limits are applied per API organization, so multiple machines won’t help with rate limits at all. (They can help with total throughput, since API calls can be slow, but you should be able to run a good number of requests on a single decent server, or go serverless.)


Here is a pseudo prompt:

Hey Bot, here is the content of a document,


analyze it for x… and y… and give me the answer in following json format:


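A minimal way to turn that pseudo prompt into code; the template wording, the criteria names, and the JSON schema below are illustrative assumptions, not the poster’s actual prompt:

```python
import json

# Hypothetical prompt template mirroring the pseudo prompt above.
PROMPT_TEMPLATE = (
    "Hey Bot, here is the content of a document:\n\n"
    "{document}\n\n"
    "Analyze it for {criteria} and give me the answer in the following JSON format:\n"
    "{schema}"
)

def build_prompt(document, criteria, schema):
    """Fill the template with one document, the analysis criteria, and a JSON schema."""
    return PROMPT_TEMPLATE.format(
        document=document,
        criteria=" and ".join(criteria),
        schema=json.dumps(schema, indent=2),
    )

prompt = build_prompt("Quarterly report text ...", ["x", "y"], {"x": "", "y": ""})
```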

Yeah, 60 RPM, and one request takes about 15 seconds.
So 15 parallel processes with 4 requests each per minute. Should run on a good server, yeah.

And if everything runs really smoothly, it takes roughly 20 days.
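That estimate checks out as a back-of-the-envelope calculation: at a sustained 60 requests per minute, the pure request time alone is about 17.4 days, so roughly 20 days including retries and downtime is plausible:

```python
docs = 1_500_000
requests_per_minute = 60               # the rate limit discussed above

minutes = docs / requests_per_minute   # 25,000 minutes at full utilization
days = minutes / 60 / 24

print(round(days, 1))  # -> 17.4
```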

And that doesn’t even include the GPT-4 rate limits and its slow API.

So how do I speed it up a little? Say, to the point where 1.5 million docs are handled by a thousand instances with 15 processes each?

I just want to know the possibilities.

Still a bit ambiguous. I get the gist, though: you are transforming 1.5 million documents from a current [presumably] text format into JSON comprising arbitrary metrics or analytics, right?

If so, what aspect of this process/content convinced you that an LLM was the right vehicle to achieve this transformation?

As far as I’ve checked with some test prompts, I believe I’ve developed a feel for when to expect hallucinations. Combined with statistical analysis of words in a preprocessing step, and given that the business case isn’t hurt if it produces wrong results here and there, I would say yes: definitely the right tool for the job in that chain of tools.

I was wondering “why” it was the right tool? Is it something about the document structures that prevent faster, cheaper, more traditional transformation approaches?

I couldn’t get parallel API calls working at all, with my single account, no matter what rate limit. If I have to make 50 API calls, I’d prefer to have all 50 at the same time and get all results in 10 seconds, instead of sequentially, with total processing time of 500 seconds.

I assume “analyzing 1.5 million documents” is a one-time effort; you could create 100 temporary accounts with different API keys and then run this in parallel.


Yes, but I will use traditional approaches as well.

How did you test that? Threading?
Multiple instances on different ports?
From different IPs?

I mean, without parallel API calls, 2 RPM at best, and up to 45-second API response times, why do we even have a 60 RPM limit?

Got access to Azure-hosted OpenAI models.
They are even slower. I’d say 50% the speed of the OpenAI API.

Going to test both for parallel requests today.

Maybe different models in parallel?

I am using Python; I’ve used asyncio and pandarallel for many different things. The former is technically just asynchronous, but when used for external API calls, the second request goes out while waiting for the first response to come back. This effectively achieves multiple API calls at the same time. Again, I couldn’t get this to work.
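For reference, here is a minimal sketch of the asyncio fan-out pattern described above, with a stand-in coroutine instead of a real OpenAI call (the function names are made up for illustration). Five 0.2-second "calls" complete in roughly 0.2 seconds total instead of 1 second, because each `await` yields control while waiting:

```python
import asyncio
import time

async def fake_api_call(i, delay=0.2):
    # Stand-in for one HTTP request; await yields control so the
    # other "requests" proceed while this one waits.
    await asyncio.sleep(delay)
    return f"response {i}"

async def run_all(n=5):
    start = time.perf_counter()
    # gather() schedules all coroutines concurrently on one event loop.
    results = await asyncio.gather(*(fake_api_call(i) for i in range(n)))
    return results, time.perf_counter() - start

results, elapsed = asyncio.run(run_all())
```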

Depending on what you do with the OpenAI APIs, an embedding request is much faster than a completion request. You can watch this screen capture; I fast-forwarded a portion of the video, but you can pause and check the left-side timestamp to get an idea of the response times.

My embedding code (single thread) occasionally reached 60 RPM, though after 48 hours, the allowance would be much higher, as a pay-as-you-go user.


I did a simple test with PHP: I started three different built-in PHP servers on ports 5010, 5011, and 5012, placed the windows next to each other on my screen, did a quick click + F5 on all of them, and they finished simultaneously.

So parallel calling of the API works for me. (Actually, I tried that on the gpt-35 model hosted on Azure, which I got access to today; but using the same port with the same click + F5 did not work, so… well… you’ll need to check that for yourself, or wait until I do and post here.)

Now I will use symfony/process, create a Symfony command, and spawn multiple CLI instances, which should also work.

I guess everything on one instance could work as well (the session needs to be ended then).

I am not that into Python, but here is something that should be pretty close:

import subprocess
import requests
import time

# The scripts to be run
scripts = ["script1.py", "script2.py", "script3.py"]

# The ports they are running on
ports = [5005, 5006, 5007]

# Start all scripts
processes = [subprocess.Popen(["python", script]) for script in scripts]

try:
    # Wait until all servers are online and ready to receive requests
    # We assume that each script sets up a server listening on its respective port
    for port in ports:
        while True:
            try:
                response = requests.get(f"http://localhost:{port}")
                if response.status_code == 200:
                    break
            except requests.exceptions.RequestException:
                time.sleep(1)  # wait a bit before trying again

    # Here you should put your code to send the actual requests and handle the responses
    # For this example we'll just get the root '/' endpoint
    responses = [requests.get(f"http://localhost:{port}") for port in ports]
    for response in responses:
        print(response.text)
finally:
    # Ensure we clean up the child processes in any case
    for process in processes:
        process.terminate()
Spawn three different Python instances on different ports and call them with that script (ChatGPT wrote that, haha; damn, I’m lazy).
But it looks pretty good. I think Symfony also uses popen under the hood.


Is PHP your preferred language? If so, I’d have a database of document content (or IDs) and set up a basic database queue (it’s been a while since I’ve worked with Symfony, so I’m not sure what they offer these days, but it’s pretty easy with Laravel). Set up multiple workers, use openai-php/client to make the requests, and store the results in the DB. The tricky part is obviously managing the global rate limit, but also individual API request failures, since the API is not rock solid.
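The thread is PHP-centric here, but the queue-plus-workers idea can be sketched in Python with the stdlib (the function names are illustrative; a real version would make the API call and write to the DB where the placeholder comment is):

```python
import queue
import threading

def worker(jobs, results):
    """Pull document IDs off the queue until a None sentinel arrives."""
    while True:
        doc_id = jobs.get()
        if doc_id is None:          # sentinel: shut this worker down
            jobs.task_done()
            break
        # Placeholder for the real work: call the API, store result in the DB.
        results.append((doc_id, f"processed {doc_id}"))
        jobs.task_done()

def run_pool(doc_ids, n_workers=4):
    jobs = queue.Queue()
    results = []                     # list.append is thread-safe in CPython
    threads = [
        threading.Thread(target=worker, args=(jobs, results))
        for _ in range(n_workers)
    ]
    for t in threads:
        t.start()
    for doc_id in doc_ids:
        jobs.put(doc_id)
    for _ in threads:                # one sentinel per worker
        jobs.put(None)
    for t in threads:
        t.join()
    return results

results = run_pool(list(range(20)), n_workers=3)
```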

Of course PHP is my preferred language :wink:

Got RabbitMQ in my Docker Compose. If rate limits go up, I’d go for that and distribute to multiple instances (maybe Azure Functions, using Go).

I mean, you don’t even need multiple servers, just multiple workers on one server, since a single request takes so long. Rate limits are returned in the response headers, so you can use those for a dynamic scaling/backoff strategy. Add in some delays on 429s and 5xx errors, and just let ’er rip as fast as you can go if that’s the priority.

I wouldn’t expect any rate limit increases in the short term, seems like they are at capacity basically. So get creative/smart in your requests and queue strategy.


Adding multiple docs into one evaluation might help. I guess that would speed things up a lot.
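Batching like that is straightforward to sketch; assuming the batched documents fit within the model’s context window, a helper like this (the name is made up) chunks them for one-prompt-per-batch requests:

```python
def batch(docs, size):
    """Yield successive chunks of `size` documents to pack into one prompt."""
    for i in range(0, len(docs), size):
        yield docs[i:i + size]

# 10 documents in batches of 4 -> 3 requests instead of 10.
batches = list(batch([f"doc{i}" for i in range(10)], 4))
```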

Hello Jochen,

For your project, I recommend using vector tools like Weaviate or TensorFlow and training the data model with them. That way, you only use the GPT-3 API to get an answer about the data you have already summarized through Weaviate. I already have such a project running.

Good luck with your project.



I did something similar. I store keywords for certain document types, take the keyword density, cluster certain kinds of keyword combinations, and store the cluster density. All in MySQL.
I was already categorizing websites by context this way 20 years ago, and it’s enough for my purposes. Besides, I’m hopelessly in love with SQL.
It’s nothing but ML, done by hand.
Later I can check whether a document contains similar keywords (I also use a Damerau-Levenshtein function and synonyms).
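The keyword-density part of that pipeline can be sketched in a few lines of Python; this is a simplification of what the poster describes (their real version lives in MySQL, with clustering and fuzzy matching on top):

```python
from collections import Counter

def keyword_density(text, keywords):
    """Fraction of the document's words that match each keyword (naive whitespace tokenizer)."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words) or 1          # avoid division by zero on empty input
    return {kw: counts[kw] / total for kw in keywords}

density = keyword_density("sql sql keywords clustering sql", {"sql", "mysql"})
```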


So the reality is that rate limits are currently being applied randomly: I have just as good a chance of being rate-limited on my first request as on my 100th.

I was going to do something smart in my retry logic, but I figured: if they’re ignoring my remaining request limits, why should I bother being smart? I added a simple retry policy with increasing delays, and I haven’t had a 429 in two days.

For reference, my policy is 3 tries, with a 2-second delay after the first try and a 5-second delay after the second. As I said, all my 429s have gone away.
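That policy (3 tries, a 2-second delay, then a 5-second delay) can be sketched like this; `call_with_retry` and the flaky demo function are hypothetical helpers, not the poster’s actual code:

```python
import time

def call_with_retry(func, delays=(2, 5)):
    """Call func; after the i-th failure, sleep delays[i] and retry.
    Total tries = len(delays) + 1; the final failure propagates."""
    for delay in delays:
        try:
            return func()
        except Exception:
            time.sleep(delay)
    return func()

# Demo: a call that 429s twice, then succeeds on the third try.
attempts = {"n": 0}

def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("HTTP 429")
    return "ok"

result = call_with_retry(flaky, delays=(0, 0))  # zero delays just for the demo
```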
