My goal is to make my own LangChain chatbot assistant/agent, and so far I have managed to scrape the entire LangChain documentation into human-readable text files parsed from the HTML. They are saved across 27 chapter directories, and I also have the master URL for the API reference page.
I was thinking of using a lazy-load implementation, so I would never need to ingest the whole thing at once, and it would always stay current with updated documentation. I am also open to exploring any and all possible solutions, but I want to make it work with a basic free-tier OpenAI account at 3 API requests per minute, purely on principle and as a proof of concept.
Also, forgive my naivete; I’m pretty new to this LLM paradigm, but want to learn and share. Thanks in advance!
IIRC the standard rate limits as of now are 3 RPM, 150,000 TPM, and 200 RPD for text-embedding-ada-002.
Here’s what you can do:
Batch your requests: You can send an array of inputs instead of a single string, with each element under 8,192 tokens. The array can be up to 1,500 elements long AFAIK.
Honor your request rate limit: Make sure you are sending only 3 requests/min to the API endpoint. You can do this by adding a 60-second cooldown after every 3 API calls.
Honor token rate limits: Make sure the total tokens sent per minute stay under 150,000. You can do this by counting tokens with tiktoken before each request; see the sketch after this list.
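Putting those three points together, here is a minimal sketch of a batched, rate-limited embedding loop. It assumes the `openai>=1.0` Python client and an `OPENAI_API_KEY` in the environment; the batch size and window bookkeeping are illustrative, not tuned.

```python
import time

import tiktoken
from openai import OpenAI  # assumes the openai>=1.0 Python client

client = OpenAI()  # reads OPENAI_API_KEY from the environment
enc = tiktoken.encoding_for_model("text-embedding-ada-002")

REQUESTS_PER_MINUTE = 3        # free-tier RPM cap
TOKENS_PER_MINUTE = 150_000    # free-tier TPM cap

def embed_in_batches(texts, batch_size=100):
    """Embed `texts` while honoring the free-tier RPM and TPM limits.

    Each element must stay under the 8,192-token per-input limit, and
    batch_size * typical-chunk-size should stay under the TPM cap.
    """
    embeddings = []
    requests_used = 0
    tokens_used = 0
    window_start = time.monotonic()

    for i in range(0, len(texts), batch_size):
        batch = texts[i : i + batch_size]
        batch_tokens = sum(len(enc.encode(t)) for t in batch)

        elapsed = time.monotonic() - window_start
        if elapsed >= 60:
            # A fresh minute: reset the window counters.
            requests_used, tokens_used = 0, 0
            window_start = time.monotonic()
        elif (requests_used >= REQUESTS_PER_MINUTE
              or tokens_used + batch_tokens > TOKENS_PER_MINUTE):
            # Hitting either cap: sleep out the rest of the minute.
            time.sleep(60 - elapsed)
            requests_used, tokens_used = 0, 0
            window_start = time.monotonic()

        resp = client.embeddings.create(
            model="text-embedding-ada-002", input=batch
        )
        embeddings.extend(d.embedding for d in resp.data)
        requests_used += 1
        tokens_used += batch_tokens

    return embeddings
```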
For lazy loading, you can embed only the headings and subheadings, so you never need to embed the whole doc at once. Once you get matches for specific headings based on the user query, embed that section's body by splitting it into appropriately sized chunks, store those embeddings for future queries, and load the relevant chunks into the chat-completion context.
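A rough sketch of that two-stage flow, reusing `embed_in_batches` from above. The `docs` dict (heading → body text) is a hypothetical stand-in for your scraped files, and the fixed-width character chunking is a placeholder for a real splitter such as LangChain's `RecursiveCharacterTextSplitter`.

```python
import numpy as np

def cosine(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# docs: {heading: body_text}, built from the scraped files (hypothetical)
headings = list(docs)
heading_index = dict(zip(headings, embed_in_batches(headings)))

chunk_cache = {}  # heading -> [(chunk, embedding), ...], filled on demand

def retrieve(query, top_k=3):
    q_emb = embed_in_batches([query])[0]
    # Stage 1: match the query against heading embeddings only.
    best = max(heading_index, key=lambda h: cosine(q_emb, heading_index[h]))
    # Stage 2: chunk and embed that section's body lazily, caching for reuse.
    if best not in chunk_cache:
        body = docs[best]
        chunks = [body[i : i + 2000] for i in range(0, len(body), 2000)]
        chunk_cache[best] = list(zip(chunks, embed_in_batches(chunks)))
    ranked = sorted(chunk_cache[best],
                    key=lambda pair: cosine(q_emb, pair[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]
```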
48 hours after establishing a payment method, the rate limits jump to
3,500 RPM
350,000 TPM
according to the documentation, though my portal reports, for text-embedding-ada-002,
1,000,000 TPM
3,000 RPM
So, I would encourage the user to set up a pay-as-you-go account and just not worry about it.
Though even on the free tier, they should be able to power through it in about 7–10 minutes (depending on how much overlap there is between the embedded chunks), doing just one maximally sized batched request per minute.
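As a back-of-envelope check on that figure (the total doc size here is an assumption, not a measured number):

```python
total_tokens = 1_200_000      # assumed size of the scraped docs, give or take
tokens_per_request = 150_000  # free-tier TPM cap, one batched request/minute
print(total_tokens / tokens_per_request)  # -> 8.0 minutes, i.e. ~7-10 in practice
```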
Good point. I will most likely go this route in the future. Makes a lot of sense. I am still glad to know that it is possible to do in a few minutes without setting up a payment account, but I see the advantage to having one. Thank you!