Hi all! I’ve been working on a RAG app that takes a ticket, then reads through some documents to find the proper billing code. It has been working pretty well, but yesterday I stopped getting responses on one of my test accounts. I looked up the thread and found that the run was now failing with this error:
" Run failed
Rate limit reached for gpt-4-turbo-preview in organization org-[org-id] on tokens per min (TPM): Limit 30000, Used 4811, Requested 28049. Please try again in 5.72s. Visit https://platform.openai.com/account/rate-limits to learn more."
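For what it's worth, here is how I read the numbers in that message. This is only a rough sketch: I'm assuming the wait is the number of tokens over the limit divided by the per-second refill rate (limit / 60). That is my guess, not something I found in the docs, and the function name is just one I made up for illustration.

```python
# Numbers copied from the error message above.
limit_tpm = 30000   # "Limit 30000"
used = 4811         # "Used 4811"
requested = 28049   # "Requested 28049"


def estimated_wait_seconds(limit_tpm, used, requested):
    """Hypothetical helper: guess the wait as overage / tokens refilled per second.

    Assumes the per-minute budget refills evenly, i.e. limit_tpm / 60 tokens each second.
    """
    overage = used + requested - limit_tpm
    refill_per_second = limit_tpm / 60
    return max(0.0, overage / refill_per_second)


overage = used + requested - limit_tpm
wait = estimated_wait_seconds(limit_tpm, used, requested)
print(f"overage: {overage} tokens, wait: {wait:.2f}s")
# prints "overage: 2860 tokens, wait: 5.72s"
```

That lines up with the "try again in 5.72s" in the error. It also shows that this single request is asking for about 28k tokens, which is nearly the whole 30k per-minute budget by itself.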
I hadn’t changed anything in my implementation, and it is only broken on a few threads, so I am confused as to why this is happening.
For clarity, the prompt for the test is: “Explain [billing code]”. If I just ask something that doesn’t require a lookup, e.g. “Hello!”, it seems to work as expected.
Has anyone else run into this? Do I need to stop reusing threads for a user, or stop using them for so long? I thought the docs said threads were auto-trimmed, but maybe I’m mistaken.
Thanks for any help or clarification!