Sudden rise in "prompt_tokens" mid-stream resulting in "rate_limit_exceeded"

I am using the Node.js SDK.
I got a "rate_limit_exceeded" error when running a thread with these settings:

stream: true,
tools: tools,
tool_choice: "required",
parallel_tool_calls: false,
truncation_strategy: {
  type: "last_messages",
  last_messages: 5,
},
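
For completeness, this is roughly how I start the run with the Node SDK (a minimal sketch; the assistant ID, thread ID, and tools array here are placeholders, not my real values):

import OpenAI from "openai";

const openai = new OpenAI();

// Placeholder; my real function tools are defined elsewhere.
const tools = [];

async function startRun(threadId: string, assistantId: string) {
  const stream = openai.beta.threads.runs.stream(threadId, {
    assistant_id: assistantId,
    tools,
    tool_choice: "required",
    parallel_tool_calls: false,
    truncation_strategy: {
      type: "last_messages",
      last_messages: 5,
    },
  });
  // finalRun() resolves once the run reaches a terminal status.
  return stream.finalRun();
}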

After analyzing the logs, I found that after some steps were completed, the error was raised with "prompt_tokens":12125, which is way higher than the previous completed step's "total_tokens":1836. This was mid-stream, so no other prompts were sent to the model.

The strange thing is that, as far as I can tell, everything the run should have produced had already been produced. So instead of raising this error, I think the run should have just completed at this point.

This is the progression of the usage through the run:

"usage":{"prompt_tokens":318,"completion_tokens":15,"total_tokens":333,"prompt_token_details":{"cached_tokens":0}}
"usage":{"prompt_tokens":362,"completion_tokens":17,"total_tokens":379,"prompt_token_details":{"cached_tokens":0}}
"usage":{"prompt_tokens":408,"completion_tokens":155,"total_tokens":563,"prompt_token_details":{"cached_tokens":0}}
"usage":{"prompt_tokens":621,"completion_tokens":133,"total_tokens":754,"prompt_token_details":{"cached_tokens":0}}
"usage":{"prompt_tokens":783,"completion_tokens":18,"total_tokens":801,"prompt_token_details":{"cached_tokens":0}}
"usage":{"prompt_tokens":831,"completion_tokens":120,"total_tokens":951,"prompt_token_details":{"cached_tokens":0}}
"usage":{"prompt_tokens":981,"completion_tokens":18,"total_tokens":999,"prompt_token_details":{"cached_tokens":0}}
"usage":{"prompt_tokens":1030,"completion_tokens":216,"total_tokens":1246,"prompt_token_details":{"cached_tokens":0}}
"usage":{"prompt_tokens":1277,"completion_tokens":153,"total_tokens":1430,"prompt_token_details":{"cached_tokens":1152}}
"usage":{"prompt_tokens":1459,"completion_tokens":167,"total_tokens":1626,"prompt_token_details":{"cached_tokens":1408}}
"usage":{"prompt_tokens":1653,"completion_tokens":127,"total_tokens":1780,"prompt_token_details":{"cached_tokens":1536}}
"usage":{"prompt_tokens":1812,"completion_tokens":24,"total_tokens":1836,"prompt_token_details":{"cached_tokens":1664}}
"usage":{"prompt_tokens":12125,"completion_tokens":1180,"total_tokens":13305,"prompt_token_details":{"cached_tokens":5760}}

Also, can someone tell me what "cached_tokens" is?

I don’t know if this is a bug or something I don’t understand, but either way I think a "rate_limit_exceeded" event shouldn’t be this fatal. There was no chance for me to intercept it, pause execution for a couple of seconds, and then resume the run.

The run failed and there was no way to resume it, so now I must undo everything it did and start it again, and I will probably hit the same error again!
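
Since a failed run can't be resumed, the closest workaround I can think of is to watch for the failure and re-create the run after a short delay. A rough sketch, built on the hypothetical startRun helper from the first snippet (it still doesn't solve having to undo whatever the failed run already did):

async function runWithRetry(threadId: string, assistantId: string, maxAttempts = 3) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const run = await startRun(threadId, assistantId);
    if (run.status === "completed") return run;
    if (run.last_error?.code === "rate_limit_exceeded" && attempt < maxAttempts) {
      // Back off before re-creating the run; the delay grows with each attempt.
      await new Promise((resolve) => setTimeout(resolve, attempt * 10_000));
      continue;
    }
    throw new Error(run.last_error?.message ?? `Run ended with status ${run.status}`);
  }
}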

This is all based on my understanding of the whole thing, so if I am wrong about something I would really appreciate it if someone would clarify.

Thanks,
Gado

Welcome to the dev forum @mostafa.gado

The increase in tokens is because of the context being populated from the vector store.

Cached tokens are the result of prompt caching on previous inputs.

You can increase rate-limits by upgrading to Tier-2 or above.


Thanks for your response, and sorry for the late reply.
Well, I am not using any files with a vector store. When I list the vector stores, I get nothing:

{
    "object": "list",
    "data": [],
    "first_id": null,
    "last_id": null,
    "has_more": false
}
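
That output came from listing the vector stores with the SDK, roughly like this (the vectorStores namespace may sit under beta depending on the SDK version):

// List all vector stores on the account.
const vectorStores = await openai.beta.vectorStores.list();
console.log(vectorStores.data); // []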

Second, I always start with a new thread and I am not appending previous messages. In fact, I am using a last-5-messages truncation strategy:

truncation_strategy: {
  type: "last_messages",
  last_messages: 5,
},

I don’t think that any of those things justifies this sudden spike in token usage.

I added max_prompt_tokens: 2000 and will monitor again.
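
For reference, that just means adding the cap to the run parameters, something like:

const stream = openai.beta.threads.runs.stream(threadId, {
  assistant_id: assistantId,
  tools,
  tool_choice: "required",
  parallel_tool_calls: false,
  truncation_strategy: { type: "last_messages", last_messages: 5 },
  max_prompt_tokens: 2000, // cap the prompt tokens the run is allowed to use
});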

Thanks,
Gado