How to handle token rate limits while streaming the response

I want to use the OpenAI API for a production system where I want to stream the response to the user. There is a possibility that the token rate limit will be exceeded while the response is being streamed, in which case the user will only get to see half the response, which is bad in terms of user experience. Is there a way to handle this? Maybe by predicting the tokens that the model will generate, or by getting the token rate limit error before the stream starts rather than in the middle, so that I can send the request to another instance of the LLM.

Hi @palashmunshi and welcome to the Forum.

One way to think about this problem is to ensure that your input tokens do not exceed a certain threshold, always leaving enough room for the maximum output tokens you expect to generate.
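As a minimal sketch of that idea, you could count the prompt tokens with tiktoken before sending the request and only stream when enough headroom remains. The budget values and the `prompt_fits` helper below are illustrative assumptions, not API limits:

```python
# Pre-check the prompt size with tiktoken before streaming, so the request
# is only sent when there is room left for the full reply.
import tiktoken

TOKEN_BUDGET = 8000        # assumed per-request budget you want to stay under
MAX_OUTPUT_TOKENS = 1024   # room reserved for the streamed reply

def prompt_fits(messages, model="gpt-4"):
    """Return True if the prompt leaves enough headroom for the reply."""
    encoding = tiktoken.encoding_for_model(model)
    prompt_tokens = sum(len(encoding.encode(m["content"])) for m in messages)
    return prompt_tokens + MAX_OUTPUT_TOKENS <= TOKEN_BUDGET

messages = [{"role": "user", "content": "Summarise this document ..."}]
if not prompt_fits(messages):
    # e.g. truncate the prompt or route the request to another deployment
    raise ValueError("Prompt too long for the reserved output budget")
```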

At most, you are looking at 4,096 output tokens, although in practice this number will typically be significantly lower. You can further control your output tokens via the max_tokens parameter.
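For example, with the openai Python SDK (v1.x) you can cap the reply length on a streaming request so it cannot run past the room you reserved; the model name here is just a placeholder:

```python
# Cap the reply with max_tokens so a stream cannot exceed the reserved budget.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

stream = client.chat.completions.create(
    model="gpt-4",  # illustrative model name
    messages=[{"role": "user", "content": "Explain token rate limits briefly."}],
    max_tokens=512,  # hard cap on output tokens for this request
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```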