Hi there!
I’m implementing a service that allows users to send requests to OpenAI models. My prompts consist of both a fixed part and a variable part based on the view the user is working in. These views can range from a few hundred tokens to several thousand.
To ensure there’s enough space for the model to generate its response, I account for each model’s context window, reserving some headroom for the answer and truncating the variable part if needed.
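Here’s a rough sketch of the trimming logic I have in mind (the token counting with tiktoken and the specific encoding are just for illustration, not necessarily what I’ll ship):

```python
import tiktoken

# Sketch only: trim the variable part so that
# fixed prompt + variable part + reserved answer space fits in the context window.
enc = tiktoken.get_encoding("o200k_base")  # encoding choice is an assumption

def build_prompt(fixed: str, variable: str, context_window: int, answer_reserve: int) -> str:
    fixed_tokens = enc.encode(fixed)
    # Tokens left for the variable part after the fixed part and the answer reserve.
    budget_for_variable = max(0, context_window - len(fixed_tokens) - answer_reserve)
    variable_tokens = enc.encode(variable)
    if len(variable_tokens) > budget_for_variable:
        # Cut the variable part short so the answer still has room.
        variable = enc.decode(variable_tokens[:budget_for_variable])
    return fixed + variable
```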
However, with models that have large context windows, the overhead I leave sometimes exceeds the model’s maximum output token limit.
My question is: are these maximum output token limits stable parameters that won’t change when the models are updated? Can I reliably store a static map of each model and its max output tokens, so I don’t reserve more space for the response than the model can actually produce?
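For context, this is roughly what I mean by a static map. The model names and numbers below are placeholders to show the shape of the data, not values I’m asserting:

```python
# Illustrative only: per-model limits kept in a static map.
# Placeholder names and numbers; I'd verify real values against the docs.
MODEL_LIMITS = {
    # model name: (context_window, max_output_tokens)
    "model-a": (128_000, 16_384),
    "model-b": (1_000_000, 32_768),
}

def answer_reserve(model: str, desired_reserve: int) -> int:
    context_window, max_output = MODEL_LIMITS[model]
    # Never reserve more than the model can emit in a single response.
    return min(desired_reserve, max_output)
```

The idea is that the capped reserve feeds into the trimming logic above, so on large-context models I stop over-reserving space the model could never fill.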
Thanks in advance, everyone! I’m new to the forum and hoping to learn a lot!