I found service_tier: flex very useful and easy to use compared to batch requests.
I want to use the GPT-4.1 series (4.1, mini, nano) with flex requests so I can control temperature etc. Is there a plan to add support for those models?
Flex processing is in beta and currently only available for GPT-5, o3, and o4-mini models.
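For reference, a minimal flex request against one of the supported models looks something like this (a sketch with the Python SDK; the model, prompt, and timeout value are just placeholders):

from openai import OpenAI

client = OpenAI()

# Flex processing is opt-in per request via service_tier.
# Flex requests can sit in a queue, so raise the per-request timeout.
response = client.responses.create(
    model="o3",                  # flex is limited to gpt-5 / o3 / o4-mini
    input="Summarize this ticket in one sentence.",
    service_tier="flex",
    timeout=900.0,               # seconds
)
print(response.output_text)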
You seem to ascribe the non-support of other models not to a per-model limitation but perhaps to a limitation on passing sampling parameters?
Have I, by believing the documentation and never trying, missed some behavior other than simply being denied?
gpt-4.1 and gpt-4.1-mini, with temperature and top_p being sent:
event: error
data: {"type":"error","sequence_number":2,"error":{"type":"invalid_request_error","code":null,"message":"There was an issue with your request. Please check your inputs and try again","param":null}}
Completely removing the sampling parameters (commenting them out), then attempting the flex service_tier:
# gpt-5 models are the only ones that take the verbosity setting
if is_gpt5:
    body["text"]["verbosity"] = verbosity
# reasoning models get effort / summary; everything else would get sampling params
if is_reasoning:
    reasoning: dict[str, object] = {"effort": reasoning_effort}
    if reasoning_summary is not None:
        reasoning["summary"] = reasoning_summary
    body["reasoning"] = reasoning
else:
    # sampling parameters disabled while testing flex on gpt-4.1
    #body["temperature"] = temperature
    #body["top_p"] = top_p
    pass
The error event is identical: no demotion to standard priority, no report that the model doesn't tolerate the parameter, just failure with a streaming error event.
OpenAI never seems to discuss plans. They just yoinked endpoint support for file attachment content out of ChatKit, no ‘plan’.
You do raise an interesting point for thought, though: applying temperature is 200_000 division operations per token, one per vocabulary logit. Is gpt-5 so unreliable and unusual in its output because it is simply computationally cheaper to pass un-scaled token certainties to the sampler in reasoning models? A better question is: When is OpenAI going to give control over sampling to more than gpt-5-chat-preview (non-thinking ChatGPT version), at any price?
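To make the arithmetic concrete, here is a toy numpy sketch of temperature sampling over a ~200k-entry vocabulary (the sizes and values are purely illustrative):

import numpy as np

VOCAB_SIZE = 200_000                     # roughly the o200k vocabulary size
temperature = 0.7

rng = np.random.default_rng(0)
logits = rng.normal(size=VOCAB_SIZE)     # stand-in for one token's logits

# temperature is one division per vocabulary entry, followed by a softmax
scaled = logits / temperature
probs = np.exp(scaled - scaled.max())
probs /= probs.sum()

# sample the next token id from the re-scaled distribution
next_token = rng.choice(VOCAB_SIZE, p=probs)
print(next_token, probs[next_token])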
Notes: flex pricing and its models: https://platform.openai.com/docs/pricing?latest-pricing=flex
My impression is that the flex tier saves money by finding idle compute, more efficient queuing, or cross-region inference: some combination of levers that makes inference much slower but cheaper to deliver.
That is only possible for models deployed on lots and lots of servers at once, which means only the highest-traffic models, and there's no reason to bring gpt-4.x models back into that pool while OpenAI is scaling up the GPT-5 family and having it occupy the most servers. From a business perspective, OpenAI also has the most interest in having people migrate to the latest models. Therefore it doesn’t make a whole lot of sense to enable flex tier pricing for older models.
If you write/vibe code a bit of tooling, you can reduce the overall annoyance of batch requests significantly, but I can also understand not wanting to use your time that way.
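For what it's worth, the tooling doesn't have to be much. A rough sketch of the batch round-trip with the Python SDK (the file name, prompts, model, and polling interval are arbitrary):

import json, time
from openai import OpenAI

client = OpenAI()

# 1. Write the requests to a JSONL file, one request per line.
with open("batch_input.jsonl", "w") as f:
    for i, prompt in enumerate(["Prompt one", "Prompt two"]):
        f.write(json.dumps({
            "custom_id": f"req-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4.1-mini",
                "temperature": 0.2,
                "messages": [{"role": "user", "content": prompt}],
            },
        }) + "\n")

# 2. Upload the file and create the batch.
batch_file = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)

# 3. Poll until the batch reaches a terminal state, then read the results.
while batch.status not in ("completed", "failed", "expired", "cancelled"):
    time.sleep(60)
    batch = client.batches.retrieve(batch.id)

if batch.status == "completed":
    for line in client.files.content(batch.output_file_id).text.splitlines():
        result = json.loads(line)
        print(result["custom_id"], result["response"]["status_code"])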
A better question is: When is OpenAI going to give control over sampling to more than gpt-5-chat-preview (non-thinking ChatGPT version), at any price?
I think that’s a good question (slightly off topic but relevant). Those reasoning models are good at many tasks, but I found it’s much easier to control the behaviour using gpt-4.1 with a lower temperature. The reasoning models tend to shift the way they behave, so the outcome is somewhat unpredictable, making them less reliable in some cases.
I think the flex tier is currently enabled for reasoning models only (flex is supported on o3, but not gpt-5-chat-preview), and I was wondering if they will start adding non-reasoning models to the flex tier; that’s basically what I want.
Therefore it doesn’t make a whole lot of sense to enable flex tier pricing for older models.
I understand they want users to move to the gpt-5 series, and I think those are more efficient models, but gpt-4.1 is still the smartest non-reasoning model. I think the relevant distinction is not old vs. new models but reasoning vs. non-reasoning.
If you write/vibe code a bit of tooling, you can reduce the overall annoyance of batch requests significantly, but I can also understand not wanting to use your time that way.
I agree, I just don’t want a bunch of code to handle the batch requests when I could simply wait, if flex were available.
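If flex does get extended to non-reasoning models, the "just wait" pattern stays short. A minimal sketch, assuming flex shortages surface as timeouts or 429s (as the flex docs describe) and falling back to the standard tier when they do:

from openai import OpenAI, APITimeoutError, RateLimitError

client = OpenAI()

def ask(prompt: str, model: str = "o4-mini") -> str:
    """Try flex first; fall back to the default tier if flex is unavailable."""
    try:
        resp = client.responses.create(
            model=model,
            input=prompt,
            service_tier="flex",
            timeout=900.0,       # flex can queue for a long time
        )
    except (APITimeoutError, RateLimitError):
        # flex capacity unavailable or too slow: retry at standard priority
        resp = client.responses.create(model=model, input=prompt)
    return resp.output_text

print(ask("Classify this feedback as positive or negative: 'Great support!'"))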