I use a detailed instruction prompt of roughly 9k tokens to evaluate content in a 10k-token text.
With this usage pattern, GPT-4o is completely unusable, and Claude 3.5 Sonnet is no better. Both models either fail to follow the instructions and truncate content, producing extremely poor output, or refuse to generate a response at all.
Since my task is an evaluation process that must follow detailed instructions, it is not a good fit for a reasoning model.
For this reason I had been using GPT-4 Turbo until now; it was the only model capable of producing proper output.
Today I ran an evaluation test with GPT-4.5, and not only did it process everything correctly, the output quality also appears better than GPT-4 Turbo's.
Because Dify does not support GPT-4.5 and the usage fees are very high, I have not been able to test it extensively, but it has left an extremely positive impression on me.
Even a brief test made it immediately clear why this model is priced above the others, but I am not sure I can afford the cost with my current usage pattern.
Each run fans out multiple parallel tasks and then integrates and prepares the results, so I end up paying over $10 per run (see the rough estimate sketched below).
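As a rough illustration of where that figure comes from, here is a minimal Python sketch. Everything in it is an assumption for illustration, not a measured value: GPT-4.5's preview pricing of $75 per million input tokens and $150 per million output tokens, five parallel tasks, guessed output sizes, and a hypothetical `run_task` stub standing in for the real API call.

```python
import asyncio

# All numbers below are illustrative assumptions, not measured values:
# GPT-4.5 preview pricing ($75 / 1M input tokens, $150 / 1M output tokens),
# a ~9k-token evaluation prompt plus ~10k tokens of content, five parallel
# tasks, and guessed output sizes.
PRICE_IN = 75 / 1_000_000    # USD per input token
PRICE_OUT = 150 / 1_000_000  # USD per output token
PROMPT_TOKENS = 9_000        # detailed evaluation instructions
CONTENT_TOKENS = 10_000      # text being evaluated

async def run_task(task_id: int) -> tuple[int, int]:
    """Hypothetical stub for one parallel evaluation call.

    Returns (input_tokens, output_tokens); a real version would call
    the model's API and read the token counts from the response.
    """
    await asyncio.sleep(0)  # placeholder for the actual API call
    return PROMPT_TOKENS + CONTENT_TOKENS, 2_000  # assumed ~2k-token output

async def estimate_run() -> float:
    # Fan out the parallel evaluation tasks, then add one integration
    # pass over the combined results (sizes are assumptions).
    results = await asyncio.gather(*(run_task(i) for i in range(5)))
    tasks = sum(i * PRICE_IN + o * PRICE_OUT for i, o in results)
    integration = 15_000 * PRICE_IN + 3_000 * PRICE_OUT
    return tasks + integration

print(f"estimated cost per run: ${asyncio.run(estimate_run()):.2f}")
```

With these assumed sizes the estimate lands just above $10, which matches what I see in practice; the exact figure moves with the number of parallel tasks and the output length.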
Fundamentally, the output is worth the cost, but convincing customers of this value is not easy.
If the pricing were slightly lower, I would strongly consider transitioning entirely to GPT-4.5.
Although the cost is a major barrier, I would be in serious trouble if this API were to be discontinued.
At the same time, I understand that for people who can only write low-quality prompts, this API would likely be of no value.
Please continue to maintain this model and API for those who can truly make use of them.