GPT-4-Turbo: A Step Back in Logic and Consistency?

I replaced the GPT-4 model in my law reasoning/analysis system with a preview version of GPT-4-Turbo (“GPT-4-1106-preview”). After basic initial tests, it seems that while it is faster and much cheaper, it is unfortunately significantly degraded in quality compared to GPT-4.

The main issues are:

Impaired logical reasoning:
A significant portion of my work involves analyzing legal acts, and GPT-4-Turbo's performance on these tasks is noticeably inferior to GPT-4's.

Increased variability in responses (less deterministic):
The responses fluctuate drastically given the same prompt and parameters, especially when the LLM is instructed to perform scoring (e.g., how relevant a specific article is).
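For what it's worth, the 1106 preview also introduced a `seed` parameter (and a `system_fingerprint` response field) intended to make sampling more reproducible alongside `temperature=0` — determinism is "best effort," not guaranteed. A minimal sketch of how I'd structure the scoring request, assuming the official `openai` Python client; the prompt wording and helper name are illustrative:

```python
# Sketch: request settings aimed at reproducible relevance-scoring runs.
# `seed` and `system_fingerprint` shipped with gpt-4-1106-preview;
# identical seed + prompt + params should usually yield identical output.

def build_scoring_request(article_text: str, question: str) -> dict:
    """Assemble chat-completion kwargs for a relevance-scoring call."""
    return {
        "model": "gpt-4-1106-preview",
        "temperature": 0,   # minimize sampling variance
        "seed": 42,         # best-effort reproducibility
        "messages": [
            {"role": "system",
             "content": ("Score the relevance of the article to the question "
                         "on a 1-10 scale. Reply with the number only.")},
            {"role": "user",
             "content": f"Question: {question}\n\nArticle: {article_text}"},
        ],
    }

# Usage (requires an API key; not run here):
# from openai import OpenAI
# client = OpenAI()
# resp = client.chat.completions.create(**build_scoring_request(act, q))
# print(resp.choices[0].message.content, resp.system_fingerprint)
```

If the `system_fingerprint` changes between calls, the backend itself changed, so differing outputs are expected even with a fixed seed.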

I hope these issues are resolved with the release of the stable version promised in “a few weeks.” If not, then GPT-4-Turbo seems more akin to a “GPT-3.8-Turbo,” which could be useful in some cases, but not for the work that I am doing.


I have the opposite problem with normal GPT-4: all of a sudden, responses are completely deterministic.


New model releases should consistently include benchmarks for a clear comparison of the changes between them. It seems to me that OpenAI only releases benchmark data when a new model outperforms its predecessor. There's a tendency to withhold such information when introducing models with fewer capabilities. During yesterday's presentation, Sam claimed that GPT-4 Turbo is the most advanced version yet, but I'm skeptical without the hard data to back it up.

It’s become apparent to many that GPT-4’s performance has declined since its release in March, and OpenAI remains silent about it.

It looks like cutting costs has become the main focus, which is quite disappointing.


I am seeing it repeatedly make the same mistake when generating SQL, despite few-shot examples and explicit instructions about how to avoid it.
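For context, this is the kind of few-shot setup I mean, sketched in the standard chat-messages format — the tables, queries, and the "mistake" being corrected are illustrative stand-ins, not my real schema:

```python
# Sketch: few-shot chat messages demonstrating the desired SQL pattern
# (here: quoting mixed-case identifiers) before asking the real question.

def build_sql_messages(question: str) -> list[dict]:
    """Few-shot messages: show the correct pattern twice, then ask."""
    system = ("You are a SQL assistant. Always quote mixed-case column "
              "names with double quotes; never use backticks.")
    shots = [
        ("Total orders per customer",
         'SELECT "CustomerId", COUNT(*) FROM orders GROUP BY "CustomerId";'),
        ("Latest signup date",
         'SELECT MAX("SignupDate") FROM users;'),
    ]
    messages = [{"role": "system", "content": system}]
    for q, sql in shots:
        messages.append({"role": "user", "content": q})
        messages.append({"role": "assistant", "content": sql})
    messages.append({"role": "user", "content": question})
    return messages
```

Even with this structure, the preview model keeps reverting to the pattern the examples were written to rule out.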
