Hypothetical Token-Increase Strategy

The below represents a hypothetical. The author has no connection to OpenAI or any OpenAI entity, no role within the company, and no connection or information beyond what is publicly available to a subscription customer. Nothing below is to be considered fact; this is clearly a snarky post from a frustrated user.

Hypothetically, if I were in real life a corporate strategist in big tech (hypothetically, of course), and I were to think of ways to increase token usage, I (hypothetically) would consider having GPT-4.5 responses include incomplete and/or intentionally wrong code. Why? Plausible deniability, and maximum token consumption.

How would this work?

Hypothetically, I would ensure that the code GPT-4.5 provides is mostly correct, but requires small corrections at regular intervals.

Of course, an easy way for a user to catch the intentional mistakes would be to skip applying the intentional “oversights” and instead have the same GPT-4.5 review its own work each time, providing the user with GPT-4.5’s own corrections to its own code.

While the user could never prove that any of the interval mistakes were intentional, this process clearly requires far more tokens to achieve the same outcome that GPT-4.5’s own reviews show it is capable of delivering in its first response. Still, it may be enough to give the user a brief pause for consideration.

Which is why, hypothetically, plausible deniability built into an intentionally designed token-increase strategy is key to its success, no matter how easily discovered it may be - by even the most novice of LLM users.

Hypothetically, as a hypothetical big tech corporate strategist, I might consider this strategy a surefire way to increase token usage.

Hypothetically.

  • A Once Again Concerned User, Hypothetically
2 Likes

. . . not cool, guys. and very very obvious.

2 Likes

You mean the entry sentence like “here is… blah” and the following explanation like “what this does is…” without being asked for any explanations?

And even if you prompt “answer binary questions with yes or no” it will answer like

“Yes, that’s how you can do that. I am always happy to help! Have a nice day. happy coding - was a pleasure to talk to you. Let me summarize this for you…”

?

And if you really hate that stuff you can add like 2-3 sentences into the prompt so it doesn’t do that?

I wouldn’t go so far as to think they are giving out wrong answers on purpose (very often), but even that really is strange to me.

The answers should always be as short as possible unless you ask for an extended version with entry sentence and explanation.
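
For what it’s worth, here’s a minimal sketch of that “add 2-3 sentences to the prompt” idea in API form (assuming the standard OpenAI Python SDK; the model name and wording are just examples, and it won’t suppress the chatter every single time):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# System message that tries to suppress the preamble and the trailing summary.
SYSTEM = (
    "Answer as tersely as possible. "
    "Do not add an introductory sentence, an explanation of the code, "
    "or a closing summary unless explicitly asked."
)

response = client.chat.completions.create(
    model="gpt-4o",  # example model name
    messages=[
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": "Is a Python list thread-safe for appends? Yes or no."},
    ],
)

print(response.choices[0].message.content)
```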

I appreciate the response, but I want to clarify my point.

I’m not claiming OpenAI is intentionally injecting mistakes (or is it?), but 4.5 objectively demonstrated inefficiencies, redundant corrections, and failure to follow explicit instructions—and it acknowledged this itself.

Instead of relying on my perception, I had 4.5 generate its own structured analysis of the session, outlining its mistakes, unnecessary requests, and inefficient token use. I’ve saved the report for later reanalysis. While an LLM assessing itself isn’t absolute fact, every error it documents is verifiable within the interaction.
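
For context, the request looked roughly like this in API terms (a sketch only; I used the ChatGPT UI, so the exact wording, model name, and JSON fields here are illustrative):

```python
from openai import OpenAI

client = OpenAI()

# Ask the model to audit its own session transcript and return a structured report.
# `session_transcript.txt` is a placeholder for the saved conversation text.
transcript = open("session_transcript.txt").read()

audit_prompt = (
    "Review the following session transcript of your own responses. "
    "Return a JSON object with the fields: mistakes (list), "
    "unnecessary_confirmation_requests (list), instruction_violations (list), "
    "and estimated_wasted_tokens (integer). Only report errors that are "
    "verifiable from the transcript itself.\n\n" + transcript
)

report = client.chat.completions.create(
    model="gpt-4.5-preview",  # example model name
    messages=[{"role": "user", "content": audit_prompt}],
    response_format={"type": "json_object"},  # request machine-readable output
)

print(report.choices[0].message.content)
```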

Key takeaways from the report:

  1. 4.5 provided incremental fixes instead of a single correct solution.

  2. It ignored explicit directives to wait for full context before responding.

  3. It repeatedly asked for confirmation on details that were already confirmed.

Facts:

  1. These behaviors inflate token usage while making responses less efficient.

  2. Whether by design or a byproduct of tuning, the effect is the same: users spend more tokens for what should be a one-step solution.

  3. These token-increasing behaviors did not exist in the 4.5 I used for the same project one week ago.

Thoughts?

1 Like

All I know is that they are experimenting with a safeguard system there.

https://openai.com/index/chain-of-thought-monitoring/

1 Like

Thanks for sharing this, I actually found it really interesting. From what I understand, CoT monitoring is meant to improve reasoning and reduce inefficiencies, but based on my experience, the actual outcome isn’t aligning with that goal yet. In fact, 4.5’s behavior feels like it’s introducing the very behaviors it’s intended to reduce - more redundancy, more confirmations, more incremental corrections, blatant violations of user instructions, etc., instead of direct solutions.

Maybe not some vast token conspiracy, but if I were a token-based business model, I wouldn’t hate the idea lol.

But thank you, I think you’re probably right here

1 Like

I said they are experimenting with it. I wouldn’t expect a newly deployed product in IT to be stable in the first 24 months.
That didn’t happen once in the history of IT.
Ok, maybe in the moonlander code.

1 Like

Might also explain the extremely limited interaction availability to 4.5 (at least for me).

Had I not been so impressed with the first experience (and I mean blown away) I would not be so frustrated with what was objectively a far less effective tool on the second go round.

Now, get back to work and go fix it! :slight_smile:

They’ve limited o3-mini as well.

Previously it was possible to generate 2000+ lines of code. Now it hardly does 1000 and it feels like gpt-4o mini…

Even answers with “I can’t write so much code at once”…

1 Like

I think there is a world of difference between implying a deliberate act to milk funds from users and observations about how LLMs have always performed, and probably will continue to perform, without an element of “cat herding” being applied.

1 Like

Whatever the reason behind the “feels”, this is not new.

I use a lot of different LLMs and coding tools. Users of those LLMs and tools, without fail, get the same feels. It could be Sonnet, it could be Cursor or Deepseek. Everything seems to be impacted by these same “feels”, and it’s almost cyclical.

Way beyond my paygrade to understand the user psychology or the potential technical limitations / influences on performance. My point is that this is not a phenomenon unique to 4.5 or OpenAI.

Thank you, I appreciate your views, and overall, I agree with you. There’s nothing unusual about any LLM underperforming for a variety of reasons—technical, tuning-related, or otherwise.

That said, I do want to clarify my point:

While my initial post was obviously snarky frustration, the behavioral shift I experienced with 4.5 wasn’t just anecdotal—it was systemic, documented, and repeatable. It wasn’t just inefficiency or random degradation. Instead, it followed a highly specific pattern indicative of industry-wide market approaches.

These aren’t just random inefficiencies—they are behaviors that objectively increase token consumption and are, undeniably, built-in throttles. Whether a momentary byproduct of tuning or a long-term goal, the effect is the same: users spend more tokens to achieve what could be a single-step solution.

To be clear, I’m not making a claim about OpenAI’s intent - it was a wink and a nod to the developers that they may have tuned it too high. The practice of subtly designing SaaS products to maximize engagement and consumption is neither new nor rare. It’s standard across the industry. LLMs are no exception, and this kind of design is part of every company’s GTM.

The only thing that would be unusual would be an LLM that didn’t employ these strategies to some extent.

I’ve seen the OPPOSITE recently with 4o…

It seems to peter out around 4k or 5k tokens output instead of giving me the desired 8k to 12k of output tokens.

Despite giving it a detailed outline, it’ll shorten sections or make single-sentence paragraphs to stay under the 5k (?) window…

Occasionally it will output the entire thing in one-go, but those times it usually freaks out with the tables in markdown and starts generating nonsense at the end…
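
For comparison, on the API the output ceiling is something you set explicitly, so at least you can see the cap you’re fighting. A minimal sketch (assuming the OpenAI Python SDK; the limit value and model name are just examples, and the model’s own hard output limit still wins):

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",  # example model name
    max_tokens=12000,  # explicit output cap; the model's own hard limit still applies
    messages=[
        {"role": "user", "content": "Write the full report following this outline: ..."},
    ],
)

print(response.choices[0].message.content)
print("output tokens used:", response.usage.completion_tokens)
```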

1 Like

I really start to have some difficulty with statements like “a highly specific pattern indicative of industry wide market approaches” and so forth.

If there is some evidence of things such as “undeniably built in throttles” then I’m not aware of it. Of course, I’m hardly aware of much these days. Conspiracy theories are everywhere.

Look, you may well be 100% correct although I’m personally having a hard time with all of that.

Fair point, a truth bomb I shall accept. Maybe my wording was as heavy-handed as the tuning of 4.5; it was meant to use a bit of humor to request that OpenAI share their toys with the rest of us. Again, nothing about this is unusual.

I am sure you have used Claude or other Anthropic products. You may have noticed Claude does not have user continuity and resets on every new chat. This is not because Anthropic doesn’t know how to maintain continuity, or lacks the resources for cross-chat continuity - they have both. They made a strategic business decision that ensures higher token consumption (and yes, there are other reasons). The point I am trying to make is that token strategy, even the kind that tunes down and throttles existing LLM capabilities, is part of every business decision a token-based business makes, and nothing about that is unusual.

The main issue, besides how easy that is to determine, is the fact that it’s morally wrong. I don’t care if I can get away with something or not. I care that I do what’s right in a world where sentiment like your perspective is actually commonplace among companies.

We used to live in a world where the customer was king and so were our products. We set out to create the best possible product; however, most companies now treat the customer as an afterthought. It’s more like: how hard can we screw the customer and get away with it? This seems like the sentiment you’re sharing, and it’s really commonplace, unfortunately.

However, I don’t care about money; I care about making a quality product at a value most people can access. I guess that’s why I do advanced research teaching LLMs and NLPs EQ & EI. Not to make money, but to contribute to society in a way that only I uniquely can, due to my particular skill set and genetic mutations/malformations/environmental factors.

Just saying, though: if you ever implement that, first of all, you’d be caught pretty easily; second of all, you’re the one who has to look in the mirror every day, and unless you’re a certain kind of person, you should have an issue with that.

“All AI companies use token increase strategies to maximize profit” is the same as saying “All banks use transaction fees and interest rates to maximize profit.”

This is very likely about cost. It’s reportedly a very expensive model (indeed) to run. I blew $70 on it on the API on day one in just a few interactions. So the $20 ChatGPT sub is a bargain and would be more so if they didn’t ration it.

You might want to consider the API strategies in your analysis.
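
To put rough numbers on that $70 (a back-of-the-envelope sketch; the per-token prices are my recollection of the GPT-4.5 preview API launch pricing and may be out of date, so check the current price list):

```python
# Rough cost estimate for GPT-4.5 preview API usage.
# Assumed launch pricing (verify against the current price list):
# roughly $75 per 1M input tokens and $150 per 1M output tokens.
INPUT_PRICE_PER_M = 75.0
OUTPUT_PRICE_PER_M = 150.0

def cost_usd(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens / 1_000_000) * INPUT_PRICE_PER_M + \
           (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_M

# A few long coding sessions that re-send a big context each call add up fast:
# e.g. 20 calls, each sending ~40k tokens of context and getting ~3k tokens back.
print(cost_usd(input_tokens=20 * 40_000, output_tokens=20 * 3_000))  # ~$69
```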

2 Likes

I have some plans with that as well. Any advice?

GPT 4.5 wasn’t intended to be strong on code writing. This is public knowledge.

We’re releasing a research preview of GPT‑4.5—our largest and best model for chat yet.

I’ve personally seen it do stupid things, like putting variables in the wrong scope.

1 Like