Loss of logic in the ChatGPT May 3 Version

I’m happy to hear that you haven’t been experiencing it. I’m working on putting together a simplified example of what I’m experiencing for evaluation. In the meantime, I figured I’d reach out and touch base with everyone about the new update.

I asked it to look over my code and find a problem. It replied with something like the following and then started writing gibberish code:

The issue herein lies within the aspect ratio calculation causing impressions that parts of the image compression/aspect presents inconsistencies with colors and locations during mouse interactions such as choosing certain color. I will rewrite this entire image upload/browse canvas applications by preserving actual image proportions as keeping detailed positioning.

Here’user ">

Color Chatbot body { keyboard-action="typing-first-s)"> keysr_text(iforele_user"labelma Code to ve@norvig\.cole"text(inv)','') get_ifav code_chunks: O[rt"]:rnS]]>sfoe:;"(not.ad!~-~-(.bubscriptions==-)):+cript\setult?[ dt)[.InU(": ifca>

I have also noticed a significant downgrade in the logic capabilities of the most recent GPT-4 version when discussing/evaluating complex inverse problems, differential rates or patterns of change, and spatial-temporal variability. I only rarely received erroneous replies before the update, but now I have to double-check all output (e.g., double-negative conditions sometimes don’t get appropriately translated into a positive condition). I see these errors as more GPT-3.5-like than prior GPT-4 levels of reasoning. Very frustrating.


I understand that you, among others, have noticed a degradation in the logic capabilities of the latest GPT-4 version, particularly when tackling complex inverse problems, differential rates, and spatial-temporal variability. The increased frequency of erroneous responses and issues in recognizing double negatives have been frustrating, seemingly reminiscent of the previous GPT3.5’s performance.

Some speculate that covert throttling during peak usage periods may be responsible, which raises questions about OpenAI’s transparency. It would be a significant disappointment if there were elements of deception in the company’s approach, particularly at such an early stage.

I trust that OpenAI will seriously consider these concerns and address them promptly. Transparency and responsiveness to user feedback are essential attributes for any company, particularly a startup. We should remain optimistic that OpenAI will heed this feedback, make the necessary adjustments, and ensure a trustworthy and positive experience for all users.

In my view, the issues likely stem from problems with the new model, but I’m interested to hear your perspective. What do you think?

1 Like

I am not sure. I think you have articulated a number of causes for concern very well. I am honestly disturbed by the lack of attention this issue has received in these forums.

Furthermore, I am even more dismayed by the latest “May 24 Version” which appears to actually be a further step in the wrong direction.

I am observing even greater loss of reasoning between the May 12 and May 23 version! Hard to overstate how frustrating this is! The May 23 version in many respects appears to follow instructions more poorly than GPT3.5 at this point and consistently fails to solve basic logic problems.



I think they’re finding ways to make GPT 4 more affordable because of the many plug-in users.
Everyone is annoyed at the slow speed and 25 message cap. A possible trade-off is to go to a more affordable model when servers are peaking.

Here’s yet another disgusting, astounding example from today of just how stupid GPT4 has become:

OK, let’s revise your last writing output to be more formal. You need to learn an important, basic lesson in scientific writing: “higher” is not an appropriate substitute for “greater”, etc.


I appreciate your feedback, and you’re correct. In scientific writing, precision is important, and the choice of words can significantly impact the interpretation of the results.

“Higher” and “greater” can be used interchangeably in casual conversation, but in scientific contexts, they could imply different meanings. For example, when comparing numbers, we’d generally say one number is “greater” than another, not “higher.” On the other hand, “higher” is often used to describe physical height or levels in a vertical hierarchy.

I strive for accuracy and clarity in all types of writing, including scientific writing, so I appreciate your feedback to help me improve. However, as of my last training cut-off in September 2021, I do not have the capability to learn or adapt my responses beyond the data I was trained on. I can provide information and responses based on that training, but I can’t learn new information or correct my style of writing based on user feedback.

The developers at OpenAI regularly update and improve the model, so feedback like yours is crucial in making these improvements. Thank you for your input!

W. T. F. am i even paying for…!?
Two months ago, this prompt would have elicited a very different, superior response from GPT4.


What, exactly, were you hoping for the response to be in this instance? I’m sorry, I don’t see anything particularly wrong with this.

1 Like

Invariably, in the past, ChatGPT-4 would then revise the previous output & adopt a more formal writing style, with more precise language. In 99% of cases, the GPT would also understand that there was no need for a direct reply to my statement, just a desire for more rigorous writing. I’m honestly surprised that you see nothing wrong with this reply. Everything about it screams “lobotomy” compared to the April/May 3 versions of GPT-4.

1 Like

I mean, I guess you could just share the offending conversation?

To me it looks like you told it that it needed to learn something and it responded to that.

Anyway, feel free to share the chat.

1 Like

I just shared my link in this post: Major Issues in new GPT 3.5 : DO NOT DEPRECATE OLD ONE - #20 by SomebodySysop

This isn’t a prompt issue or “hoping for a better response” issue. It’s a model degradation issue.

Now, I’m not complaining too loudly, because the GPT-4 codex model, despite a hiccup here and there, is doing an exemplary job as my coding assistant. I have built a whole working ingestion/query system, from scratch, in PHP, with its assistance. Very happy with the results.

But, GPT-3.5 and the GPT-4 Browsing models? I don’t know. Something’s not right.

Here you go: https://chat.openai.com/share/73188b65-b8bb-4b2d-97a2-24400e9dbf30

Start at the first occurrence of “Drupal 9”. It just appears to be throwing out random guesses.

The text in this conversation is over 30k tokens; you are asking the model about details from many tens of thousands of tokens ago, when the model’s context is only 8k.

This will result in hallucinatory output. As people grow more comfortable with the AI they tend to have longer conversations and it can be easy to forget that the context limit is finite when you are engrossed in a discussion.
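That arithmetic is easy to sanity-check yourself. Here is a minimal sketch using a crude ~4-characters-per-token estimate (for exact counts you’d use a real tokenizer such as tiktoken; the 8k limit is the GPT-4 context size mentioned above):

```python
# Rough check of whether a conversation still fits the model's context
# window. Token counts are estimated with the common ~4 characters per
# token heuristic for English text; a real tokenizer would be exact.

CONTEXT_LIMIT = 8_000  # tokens available to the model (GPT-4 in this thread)

def estimate_tokens(text: str) -> int:
    """Very rough estimate: ~4 characters per token."""
    return max(1, len(text) // 4)

def tokens_remaining(messages: list[str], limit: int = CONTEXT_LIMIT) -> int:
    """How much of the context window is left after these messages."""
    used = sum(estimate_tokens(m) for m in messages)
    return limit - used

# A 30k-token conversation overflows an 8k window by a wide margin:
conversation = ["x" * 120_000]  # roughly 30k tokens of text
print(tokens_remaining(conversation))  # negative: earlier turns get dropped
```

Anything that pushes the count negative means the oldest turns have silently fallen out of the window, which is exactly when the model starts guessing.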

1 Like

The entire conversation may be over 30K, but for the immediate question I supplied the code to process and the resulting error messages. I was NOT referring to the much earlier code, but the immediate code and errors. This is not a memory or lost context issue. When I popped back onto GPT-4 codex, it answered the question within a couple of tries, with logical responses to errors, not illogical responses to errors.

It is best practice to create a persona for coding and to start a new chat regularly so that the persona and the context stay up to date. This is especially true if you are coding at a professional level. In the example where you had less-than-ideal output, there was no persona defined and regularly refreshed, and the context was long; when you went into the codex-prompted one, it had a context set up specifically for coding, and you gave it just the code you were interested in.

All of these factors alter the performance of the model.

1 Like

GPT-3.5 gives bad answers out of the box. But, I can’t say your suggestions won’t make a difference until I’ve tried them. I avoid GPT-3.5 Chat like the plague now, but if forced to use it in a pinch, I’ll try it.

If you set up the coder persona in GPT-4 every 3 or 4 messages and then give it the exact code you’re interested in, I’m finding that works really well. So long as I know I’ve roughly got my persona prompt and the code context inside the last 4k tokens, it’s great.

I am fortunate enough to have access to the GPT-4 API which I do use for longer sections of code.

Another method I use is to have GPT-4 define the code outline as contentless functions, classes and definitions, then I create a consistent coder persona with a brief overview of the code I am working on as a whole, with library requirements and associated helper function definitions. Then I start a new chat for each function or class I am working on and include that persona prompt, then I work on just that function in that chat and refresh the persona every few prompts.

It’s extra work but I find the results to be excellent.
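If you’re driving the API directly, the refresh-every-few-prompts routine can even be mechanized. A minimal sketch of the idea; the persona text, the refresh interval, and the helper name are my illustrative assumptions, not anything official:

```python
# Sketch of the persona-refresh pattern: re-insert the coder persona as
# a system message every few user turns so it never drifts out of the
# context window. The message-dict shape matches the chat-completions
# convention of {"role": ..., "content": ...}.

PERSONA = "You are a senior PHP developer working on an ingestion/query system."
REFRESH_EVERY = 4  # re-send the persona every 4 user messages

def build_messages(history: list[str]) -> list[dict]:
    """Interleave the persona into a chat-completion message list."""
    messages = []
    for i, user_msg in enumerate(history):
        if i % REFRESH_EVERY == 0:  # before turns 0, 4, 8, ...
            messages.append({"role": "system", "content": PERSONA})
        messages.append({"role": "user", "content": user_msg})
    return messages

msgs = build_messages(["fix this function", "now add error handling",
                       "write a unit test", "refactor for clarity",
                       "add docstrings"])
print(sum(m["role"] == "system" for m in msgs))  # persona appears twice
```

The same list-building logic works whether you paste the persona into the ChatGPT UI by hand or send it programmatically.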

1 Like

I have never had enough trust in GPT for math. I wrote a script that uses Wolfram Mathematica or MATLAB as the mathematics engine for this reason. (Depending on your use case, “mycroft-core” is a decent framework you can modify to have Wolfram and GPT work together; I looked at how they accomplished this and cut out the assistant part.)

1 Like

As @Foxabilo stated, there are simply far too many tokens there; until I can get to my PC and verify, I’ll trust the 30k estimate provided.

So, I’m absolutely not surprised the model went sideways on you, I observed this same behavior several months ago working on a coding project.

The problem presents itself when some critical piece of context gets dropped and the model needs to fill in the gaps. If it doesn’t fill them in nearly exactly the same as they were originally, you get a cascade of problems as the model makes increasingly desperate attempts to patch what it did before with what it’s doing now.

As it needlessly refactors more and more code you get a situation where other critical pieces start to fall out of context.

The solution, as has already been given to you, is to occasionally collect all of the “good” parts together and start a new chat with only those pieces in context and start from there.

Two other pieces of advice I can give you are these,

  1. Use the “Edit prompt” button and “Save & Submit” when you need to get the model to do something slightly different. This takes the wrong response out of context so it doesn’t take up valuable space and it won’t be referenced again.
  2. Run incidental questions in a separate chat. Things like:
    A. question: when a new user installs this module, will these updates be applied automatically in the new install?
    B. so, I need to modify the existing install file with the changes and remove the update functions for any new installs?
    C. In drupal, how do I get the IP address of the current user?
    D. what was the command to update composer?

Can all be run in a separate chat instance, further cleaning up your context window.

In short, it’s generally preferred to keep only the most salient information in context. If the model gives you a response that falls short of your requirements or expectations—while you can certainly ask it to fix the response in the next prompt—it’s better to figure out how to modify the initial prompt to directly spawn the “fixed” response. If you have questions adjacent to the work you are doing, it’s better to ask them separately, then include your new-found knowledge in the prompt to get the desired result.

It is a little more work upfront, but much less in the long run.
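If you manage the context yourself via the API, the “keep only the most salient information” advice boils down to a simple trimming loop. A rough sketch, again using an approximate 4-characters-per-token estimate (a real implementation would use a proper tokenizer):

```python
# Sketch of context trimming: drop the oldest turns until the
# conversation fits a token budget, always keeping the system/persona
# message. Token cost uses a crude ~4 chars-per-token estimate.

def trim_history(system: str, turns: list[str], budget: int) -> list[str]:
    """Return the system message plus the newest turns that fit the budget."""
    cost = lambda text: len(text) // 4  # crude token estimate
    remaining = budget - cost(system)
    kept: list[str] = []
    for turn in reversed(turns):        # walk newest-first
        if cost(turn) > remaining:
            break                       # everything older is dropped too
        kept.insert(0, turn)            # restore chronological order
        remaining -= cost(turn)
    return [system] + kept

history = ["old " * 500, "older question", "recent question", "latest code"]
print(trim_history("persona", history, budget=100))
```

The oldest, bulkiest turn is discarded first, which mirrors what the hosted chat does silently when you overflow the window; doing it explicitly just lets you choose what survives.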

Also, this is not a new problem or indicative of any decline in the quality of the models. This has been a recurring issue with all GPT models from their inception.

One final note: things looked like they were going more or less fine until the model was switched midstream. I’m not sure much has been published on how models handle continuing conversations started in other models, but the immediate loss of half your context window certainly could not have helped matters.


Yes, the model was switched from GPT-4 to GPT-3.5. Yikes, what a difference!

But, the difference is as striking in the API as well. Here is a pretty clear example. Same question, prompts, same context document. Same temperature.

OK, so then I thought: GPT-4 had the advantage of the chat history, so I decided to give GPT-3.5 the same information:

To its credit, it at least tried to read the document. But, just looking at the titles of the documents, how artificial is your intelligence when you can’t see a relationship between twpunion and twpunion-d9?

OK, I’ll try this. I had made a point of trying to stick with the same conversation for the whole project (in order to cut down on multiple related chats), but I’ll try it. I have had some success with this approach when GPT-4 starts to get a bit wonky.

Thank you for the suggestions.

1 Like