GPT-4 becoming dumber sometimes, for a while

After several weeks of running my translations code using the GPT-4 API, I’m puzzled by a certain behavior. It seems that most of the time GPT-4 is able to follow my instructions and translate the data, while not translating the context hints provided. But every once in a while it seems to go into a “dumb mode”, where it will just ignore some of the instructions and make a mess of the output.

What’s more curious is that this “dumb mode” seems to persist for several API calls, e.g. it’s not a one-off. I’ll then back off, wait an hour or so, and try again, and I’ll get good results again.

That’s… weird. I do expect variability, but not of this kind, especially as I’m using a temperature setting of 0 (also, presence_penalty and frequency_penalty are set to 0). With this temperature setting I would expect very little variability in the results.

This behavior seen from the outside looks just as if the “smart” GPT-4-0613 model was sometimes replaced with a dumber version, for several API calls.

Has anybody else seen this?

I’m using gpt-4-0613 and function_call functionality to get well-formed JSON.

I have seen cases such as this, in every one it was down to one of a few causes.

  1. The prompt and context is too long and goes over the max token limit.
  2. The prompt and context are not sent to the API call due to a programmatic error in data handling.
  3. The Prompt contains too many instructions at one time.
  4. Errors in the construction of prompts.

If you can share your prompt creation handling code and the code that calls the API and any support code it relies on, I would be happy to look at it for you.


Hey, thanks for offering to help. The code that creates the prompts is really complex at this point and I don’t think it makes sense to publish it. What it does, in brief is pack some basic instructions into the system prompt, take a document that describes the application and the strings to be translated into user prompts, together with the epilog.

The epilog is the most important thing here, because this is the only place that contains the instructions for processing.

My code does token estimation and packs the strings into batches, so that the entire request fits within the token limit, so explanation (1) does not apply. I also use a lower token limit because the GPT-4 API has trouble responding to larger requests because of its slowness. My total token counts (as returned by the API) are around 5500 tokens. The code is also fairly well tested at this point and I know that (2) doesn’t apply.

Causes (3) and (4) are quite probable, and I would be very grateful for any suggestions you might give.

Can you duplicate the API data please. What I mean by that is can you take a copy of the completions API call text and parameters and output them to a logfile.

Then repeat your testing until you find a translation that seems to be out of spec and then make a note so you can find it in the log file later.

I suspect there is something going wrong in your now complex code, I think we have spoken before on this topic, a few weeks ago.

I am doing that — logging requests and responses.

Examples of things that GPT does sometimes (things I call “dumb”):

  • numbering in the translated strings (which I now specifically ask it not to do), e.g. each string gets prefixed with “1.”, “2.” etc.

  • including the hints in the translated strings, either verbatim, or translated, also against the instructions

  • inserting an extra pair of double quotes in the translated strings

Now, I get that it might not follow the instructions, what puzzles me is that it doesn’t follow them sometimes.

As to your suggestions regarding my code, it’s not necessarily perfect, but if I’m sending 20 requests formed by the same code in a row, and I see a streak of 5 requests where GPT does not follow the instructions, while it does follow them for the other 15, it’s unlikely that my code is the cause of the problem here.

I’d be inclined to believe this was happening, due to the fact that OpenAI has had a lot of infrastructure problems, system problems [1], and staff seems underwhelming to handle the general demand. Constantly there are threads here where paying customers complain about having urgent problems and waiting for support.

I do not hold OpenAI in high regards if I’m being honest. There’s an entire moral debate to be had about how they’re making money off of copyrighted material in the first place, etc, etc.

So no one can really say if it is being dumbed down or not. Based on general network stress, or a system where they lower your “brain power” based on a rate limit. I would suggest you do scientific testing to confirm it, but that would cost a lot of money so no one should expect you to.

[1] This was the thread that prompted me to make my first post:

Ok, so lets take your first comment about telling the model not to do something, the current generation of LLM’s do not work well when told not to do a task, it’s like saying don’t think of a pink elephant. If you can reword your prompt to make it an action it should do rather than should not do, that would be much better for the model.

Additionally, current generation LLM’s will follow patterns given to them, so if you show an example of how you wish something done they will tend to produce output that matches that example.

Things like double quotes can also be shown by example not to be used by showing only single quotes in examples, also you can do string based post processing on the raw message content to filter out things like double quotes and escaped chars with standard regex and string processing commands.

LLM’s are never 100% deterministic and your code must sanitise errant data gracefully. If you are reliant on the model returning the exact same string across multiple runs and on varying data then that will end in disappointment unless sanitising procedures are put in place.

The types of things you are describing have always been present in the models from day one, you are now using the models fairly heavily, so edge cases are now displaying themselves more commonly.

You mention that the erroneous output seems to appear in clumps, that could be for a number of different reasons, the input text being scanned could be written by a particular author who uses a specific nomenclature that is causing issues, could be varying formatting in the source data, and it could also be a cyclic buffer type error where the code runs into issues over time.

You may also consider pre-processing input data to remove potential key problematic strings that, though testing, have indicated issues.

Hopefully this gives you some directions to investigate.

1 Like