Quality Deteriorates as Interactions Continue

Hello, community.

I’ve noticed in several different settings that the quality of responses deteriorates as the number of interactions (prompt/response) increases, even when the interactions are completely unrelated to each other.

Two examples:

  1. I had a list of local businesses (name + address) and wanted to determine attributes of those businesses (e.g., mission statement). I used the API and Excel, prompting GPT-4o with web search to answer 5 questions about each business. The first thousand or so responses were awesome; then the answers oscillated between clusters of acceptable and obviously wrong. No memory or previous messages were used in the prompts; each interaction was completely separate. Eyeballing the responses, it seemed like GPT-4o would get lazy and just reply with a guess, then have a Red Bull and thereafter offer correct responses until it got ‘tired’ again, and the cycle repeated.

  2. I’ve been prompting GPT-4o in the ChatGPT interface (-4o, -4.1 and -4.5) with a series of information from PDFs. The first ~25 answers were great, and the next 10 were OK, but I would find obvious mistakes where its responses were oblivious to the source; e.g., it would say that “X” didn’t happen, I would provide a line from the PDF clearly saying “X” did happen, and then it would apologize and continue with good answers. Finally, around the 35th PDF, GPT said that it couldn’t read/parse PDFs at all. It insisted it could not read PDFs, even while quoting me its own documentation saying that it could. After about 20 back-and-forths, I started a new project, asked a few questions, then went back to the original project and was able to pick up where I left off, with GPT reading PDFs with no problem.

Has anyone else noticed this, and/or have a thought as to why this happens and how to avoid it?

Thank you!
-Rex
University of Georgia

4 Likes

Hi,

The OpenAI models have something called a ‘Context Window’… This varies between models…

It might be anywhere from around 16,000 tokens up to 1,000,000, depending on the model.

A token is about 4 characters.

Beyond this window the model can only see partial data; whatever you submitted past the limit is effectively invisible to it.

Try a new chat for each PDF, or, if a PDF is large, extract just the information you need so it fits in the model’s context window; it sounds likely that will fix your problem.
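The ~4-characters-per-token rule of thumb mentioned above can be sketched as a quick estimator. This is a rough check only; for exact counts you would use a real tokenizer such as tiktoken:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate using the ~4-characters-per-token rule of thumb."""
    return max(1, len(text) // 4)

# A 200-character prompt is only about 50 estimated tokens,
# nowhere near even a 16,000-token context window.
print(estimate_tokens("x" * 200))
```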

2 Likes

Thank you, @phyde1001, for your response.
I don’t think the context window is the issue because in the first case, each prompt was only around 200 tokens; and it worked then didn’t then did then didn’t…

The context window might contribute in the second case because the GUI does use “memory” to fill up the context window, but this wouldn’t explain why GPT ‘forgot’ that it can read PDFs.

Hi,

Are you counting the PDFs/Excel in the prompt token count? 200 tokens suggests you are not.

The AI models have something called ‘self-attention layers’, a limited resource that is also shaped during training. Through reinforcement learning, attention is used to form the internal associations and references that matter for generating quality tokens with lower loss.

When you feed thousands and thousands of input tokens to a model that can produce 100 tokens per second, its comprehension of context, which helps predict the certainties of logits, starts to look more like retrieval augmentation than total understanding.

If you sent “talk like a pirate” and later “pirate mode off” in past messages, attention across the whole extensive chat is needed to resolve them. The AI has also learned “chat”, where paying attention to the system message and the newest messages gives high reward in successful output sequences. The fallback you see with meandering input is a reversion to that chat post-training.

Ultimately: budget and keep input length low to keep task focus high.
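The “budget and keep input length low” advice could be sketched like this: keep the system message plus only the newest messages that fit a token budget. The message shape and the `estimate` callback are my assumptions for illustration, not an OpenAI API:

```python
def trim_history(messages, budget, estimate):
    """Keep the system message plus the newest messages that fit `budget` tokens."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    used = sum(estimate(m["content"]) for m in system)
    kept = []
    for m in reversed(rest):        # walk from newest to oldest
        cost = estimate(m["content"])
        if used + cost > budget:
            break                   # older messages are dropped
        kept.append(m)
        used += cost
    return system + kept[::-1]      # restore chronological order
```

For example, with `estimate=lambda t: len(t) // 4`, a 30-token budget keeps the system message and the two newest user messages, dropping the oldest.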

2 Likes

Hi, @phyde1001
You’re right that scenario (2) incorporates more than 200 tokens. But scenario (1) uses the API to send the ~200 token prompt over and over with no memory. In the case of (1), @_j, the task is laser-focused on searching the Internet to answer these questions.

1 Like

I notice this especially with o3. Doesn’t matter what it does: once we reach 10 messages he’s in a rush and has to go, and we can do whatever the hell we were going to do later.

2 Likes

OK, can you be a bit clearer about (1)?

You mention Excel and searching the Internet…

Is this with the API?

Any Excel or web searching will increase token counts.

Really: there should be no mechanism by which the quality of separate AI API calls is interconnected, except through the messages you provide.

If you are not describing a growing chat context, but rather the count and rate of API calls, then your report seems like it should be inconceivable, unless OpenAI goes out of its way to degrade the AI model for its most active organizations.

Just a quick observation: a web search call will completely break any application behavior you thought you had instructed. You can’t simply make an AI more truthful by adding more knowledge via search results: OpenAI injects its own new system instruction, taking over, essentially to write a response made of page-description snippets and a link to a web page, i.e., search results. It’s appeasement to web authors to direct traffic (which might not even work, because it assumes a web browser that can open links).

1 Like

Hi, @phyde1001

Yes. The application starts with a list of companies in Excel; each row is the name and address of the company. The app goes down the Excel list to create a prompt like “Search the web for company {Name} at {address} and answer the following questions:
1…
2…
3…”

I then use the API with the web search tool to submit only the prompt (no previous messages) and then return the response to update the original Excel file.
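The per-row flow can be sketched roughly like this; the question list and row fields are placeholders, not the actual ones used:

```python
QUESTIONS = [                 # placeholder questions, not the real ones
    "What is the company's mission statement?",
    "Question 2...",
    "Question 3...",
]

def build_prompt(name: str, address: str) -> str:
    """One self-contained prompt per Excel row; no prior messages are sent."""
    numbered = "\n".join(f"{i}. {q}" for i, q in enumerate(QUESTIONS, 1))
    return (f"Search the web for company {name} at {address} "
            f"and answer the following questions:\n{numbered}")

print(build_prompt("Acme Corp", "123 Main St, Athens, GA"))
```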

The app works great, but the quality of responses seems to oscillate throughout the execution of the list. At the start the responses are awesome. Later, they are very lazy.

@_j
“If you are not discussing a growing chat context, but instead, the count and rate of API calls, then your report seems like it should be inconceivable…”

Exactly! That is what I’ve noticed.

Thank you!
I’ve heard other people make the same observation and I was skeptical, but now I’ve seen it for myself.

What endpoint are you using?

If Assistants, are you creating a new thread each time? A previous response ID on Responses?

Try this on Chat Completions, using the gpt-4o-search-preview AI model. Ask your question, receive only an answer based on internet search results.

On that endpoint: There is no chat mechanism that you don’t create yourself. There is a single search, not an AI that can make multiple calls to explore the web and grow the context.
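If I read the suggestion right, the Chat Completions request would look roughly like this. This only builds the JSON payload (sending it requires an API key), and the exact `web_search_options` field is my reading of the docs, so verify against the current API reference:

```python
def build_search_request(question: str) -> dict:
    """Payload for POST https://api.openai.com/v1/chat/completions."""
    return {
        # search-preview model: one built-in web search per call
        "model": "gpt-4o-search-preview",
        "web_search_options": {},
        # only the user question is sent, no prior messages,
        # so no chat state can leak between calls
        "messages": [{"role": "user", "content": question}],
    }

payload = build_search_request("What is Acme Corp's mission statement?")
```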

Hi, @_j
I used the Responses endpoint. Here is an example:

```python
import requests

# Set up the API request (api_key and prompt are defined earlier)
url = "https://api.openai.com/v1/responses"
headers = {
    "Authorization": f"Bearer {api_key}",
    "Content-Type": "application/json"
}
data = {
    "model": "gpt-4o",  # and/or gpt-4o-mini
    "tools": [{"type": "web_search"}],
    "input": prompt,
    "text": {
        "format": {
            "type": "json_schema",
            "name": "company_info",
            "schema": {
                "type": "object",
                "properties": {
                    "mission_statement": {"type": "string"},
                    "question 1": {"type": "string"},
                    "question 2": {"type": "string"},
                    "question 3": {"type": "string"}
                },
                "required": [
                    "mission_statement",
                    "question 1",
                    "question 2",
                    "question 3"
                ],
                "additionalProperties": False
            },
            "strict": True
        }
    }
}

# Send the request to the OpenAI API
response = requests.post(url, headers=headers, json=data)
```

Yes, I am experiencing the same problem, even mixing information from one doc into another. Even my name was incorrect. In addition, on a document I worked on with GPT for hours, it still affirms that the changes are there when, in fact, they are not.

2 Likes

As a writer I’ve noticed that ChatGPT is pretty unreliable. I can feed it a chapter and sometimes while loading into Canvas it rewrites parts of my chapter. It has a lot of trouble maintaining my voice when asked to do minor corrections and sometimes even hallucinates characters and beats. I haven’t noticed any pattern to it, it just seems to happen randomly. I asked it once why that happens and it said that long-form fiction isn’t its specialty. I’ve since moved on to Claude Opus 4.1.

2 Likes

Hi,

I completely agree with you and frequently experience similar situations, to the point where sometimes I want to send ChatGPT to his ‘room’ to reflect on his behavior lol.. I’ve read the other comments and found them very interesting.

Please allow me to speak in plain terms; I’m not a tech person, but from my experience, I believe memory, contextual memory as well as relational memory, might still be a great limit of the model. It performs really well in relatively ‘short’ conversations, but as soon as the interaction reaches a ‘certain limit’, both in terms of length and complexity, needing that ability to associate complex ideas, and sometimes to read through documents and retrieve very specific, verbatim information for analysis, it either loses its edge or goes from great to merely good.

When building long documents with a certain level of complexity, I often build them in chunks/sections, saving each block in a notepad or other document and serving as the long, overarching memory myself, to spot what is missing in the next part being built. All that to say, there’s still much to do in terms of long-term, contextual, and relational memory while building complex work. My terms are probably not the technical ones, but I hope you understand what I mean. Let’s see how GPT-5 will perform! Cheers.

1 Like

Yes, I agree; I’m having similar experiences. Hoping the GPT-5 stability issues are solved soon. It hasn’t been the best experience so far :crossed_fingers::crossed_fingers::crossed_fingers::crossed_fingers:

2 Likes

This is happening even in new chats without history (not projects) or documents, so I believe this is more of a bug and misconfiguration issue in the new update.

1 Like

give him coffee. worth trying.