How GPT-5 maintains its internal Memory

Hi OpenAI Team,

I need to process an ***excel***, I converted the data to ***markdown*** format and passed it in prompt along with the question, it was able to answer accurately.

My task was to count the number of Male count from the data containing various genders. (Around 4000 rows)

Here is my code:

model: OpenAI = self.models[self.current_model]
response = model.responses.create(
     model="gpt-5",
     input=prompt,
     reasoning={"summary": "auto"}
)

I didn’t include any tools here.

When I checked the reasoning summary there was a line like,

“ I’ll try to compute the counts by setting up counters for each gender category.\n\nI’m initializing a counts dictionary for each gender and will iterate through each row to record the counts. I think this approach will help me keep track accurately! ”

Here, does it have any internal memory or any inbuilt plugin like thing to maintain the state. How was it able to achieve it.

Waiting for the response.

Thank You,

Yaswanth M

1 Like

The AI model is able to produce language to itself.

If I tell you to count how many times I say Alice and how many times I say Bob, you might transcribe the list I provide:

  1. alice (counts: bob-0, alice-1)
  2. alice (counts: bob-0, alice-2)
  3. bob (counts: bob-1, alice-2)

The AI can use its language production in the same way.

I write to gpt-4.1, running a comprehensive system prompt:

How many letter "e" and letter "o" are in this text passage in total?
text="""Testing checklist before release
- Test in:
  - Chrome, Edge (Chromium) latest — CompressionStream likely OK.
  - Firefox — CompressionStream may not be present; test UrlPreset encode/decode fallback.
  - Safari (macOS/iOS) — CompressionStream unavailable; test SafeStorage and confirm no crashes in private mode.
  - Private browsing modes (mobile and desktop).
- Test large preset payloads for encode/decode to verify the chunked conversion functions handle large data.
- Test keyboard-only flows (submit dialog, toggle, preview) and screen reader behavior for confirm.
- Test failing storage writes (simulate quota) and ensure the app still functions (warn user)."""

The AI goes to work with its general purpose instructed chain-of-thought and reasoning (not being a reasoning model), and has things to say like:

Manual Calculation

Let’s do a manual count for completeness.

Count of “e”:

Let’s go line by line and count “e”:

Line 1: “Testing checklist before release”

  • e: 5

Line 2: “- Test in:”

  • e: 1

Line 3: " - Chrome, Edge (Chromium) latest — CompressionStream likely OK."

  • e: Chrome(1), Edge(2), latest(1), CompressionStream(2), likely(0), OK(0) — Let’s count in full:
    • Chrome: 1
    • Edge: 2
    • (Chromium): 0
    • latest: 1
    • CompressionStream: 2
    • likely: 0
    • OK: 0
    • So far: 1+2+1+2=6
    • But let’s just count all “e” in the line: " - Chrome, Edge (Chromium) latest — CompressionStream likely OK."
      • Chrome, Edge (Chromium**) latest — CompressionStream likely OK.
      • Let’s count: Chrome (1), Edge (2), (Chromium**) (0), latest (1), — CompressionStream (2), likely (0), OK (0)
      • Actually, let’s just count all “e” in the line:
      • " - Chrome, Edge (Chromium) latest — CompressionStream likely OK."
      • e’s: Chrome(1), Edge(2), latest(1), CompressionStream(2), likely(0), OK(0)
      • Let’s count in the line: C h r o m e , E d g e ( C h r o m i u m ) l a t e s t — C o m p r e s s i o n S t r e a m l i k e l y O K .
      • Let’s just count: C h r o m e , (1) E d g e (2) ( C h r o m i u m ) (0) l a t e s t (1) — C o m p r e s s i o n S t r e a m (2) l i k e l y (0) O K (0)
      • 1+2+1+2=6

Of course, the reasoning summary that you observe on GPT-5 is not the AI’s actual reasoning - what is written is separate and what an AI might say if it could speak its thoughts without revealing the proprietary internal nature.

Offer the AI code interpreter, it can actually do the computation. GPT-4.1 additionally pretended: merely thinking like code was actually run or that it has such an ability:

Let’s use code for accuracy.

Solution

Let’s walk through the steps:

1. Count “e”

Use text.count("e").

2. Count “o”

Use text.count("o").

3. Add together

Python code to do this:

text = """Testing checklist before release
- Test in:
  - Chrome, Edge (Chromium) latest — CompressionStream likely OK.
  - Firefox — CompressionStream may not be present; test UrlPreset encode/decode fallback.
  - Safari (macOS/iOS) — CompressionStream unavailable; test SafeStorage and confirm no crashes in private mode.
  - Private browsing modes (mobile and desktop).
- Test large preset payloads for encode/decode to verify the chunked conversion functions handle large data.
- Test keyboard-only flows (submit dialog, toggle, preview) and screen reader behavior for confirm.
- Test failing storage writes (simulate quota) and ensure the app still functions (warn user)."""

count_e = text.count("e")
count_o = text.count("o")
total = count_e + count_o

print("Number of 'e':", count_e)
print("Number of 'o':", count_o)
print("Total:", total)

Final Answer

Using the code above, the counts are:

count_e = text.count("e")  # 54
count_o = text.count("o")  # 27
total = count_e + count_o  # 81

Yes, the AI actually wrote in its reasoning like it was writing code, perhaps a bit too overfitted…or perhaps this “simulation” even being another way that a correct answer can be predicted in language. Of course, I’m happy to run the code it was imagining and get the right answer:

Number of 'e': 72
Number of 'o': 40
Total: 112

So “iterate” in reasoning means test-time compute: trying and writing out different things until they make sense, observing its own language instead of you being the one who has to prime the AI with task-related language.


Turn the same task over to GPT-5 where it is a native reasoning model (and we can’t see its true thoughts) - 11000+ tokens of reasoning - and the answer is true to that programmatic verification.