Is this a world record? GPT-5.5 ran for 19 hours

GPT 5.5 has been operating autonomously for 19 hours.

And the answer was …

… 42?

Hopefully it will stop when it reaches 42…

Oh god… Just typing “continue” in Codex is not a good idea. You’re asking for trouble, and with 5.5 you’re literally throwing away money. Please tell me you at least used an exit-strategy prompt, for example: “continue. If blocked on a specific user decision/action, output exactly one line: [[WAITING_FOR_INPUT:]].”

These bots require precise prompting. What you’re doing lets it run off on tangents and keep coding indefinitely. That’s why they’re saying 42 to you.
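To make the exit-strategy idea concrete, here is a minimal sketch of how a wrapper script might react to that sentinel line. It is only an illustration: the helper names, the optional reason text after the colon, and the `max_turns` budget are assumptions of this sketch, not anything Codex provides.

```python
import re

# Sentinel format taken from the prompt quoted above. Allowing a reason
# after the colon is an assumption of this sketch.
SENTINEL = re.compile(r"^\[\[WAITING_FOR_INPUT:(.*)\]\]$")


def find_wait_request(agent_output: str):
    """Return the reason text if any line is exactly the sentinel, else None."""
    for line in agent_output.splitlines():
        match = SENTINEL.match(line.strip())
        if match:
            return match.group(1).strip()
    return None


def should_continue(agent_output: str, turns_used: int, max_turns: int) -> bool:
    """Hypothetical stop rule: halt on a wait request or when the budget is spent."""
    if find_wait_request(agent_output) is not None:
        return False
    return turns_used < max_turns
```

A driver loop would call `should_continue` after every turn and only send the next “continue” when it returns `True`, so a blocked agent stops instead of burning tokens.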

Everything is under control :smiley:

This is not raw “continue forever” chaos. It’s running through a slice-based governance flow: COORDINATION.md, project_map.md, proposal guards, static gates, commits, clean working tree checkpoints. The screenshot says enough.

Slice 700: Universe Coordination Boundary Review After Meaning-of-Life Checkpoint
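For anyone wondering what a “clean working tree checkpoint” could look like in practice, here is a rough sketch, not the actual setup from this project. It relies on `git status --porcelain`, which prints nothing when there are no uncommitted changes; the function names and the rule in `may_start_next_slice` are hypothetical.

```python
import subprocess


def is_clean(porcelain_output: str) -> bool:
    """A working tree is clean when `git status --porcelain` prints nothing."""
    return porcelain_output.strip() == ""


def working_tree_is_clean(repo_dir: str) -> bool:
    """Run git in repo_dir and report whether there are no uncommitted changes."""
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return is_clean(result.stdout)


def may_start_next_slice(gates_passed: bool, tree_clean: bool) -> bool:
    """Hypothetical checkpoint rule: static gates passed and nothing uncommitted."""
    return gates_passed and tree_clean
```

The point of gating on a clean tree is that every slice ends in a commit, so a bad slice can be reverted without untangling half-finished work.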

image

It’s a lot of turns. I’d be more impressed if it was 1. :laughing:

With /goal in Codex you can run incredibly long turns.

My AI assistant wrote a bit about how it works.

5.5 is my favorite model, at medium reasoning it is very capable and fast.

image

Thanks for sharing this, Sam. That actually makes sense.

I left the agent running on its own and went to sleep. Judging by the finish time, it worked for almost a full 24 hours.

When I asked it to check how much of the original task was actually completed, it estimated around 65–70%. So I guess it needs another 24 hours. :grinning_face_with_smiling_eyes:

What you wrote about /goal explains the behavior pretty well. It feels less like a single huge model response and more like a goal-driven runtime that can keep working over a long session.

For my project, though, I’m using GPT-5.5 on extra high reasoning instead of medium. The codebase is around 1.4 million lines now, so I don’t really want to risk it. In these long autonomous runs, one wrong decision can hurt.

About 30% of the project was built with GPT-5.3, 40% with GPT-5.4, and the last 30% with GPT-5.5. At this scale, medium reasoning feels like a risk I don’t need to take.

Is it better than 5.4 High?

I’m kinda scared to do anything less than 5.4 medium.

You guys have been having success with 5.3?

Does anyone decide ahead of time which model is acceptable for a given task or goal?

When it comes to autonomous work and getting a lot done with fewer tokens, I think GPT-5.5 should be preferred.

Other than that, even GPT-5.2 can still produce very good results for project development. When I started my project, GPT-5.3 was available, so I used that. When GPT-5.4 came out, I switched to it, and now I’m continuing with GPT-5.5.

In my opinion, the main difference is how much effort you need to put into managing the project, and how much effort the model can carry by itself.

Older models can still produce high-quality work, but you need to guide them more tightly. I don’t think the improvement is only about a huge jump in code quality. It feels more like the supervision cost has gone down.

This is also a psychological thing for us as users. When the iPhone 17 comes out, the iPhone 16 suddenly starts to feel old, even though it is still a very capable device.

Of course, I don’t know how OpenAI trains or improves these models internally, so I don’t want to present guesses as facts. But from the outside, it feels like the quality is increasingly coming from the whole system around the model: post-training, context handling, reasoning behavior, tool use, orchestration, and so on.

For example, in GPT-5.5, it feels like there is stronger context tracking or some kind of better memory/compression behavior. The model seems much better at remembering small project details and continuing with the same architectural logic over long sessions.

So I don’t think this is only about “how many parameters does the model have?” anymore. The pipeline around the LLM seems to matter a lot.

And even with older models, you can get closer to this quality by building strong .md documentation inside your project. Architecture notes, rules, file maps, decisions, and acceptance criteria can make the model behave much smarter because they give it a clearer operating frame.
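As a rough illustration of that “operating frame” idea, the sketch below gathers whichever of those documents exist into one block of text and reports the ones that are missing. The file names are hypothetical placeholders, one per document type mentioned above; how the resulting text is handed to the model is left open.

```python
from pathlib import Path

# Hypothetical file names, one per document type listed above.
FRAME_DOCS = [
    "ARCHITECTURE.md",
    "RULES.md",
    "project_map.md",
    "DECISIONS.md",
    "ACCEPTANCE.md",
]


def build_operating_frame(project_dir: str, names=FRAME_DOCS):
    """Return (frame_text, missing_names) for the docs found in project_dir."""
    sections, missing = [], []
    for name in names:
        path = Path(project_dir) / name
        if path.is_file():
            body = path.read_text(encoding="utf-8").strip()
            sections.append(f"## {name}\n{body}")
        else:
            missing.append(name)
    return "\n\n".join(sections), missing
```

The list of missing files is useful on its own: it shows which parts of the frame the model currently has to guess.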

Yes. To prove this, take away the reasoning capacity (via API parameters), and you can see that LLMs alone have hardly moved on in 2 years.

Are you kidding? 5.5 eats tokens as fast as pro does in the API and CLI, and I noticed nothing different when I actually did use it.

Hey @Rotto, cheering and looking forward to the completion 🙌

This is not my lived experience.

Yesterday I ran easily for 30 minutes to an hour straight on a single context window.

729 messages, 43M cached tokens, context window at 241k, zero compactions up to that point, and so many turns.

I find 5.5 medium incredibly token efficient, much more so than 5.4 or 5.3, which would compact far more often.

That makes sense for a cached long session, and I agree that 5.5 medium can feel very efficient when the context is being reused well. My point was more about total production cost once you start using it heavily through API and CLI across real workloads. Cached context helps a lot, but at scale the raw token volume still matters.
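A back-of-the-envelope way to see the difference is to price cached and uncached input tokens separately. The sketch below does only that arithmetic; the two prices are made-up placeholders, not OpenAI's actual rates, and the 43M volume simply echoes the figure quoted earlier in the thread.

```python
def session_cost(uncached_tokens, cached_tokens, price_per_m, cached_price_per_m):
    """Input cost in currency units, with prices quoted per million tokens."""
    total = uncached_tokens * price_per_m + cached_tokens * cached_price_per_m
    return total / 1_000_000


# Placeholder prices per million input tokens (NOT real rates).
PRICE_PER_M = 10.0
CACHED_PRICE_PER_M = 1.0

# Same illustrative volume, fully cached versus not cached at all.
fully_cached = session_cost(0, 43_000_000, PRICE_PER_M, CACHED_PRICE_PER_M)
no_cache = session_cost(43_000_000, 0, PRICE_PER_M, CACHED_PRICE_PER_M)
```

With real rates substituted in, the gap between `fully_cached` and `no_cache` shows how much of a long session's efficiency comes from caching rather than from the model itself.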

Absolutely. Where this falls apart pricing-wise is in business scenarios such as weekly reports and large bag-of-words analysis, where there is no caching.

You feed a giant bag of words in and expect analysis on it.

It forces you into slightly different workflows to keep costs down: lean on 5.4 mini, for example, and then do agentic validation with 5.5.

It’s fascinating how advancements in models and harnesses over recent years have both expanded and reshaped the ways we develop our own skills.

Thanks for sharing!

I wonder how you managed such a long run. My Codex stops every 2 minutes and asks and asks, even when I say “do not ask me, I’m going to sleep”. Any hints?