What on earth is the story with Codex and GitHub?

I really want to like Codex and obviously you guys are pouring a tonne of capital resources into it. But the reality is that using it as a professional is very hard.

1. What is going on with PRs and naming?

Obviously naming isn’t OpenAI’s forte and in models it’s kind of funny. But in branches it’s not. Branches are created with arbitrary names, so it’s super unclear at any given point whether you’re opening a new branch or updating an existing one. Sometimes you’re forced to switch to a new branch halfway through.

Writing code is fast, but following what’s going on on GitHub is a nightmare.

2. If you use GitHub then that’s where you want to interface with Codex

I guess someone decreed that you should be able to use Codex from mobile. Which is great. And it’s nice that I can get a job going late at night from my phone and wake up to it delivered in the morning.

But which branch or PR that job was delivered to is a mystery. There’s often no explanation of what was done or why. And since Codex is at best lazy and at worst completely ignores the job it was supposed to do, the first thing you need to tell it in the morning may be to do the job it was drafted to do the previous night.

The UX to do code reviews, to suggest changes, to see diffs between code reviews and to suggest follow-ups already exists on GitHub.

Anthropic has the right idea. I prefer coding locally with your models. I REALLY like coding with 4.1 and I love the idea of free usage of o3 remotely but the reality is that both the codex model and the tooling around it feel way inferior to coding locally with Windsurf and 4.1 (even though Codex is supposed to be using a larger model).

3. Why does Codex reinvent the environment setup file?

The Codex setup file doesn’t live in your filesystem; it’s copied and pasted into Codex, so there’s no source control. You can’t view the changes easily on GitHub, and when you open multiple branches it’s anyone’s guess which one is which. You can’t test your setup file, so if it breaks you have to do the whole thing over again (and did I mention it’s not in source control?). And it’s entirely proprietary.

Many of us have already defined such environments in CI scripts and Dockerfiles. Why ask for it to be done again inside your own environment? It requires more time and is another decoupled source of maintenance.
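To make the point concrete, here’s a hypothetical sketch of what I’d rather paste into Codex: a one-line wrapper that delegates to a bootstrap script already in source control. The path `./scripts/ci-setup.sh` is illustrative, not a real Codex convention.

```shell
#!/usr/bin/env bash
# Hypothetical Codex setup step: delegate to the bootstrap script the CI
# pipeline already runs, so the environment is defined once, in the repo.
# The path ./scripts/ci-setup.sh is an assumption for illustration.
set -euo pipefail

if [ -f ./scripts/ci-setup.sh ]; then
  # Reuse the existing, already-tested CI bootstrap.
  bash ./scripts/ci-setup.sh
else
  echo "ci-setup.sh not found; environment must be defined by hand" >&2
fi
```

That way the setup is versioned, diffable on GitHub, and testable in CI like everything else, instead of being a second copy that drifts.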

It feels like this was all developed by developers who don’t need to integrate with other services or manage their own CI. But if you have Postgres and Redis and queues and frontend builds and full-stack tests, then this is hard.

4. Codex is often lazy

It is SO HARD to get Codex to actually do the work that you ask it to do. I find this astonishing because 4.1 in Windsurf is incredibly good at following instructions. And o3 is very powerful. And yet Codex feels like neither. It frequently ignores instructions entirely. Or it decides the job is too long and quits without doing any of it.

Codex needs to break tasks down, to create todo lists, to follow them through, and when it doesn’t complete them, to suggest spinning up another job. It needs the self-awareness to know whether it has actually done the work involved. And the humility to apologise and offer suggestions for change when it hasn’t.

I really, really want to like Codex, but right now it feels very rough.

But I hope you get it right as you have some amazing models (4.1 is really exceptional) and this is clearly the future of software development.

Peter

Here’s a case in point. I have three jobs open at Codex:

For some of these I have asked the Codex task to refine its original task and do more work. Now ideally the result would be either the same PR updated with the new code OR a new PR opened to pull the new batch of code into the original branch. And if there is a new PR, I’d expect its name to reflect the PR it’s being merged into.

Instead, these are the PRs that Codex opens up for me:

  • Document mail interface status
  • feat: add docs viewer
  • docs: expand angular docs

And it’s also created four (4!) new branches:

  1. d92ewq-codex/document-full-angular-project-angular-frontend
  2. codex/document-legacy-rails-email-interface
  3. yr5f5x-codex/document-legacy-rails-email-interface
  4. codex/create-documentation-viewer-and-compile-script

There is so little consistency. The original Codex requests do, just about, seem to match the branch names - although what is the relationship between branch 2 and branch 3? I assume that branch 3 is probably a refinement of branch 2 and meant to be merged into it, but frankly I’ve no idea.

But can you tell with confidence which PR title matches which task issued at Codex? I can’t. The tasks are all subtly different, and because the PRs have been arbitrarily named I can’t be sure which PR relates to which task.

And then finally I have to match up the PRs to the branches. Since neither the branch names nor the PR titles tell me much, I’m really starting to get stuck.

I know you have dreams of people firing off a job a minute but I cannot comprehend in what world that might be happening. You can’t even follow what’s going on between three of them!

I’m off to try Claude Code instead. I’ve got a lot of loyalty to OpenAI but the current state of Codex really is a mess.

Hi, I just wanted to confirm what you found out:

  • Codex and PR = nightmare
  • Codex is super lazy

I find that both OpenAI and Anthropic vastly “embellish” reality. Their tech surely is great - they allow the public to have access to incredible models - but their marketing is borderline malpractice.

Regarding Claude Code, I was really angry, because they implied that the $100 / $200 plans are usable in real production, which won’t be the case until the context length is at least 500k. For the moment, you need so much prompt engineering that it is just easier to use a normal LLM. Gemini 2.5 with its 1M-token context length is a fantastic experience for coding: it doesn’t need to launch 50 tools to search files and read them partially, it just reads what is in the context; though it ends up having problems with attention.

Anthropic toned down their marketing claims a little, and they allowed Claude Code in their $20 plan, roughly at the same time as OpenAI did with Codex and Plus users, so… we are left with:

  • one agent that easily goes off on an adventure when asked for a feature, is great at planning and understanding the big picture, but forgets everything after 2 autocompacts while victoriously hallucinating that “perfect! everything is implemented”
  • and another that makes 3 PRs for every semicolon changed, and does half-steps even when we insist on climbing a full flight of stairs. I ended up just using the patch feature (applying the modifications to a local IDE through a copy-paste patch).

There is still a long way to go before these agents are usable, as advertised, in real production. The “cash grab scam” period of Anthropic really damaged my confidence in that company, but to be honest, OpenAI also had Codex exclusively on their $200 plan for a few weeks.

I’ll just wait for this tech to improve significantly before becoming a cash cow again.