GPT-5.3-Codex-Spark: fast model, but /goal state seems fragile after context compaction

I’ve been testing gpt-5.3-codex-spark in Codex on a real repository workflow, and I wanted to share an observation that may be useful for the Codex team or other users building longer-running /goal loops.

First: the model is blazingly quick. For short, bounded tasks it feels very strong. It moves fast, inspects quickly, and can get through small code/test iterations with very little friction.

The issue I’m seeing is with longer /goal workflows where the task depends on remembering already-validated state.

In one recent run, I gave it a goal that required it to:

  • inspect the repo state
  • make focused test-hardening changes
  • run lint/tests
  • validate the result
  • continue only if something remained unvalidated

It successfully ran a series of lint and test commands. Then, after a context compaction event, it appeared to lose track of the fact that those commands had already completed successfully.

After compaction, it resumed as though it still lacked a “validated test result”, even though it had already produced one earlier in the same /goal run. The practical result was a loop:

run tests
→ tests pass
→ context compacts
→ model no longer trusts / remembers the pass
→ rerun tests
→ still believes validation is missing
→ repeat

This is painful because the failure mode is not that the model cannot do the work. It can. The failure is that the /goal state does not seem to survive context compaction robustly.

The model’s small/fast context profile seems to conflict with goal-oriented agentic workflows where the important state is not just “what files changed”, but also:

  • which validation commands have already run
  • exact pass/fail results
  • which findings were already resolved
  • which checks remain genuinely outstanding
  • whether the task is in implementation mode, validation mode, or landing mode

In other words, after compaction the model may remember the broad goal, but lose the execution ledger needed to avoid repeating itself.
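To make "execution ledger" concrete, here is a minimal sketch of what that state could look like as a data structure. This is a hypothetical illustration, not anything Codex exposes. The `ExecutionLedger` class, its fields, and the idea of tagging each result with a workspace revision (for example, a hash of the current `git diff`) are all my assumptions. The point is that "has this check already passed?" becomes a lookup rather than something reconstructed from compressed prose.

```python
from dataclasses import dataclass, field
from enum import Enum


class Mode(Enum):
    # Hypothetical task modes, taken from the list above.
    IMPLEMENTATION = "implementation"
    VALIDATION = "validation"
    LANDING = "landing"


@dataclass
class CheckResult:
    command: str
    passed: bool
    # Hypothetical: an identifier for the code state when the command ran,
    # e.g. a hash of the working-tree diff.
    revision: str


@dataclass
class ExecutionLedger:
    mode: Mode = Mode.IMPLEMENTATION
    results: dict[str, CheckResult] = field(default_factory=dict)
    resolved_findings: set[str] = field(default_factory=set)
    outstanding: list[str] = field(default_factory=list)

    def record(self, command: str, passed: bool, revision: str) -> None:
        # Keep only the most recent result per command.
        self.results[command] = CheckResult(command, passed, revision)
        # A passing check is no longer outstanding.
        if passed and command in self.outstanding:
            self.outstanding.remove(command)

    def needs_run(self, command: str, revision: str) -> bool:
        # Rerun only if the command never ran, last failed,
        # or the code changed since it last passed.
        prior = self.results.get(command)
        return prior is None or not prior.passed or prior.revision != revision
```

With this shape, a post-compaction agent asking `ledger.needs_run("pytest -q", current_revision)` would get `False` for a check that already passed on unchanged code, which is exactly the answer the loop above fails to reach.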

The issue is structural: a fast Codex model can become expensive or counterproductive in long /goal workflows if compaction causes it to forget validated state and re-enter loops.

A possible mitigation might be for /goal runs to maintain a compact, durable task ledger outside the ordinary conversational context, something like:

Validated:

  • command A: passed at
  • command B: passed at
  • reviewer X: no blockers
  • git diff checked

Outstanding:

  • command C not yet run
  • PR body not yet generated

Mode:

  • validation / landing / blocker-fix / paused

Then after compaction, the model would have a reliable execution state rather than reconstructing progress from compressed prose.
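As a rough sketch of the proposal, an orchestrator could render the ledger into the compact block above and re-inject it verbatim after every compaction. The `Entry` type and `render_ledger` function are hypothetical, and nothing here reflects how Codex actually handles compaction. The allowed modes simply mirror the list in the example, and the status strings (including any timestamp) are whatever the caller supplies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    label: str   # e.g. a command, a reviewer, or "git diff" (hypothetical)
    status: str  # e.g. "passed at 12:04" or "no blockers"


ALLOWED_MODES = {"validation", "landing", "blocker-fix", "paused"}


def render_ledger(validated: list[Entry], outstanding: list[str], mode: str) -> str:
    """Render the ledger as plain text suitable for re-injection after compaction."""
    if mode not in ALLOWED_MODES:
        raise ValueError(f"unknown mode: {mode}")
    lines = ["Validated:"]
    # Emit an explicit "(none)" so an empty section is not mistaken for lost state.
    lines += [f"  • {e.label}: {e.status}" for e in validated] or ["  • (none)"]
    lines.append("Outstanding:")
    lines += [f"  • {item}" for item in outstanding] or ["  • (none)"]
    lines.append(f"Mode: {mode}")
    return "\n".join(lines)
```

Because the output is deterministic text rather than a summary written by the model, it would not drift from one compaction to the next.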

Another possible user-side workaround is to explicitly force the model to maintain a VALIDATION_LEDGER.md or similar file during long goals, but it feels like this is a core /goal orchestration problem rather than something every user should have to hand-roll.
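For anyone trying the user-side workaround in the meantime, a minimal sketch might wrap each validation command so its outcome is appended to `VALIDATION_LEDGER.md` with a UTC timestamp, and then check the last recorded status before rerunning. The file format and the `run_and_record` / `last_status` helpers are hypothetical; this is not a Codex feature, and unlike a revision-aware ledger it does not notice whether the code changed since the last pass.

```python
import shlex
import subprocess
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def run_and_record(argv: list[str], ledger: Path) -> bool:
    """Run a validation command and append its outcome to a markdown ledger file."""
    proc = subprocess.run(argv, capture_output=True, text=True)
    passed = proc.returncode == 0
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    status = "passed" if passed else f"failed (exit {proc.returncode})"
    with ledger.open("a", encoding="utf-8") as f:
        f.write(f"- `{shlex.join(argv)}`: {status} at {stamp}\n")
    return passed


def last_status(argv: list[str], ledger: Path) -> Optional[str]:
    """Return "passed" or "failed" for the most recent entry for this command, or None."""
    if not ledger.exists():
        return None
    prefix = f"- `{shlex.join(argv)}`: "
    status = None
    # Later lines overwrite earlier ones, so the final value is the latest entry.
    for line in ledger.read_text(encoding="utf-8").splitlines():
        if line.startswith(prefix):
            status = "passed" if line[len(prefix):].startswith("passed") else "failed"
    return status
```

Usage would be something like `run_and_record(["pytest", "-q"], Path("VALIDATION_LEDGER.md"))`, plus an instruction in the goal prompt to consult the file before rerunning anything. It works, but it is exactly the kind of hand-rolled scaffolding that arguably belongs in /goal itself.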

Has anyone else seen this with gpt-5.3-codex-spark or other smaller/faster Codex models? In particular, have you noticed post-compaction loops where the model repeats tests or validation because it no longer trusts the earlier result?