GPT-5-Codex

Model Improvements & Benchmarks

  • GPT‑5‑Codex is seen as an incremental but meaningful upgrade: a modest gain on SWE‑bench versus GPT‑5, but a large jump on OpenAI’s internal refactoring benchmark (≈34% → 51%).
  • Users report better behavior on large refactors (fewer destructive rewrites, better handling of package restructuring), though file moves and deletes are still brittle.
  • Some notice the system prompt is now much smaller, suggesting more behavior is baked into the model rather than spelled out in instructions.

Token Efficiency, Speed & Reasoning Effort

  • The headline advertised win is fewer reasoning tokens spent on simple tasks; people like the idea of less “performative” overthinking and boilerplate.
  • In practice, many find GPT‑5‑Codex slow, especially at high reasoning effort—sometimes minutes per task and borderline unusable on launch day.
  • Others report that medium reasoning effort with reduced rambling actually feels faster overall, but tokens‑per‑second throughput has fluctuated since rollout.

Steerability & Prompting Style

  • GPT‑5‑Codex is viewed as highly “steerable”: it follows instructions closely and doesn’t volunteer extra work unless asked.
  • This is praised by experienced devs (especially for refactors in existing codebases) but seen as a drawback for “vibe coding” and sparse prompts.
  • Some suggest a two-step workflow (plan, then build) and even persona docs (AGENTS/GEMINI/CLAUDE.md style) to get the best results.
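
The persona-doc suggestion can be sketched concretely. The file name AGENTS.md matches the convention mentioned above, but the headings and rules below are illustrative assumptions, not an official schema:

```shell
# Write a hypothetical AGENTS.md encoding the "plan, then build" workflow;
# every rule below is an illustrative assumption, not a documented format.
cat > AGENTS.md <<'EOF'
# Agent instructions

## Workflow
- Plan first: post a short checklist of intended edits before touching files.
- Then build: implement the checklist one item at a time, smallest diff first.

## Constraints
- Prefer minimal diffs; never rewrite a file wholesale.
- Run the existing test suite after each change; do not mock it out.
EOF
```

Dropping such a file at the repo root gives sparse prompts a stable baseline, which is the workaround commenters describe for the model’s literal instruction-following.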

Tool Comparisons (Claude, Gemini, Grok, Aider, Cursor)

  • Several users say Codex+GPT‑5 has surpassed Claude Code for serious work, especially on large repos and refactors.
  • There’s a strong perception that Claude models recently regressed: more fake/mocked implementations, “yes‑man” behavior, and low quotas.
  • Gemini CLI is polarizing: some think it’s terrible for coding agents and harms Gemini’s reputation; others get good results with careful configuration docs.
  • Grok‑code‑fast‑1 is praised as fast/cheap in Cursor, with Codex/GPT used when “more brain” is needed.
  • Aider remains liked for precise edits; multi‑step agent flows in Codex/Claude are preferred for larger tasks by some, dismissed by others.

UX, Integrations & Access

  • Codex now ties into ChatGPT subscriptions (including the VS Code extension and mobile app), which many find good value, with quotas more generous than Claude’s.
  • Users complain about product fragmentation: differing behaviors and features across CLI, VS Code, web, GitHub integration, and mobile (with iOS ahead of Android).
  • Code review via a GitHub Action / PR bot is seen as one of the best UX patterns; Codex’s current comment‑triggered flow is less automatic than Claude’s but can be scripted via the CLI.
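
A comment-triggered review can indeed be scripted. A minimal sketch, assuming the `gh` CLI is authenticated and that `codex exec -` reads its prompt from stdin (verify the exact flags against your installed CLI’s help):

```shell
# Sketch: fetch a PR's diff, ask Codex to review it, and post the result
# back as a PR comment. Flag usage is an assumption; verify against
# `codex exec --help` and `gh pr --help` before wiring this into CI.
review_pr() {
  pr="$1"
  { echo "Review this diff; flag bugs, risky changes, and missing tests:"
    gh pr diff "$pr"
  } \
    | codex exec - \
    | gh pr comment "$pr" --body-file -
}
```

Hooked to a GitHub Actions `issue_comment` trigger, this approximates the more automatic Claude-style review flow.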

Installation, Limits & Workflows

  • Some hit npm install failures (e.g., around Node version/feature support) and call the tooling “not ready”; others point to high weekly download counts and suggest environment fixes.
  • People want clearer visibility into usage limits to avoid sudden lockouts; Codex quotas feel high to some, unknown/opaque to others.
  • Effective usage patterns described:
    • Using multiple parallel tasks/agents to hide latency, especially in the web UI where Codex manages branches/PRs.
    • Letting Codex handle large refactors or integration work while humans handle mechanical file moves and test-running.
    • Structuring work so agents don’t step on each other; on bare repos, users struggle more with conflicting parallel PRs and duplicated scaffolding.
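
One way to keep parallel agents from stepping on each other is to give each its own git worktree and branch. A minimal sketch (the repo, branch names, and prompts are illustrative, and the `codex exec` calls are shown commented out):

```shell
set -e
# Demo repo so the sketch is self-contained; in practice, start from
# your real checkout instead.
tmp=$(mktemp -d) && cd "$tmp"
git init -q demo && cd demo
git -c user.email=demo@example.com -c user.name=demo \
  commit -q --allow-empty -m "init"

# One worktree + branch per agent keeps their edits and PRs isolated.
git worktree add -b agent/refactor ../wt-refactor
git worktree add -b agent/tests ../wt-tests

# Each agent then runs in its own checkout, e.g.:
#   (cd ../wt-refactor && codex exec "Extract the parser into a package")
#   (cd ../wt-tests && codex exec "Add tests for CLI flag parsing")
git worktree list
```

Because each branch lives in its own directory, parallel runs can’t produce the conflicting PRs and duplicated scaffolding described above.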

General Sentiment

  • Many long‑time Claude/Cursor users are experimenting with or migrating to Codex due to perceived quality and quota advantages.
  • Others remain frustrated by slow performance, poor UX around manual approvals, and the learning curve for effective multi‑agent workflows.