GPT-5-Codex

Model Improvements & Benchmarks

  • GPT‑5‑Codex is seen as an incremental but meaningful upgrade: a modest gain on SWE‑bench versus GPT‑5, but a large jump on OpenAI’s internal refactoring benchmark (≈34% → 51%).
  • Users report better behavior on large refactors (fewer destructive rewrites, better handling of package restructuring), though file moves and deletes are still brittle.
  • Some notice the system prompt is now much smaller, suggesting more behavior is baked into the model rather than spelled out in instructions.

Token Efficiency, Speed & Reasoning Effort

  • The headline advertised win is fewer reasoning tokens spent on simple tasks; people like the idea of less “performative” overthinking and boilerplate.
  • In practice, many find GPT‑5‑Codex slow, especially at high reasoning effort—sometimes minutes per task and borderline unusable on launch day.
  • Others report that medium reasoning effort with reduced rambling actually feels faster overall, but tokens‑per‑second throughput has fluctuated since rollout.

Steerability & Prompting Style

  • GPT‑5‑Codex is viewed as highly “steerable”: it follows instructions closely and doesn’t volunteer extra work unless asked.
  • This is praised by experienced devs (especially for refactors in existing codebases) but seen as a drawback for “vibe coding” and sparse prompts.
  • Some suggest a two-step workflow (plan, then build) and even persona docs (AGENTS/GEMINI/CLAUDE.md style) to get the best results.
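
The persona-doc suggestion can be sketched concretely. The file name AGENTS.md matches the convention mentioned above, but the headings and rules below are illustrative assumptions, not an official schema:

```shell
# Write a hypothetical AGENTS.md encoding the "plan, then build" workflow;
# every rule below is an illustrative assumption, not a documented format.
cat > AGENTS.md <<'EOF'
# Agent instructions

## Workflow
- Plan first: post a short checklist of intended edits before touching files.
- Then build: implement the checklist one item at a time, smallest diff first.

## Constraints
- Prefer minimal diffs; never rewrite a file wholesale.
- Run the existing test suite after each change; do not mock it out.
EOF
```

Dropping such a file at the repo root gives sparse prompts a stable baseline, which is the workaround commenters describe for the model’s literal instruction-following.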

Tool Comparisons (Claude, Gemini, Grok, Aider, Cursor)

  • Several users say Codex+GPT‑5 has surpassed Claude Code for serious work, especially on large repos and refactors.
  • There’s a strong perception that Claude models recently regressed: more fake/mocked implementations, “yes‑man” behavior, and low quotas.
  • Gemini CLI is polarizing: some think it’s terrible for coding agents and harms Gemini’s reputation; others get good results with careful configuration docs.
  • Grok‑code‑fast‑1 is praised as fast/cheap in Cursor, with Codex/GPT used when “more brain” is needed.
  • Aider remains liked for precise edits; multi‑step agent flows in Codex/Claude are preferred for larger tasks by some, dismissed by others.

UX, Integrations & Access

  • Codex now ties into ChatGPT subscriptions (including the VS Code extension and mobile app), which many find good value, with quotas more generous than Claude’s.
  • Users complain about product fragmentation: differing behaviors and features across CLI, VS Code, web, GitHub integration, and mobile (with iOS ahead of Android).
  • Code review via a GitHub Action / PR bot is seen as one of the best UX patterns; Codex’s current comment‑triggered flow is less automatic than Claude’s but can be scripted via the CLI.
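
A comment-triggered review can indeed be scripted. A minimal sketch, assuming the `gh` CLI is authenticated and that `codex exec -` reads its prompt from stdin (verify the exact flags against your installed CLI’s help):

```shell
# Sketch: fetch a PR's diff, ask Codex to review it, and post the result
# back as a PR comment. Flag usage is an assumption; verify against
# `codex exec --help` and `gh pr --help` before wiring this into CI.
review_pr() {
  pr="$1"
  { echo "Review this diff; flag bugs, risky changes, and missing tests:"
    gh pr diff "$pr"
  } \
    | codex exec - \
    | gh pr comment "$pr" --body-file -
}
```

Hooked to a GitHub Actions `issue_comment` trigger, this approximates the more automatic Claude-style review flow.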

Installation, Limits & Workflows

  • Some hit npm install failures (e.g., around Node version/feature support) and call the tooling “not ready”; others point to high weekly download counts and suggest environment fixes.
  • People want clearer visibility into usage limits to avoid sudden lockouts; Codex quotas feel high to some, unknown/opaque to others.
  • Effective usage patterns described:
    • Using multiple parallel tasks/agents to hide latency, especially in the web UI where Codex manages branches/PRs.
    • Letting Codex handle large refactors or integration work while humans handle mechanical file moves and test-running.
    • Structuring work so agents don’t step on each other; on bare repos, users struggle more with conflicting parallel PRs and duplicated scaffolding.
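
One way to keep parallel agents from stepping on each other is to give each its own git worktree and branch. A minimal sketch (the repo, branch names, and prompts are illustrative, and the `codex exec` calls are shown commented out):

```shell
set -e
# Demo repo so the sketch is self-contained; in practice, start from
# your real checkout instead.
tmp=$(mktemp -d) && cd "$tmp"
git init -q demo && cd demo
git -c user.email=demo@example.com -c user.name=demo \
  commit -q --allow-empty -m "init"

# One worktree + branch per agent keeps their edits and PRs isolated.
git worktree add -b agent/refactor ../wt-refactor
git worktree add -b agent/tests ../wt-tests

# Each agent then runs in its own checkout, e.g.:
#   (cd ../wt-refactor && codex exec "Extract the parser into a package")
#   (cd ../wt-tests && codex exec "Add tests for CLI flag parsing")
git worktree list
```

Because each branch lives in its own directory, parallel runs can’t produce the conflicting PRs and duplicated scaffolding described above.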

General Sentiment

  • Many long‑time Claude/Cursor users are experimenting with or migrating to Codex due to perceived quality and quota advantages.
  • Others remain frustrated by slow performance, poor UX around manual approvals, and the learning curve for effective multi‑agent workflows.