Tracking Copilot vs. Codex vs. Cursor vs. Devin PR Performance

Data quality & interpretation

  • Merge rate is seen as a very coarse metric:
    • Users often don’t even create a PR when an agent’s output is nonsense.
    • “Merged” PRs may be heavily edited, or only partially useful (ideas, scaffolding).
    • Many agent PRs are tiny or documentation-only, inflating apparent success.
  • Different tools create PRs at different points:
    • Some (e.g., Codex) do most iteration privately and only open a PR when the user is happy, biasing merge rates upward.
    • Others (e.g., Copilot agent) open Draft PRs immediately so failures are visible, making merge rates look worse.
  • Commenters want richer dimensions: PR size, refactor vs. dependency bump, test presence, language, complexity, repo popularity, and unique repos/orgs (a rough classification sketch follows this list).
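
As a concrete illustration of those dimensions, here is a minimal Python sketch. It assumes PR metadata already fetched from the GitHub REST API (additions, deletions, merged_at, and the per-PR file list are real fields there); the size thresholds, doc/test heuristics, and label names are invented for illustration.

```python
# Rough PR classifier along the dimensions commenters asked for.
# Assumes `pr` is the dict from GET /repos/{owner}/{repo}/pulls/{n}
# and `files` is the list from .../pulls/{n}/files.
# Thresholds and label names are illustrative, not from the post.

DOC_SUFFIXES = (".md", ".rst", ".txt")
TEST_HINTS = ("test", "spec")

def classify_pr(pr: dict, files: list[dict]) -> dict:
    churn = pr["additions"] + pr["deletions"]
    names = [f["filename"].lower() for f in files]
    return {
        "size": "tiny" if churn <= 10 else "small" if churn <= 100 else "large",
        "docs_only": all(n.endswith(DOC_SUFFIXES) for n in names),
        "has_tests": any(h in n for n in names for h in TEST_HINTS),
        "merged": pr["merged_at"] is not None,
    }

def merge_rate(rows: list[dict]) -> float:
    # The coarse headline metric; the per-PR labels above show why
    # identical merge rates can describe very different work.
    return sum(r["merged"] for r in rows) / max(len(rows), 1)
```

Even this crude labeling separates a merged docs-only typo fix from a merged refactor with tests, which is exactly the distinction commenters say a raw merge rate hides.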

Coverage of tools and attribution

  • Multiple people question the absence of Claude Code and Google Jules.
  • It’s noted that Claude Code can:
    • Run in the background and open PRs via the gh CLI or GitHub Actions.
    • Mark commits with “Generated with Claude Code” / “Co‑Authored‑By: Claude,” trailers that could serve as a search signal (see the sketch after this list).
  • However, Claude Code attribution is configurable and can be disabled, so statistics based on commit text/author may undercount it.
  • Concern about false positives: a human-created branch named codex/my-branch would be misattributed to Codex if the method is purely naming-based.
  • Some argue the omission of Claude Code is serious enough to call the current data “wildly inaccurate.”
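
A sketch of what trailer-based attribution and its failure modes might look like, in Python: the trailer strings and the codex/ branch example come from the thread; the matching rules and the copilot/ pattern are assumptions added for illustration.

```python
import re

# Commit-message trailers mentioned in the thread. Because Claude Code
# attribution is configurable and can be disabled, absence of these
# strings proves nothing (undercounting); presence is a strong signal,
# though not a cryptographic one.
CLAUDE_TRAILERS = (
    "generated with claude code",
    "co-authored-by: claude",
)

def attribute_commit(message: str) -> str | None:
    low = message.lower()
    if any(t in low for t in CLAUDE_TRAILERS):
        return "claude-code"
    return None

# Purely naming-based attribution is weaker: any human can push a
# branch called codex/my-branch, so a match is a hint, not proof.
BRANCH_HINTS = {
    re.compile(r"^codex/"): "codex",
    re.compile(r"^copilot/"): "copilot",  # assumed pattern, for symmetry
}

def attribute_branch(branch: str) -> str | None:
    for pattern, tool in BRANCH_HINTS.items():
        if pattern.match(branch):
            return tool  # possible false positive, per the thread
    return None
```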

UX, workflows, and perceived quality

  • Codex is praised as an “out‑of‑loop” background agent that:
    • Works on its own branch, opens PRs, and is used for cleanup tasks, FIXMEs, docs, and exploration.
    • Feels like an appliance for well-scoped tasks rather than an intrusive IDE integration.
  • Cursor and Windsurf:
    • Some find them more annoying than ChatGPT, saying they disrupt flow and add little beyond existing IDE autocomplete.
    • Many users weren’t aware Cursor can create PRs; its main value is seen as hands-on in-editor assistance, not autonomous PRs.
  • Copilot agent PRs are called “unusable” by at least one commenter, though others from the same ecosystem stress the value of visible Draft PRs.
  • One taxonomy proposed:
    • “Out of loop” autonomous agents (Codex).
    • “In the loop” speed-typing assistants (Cursor/Windsurf), hampered by latency.
    • “Coach mode” (ChatGPT-style), for learning and understanding code.

Experiences with Claude Code

  • Power users describe:
    • Running multiple Claude instances autonomously all day on personal projects.
    • Detailed TASKS/PLAN docs, QUESTIONS.md workflows, and recursive todo lists that improve reliability.
    • Using permissions to auto-approve actions in sandboxed environments.
  • Disagreements on UX:
    • Some complain about constant permission prompts and say it’s not truly autonomous.
    • Others respond that Docker, --dangerously-skip-permissions, and “don’t ask again” options solve this, praising its permission model as best-in-class (a sandboxing sketch follows this list).
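
A sketch of that sandboxed-autonomy pattern, assuming Docker and the Claude Code CLI: --dangerously-skip-permissions comes from the discussion, -p is Claude Code's documented headless/print mode, and the image name is hypothetical.

```python
import subprocess

def run_claude_sandboxed(workdir: str, prompt: str) -> int:
    """Run Claude Code non-interactively in a throwaway container.

    Only the project directory is mounted, so auto-approved actions
    are confined to files the container can see.
    """
    cmd = [
        "docker", "run", "--rm",
        "-e", "ANTHROPIC_API_KEY",   # pass the key through from the host
        "-v", f"{workdir}:/work",
        "-w", "/work",
        "claude-sandbox:latest",     # hypothetical image with the claude CLI
        "claude", "-p", prompt, "--dangerously-skip-permissions",
    ]
    return subprocess.run(cmd).returncode
```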

Legal and licensing concerns

  • Substantial discussion on whether fully AI-generated commits are copyrightable:
    • Cites the US position that protection requires “sufficient human expressive elements.”
    • Raises implications for GPL/copyleft: AI-generated patches might be effectively public domain, yet they end up combined with copyrighted, copyleft code.
  • Speculation about:
    • Using agents plus comprehensive test suites for “clean room” reimplementation of GPL code.
    • The mix of human, machine, and training-data creativity in AI-generated code.
    • Vendors offering indemnity to enterprises in exchange for retaining logs and defending infringement claims.

Additional ideas and critiques

  • Suggestions:
    • Track PRs that include tests as a better quality signal.
    • Analyze by repo stars and unique repos; a ClickHouse query is shared as an example.
    • Have agents cryptographically sign PRs to prevent faked attribution (a verification sketch follows this section).
  • Meta-critique:
    • Some think the sheer Codex PR volume is “pollution”; others argue this is expected given its design goal.
    • Several commenters stress that without understanding human-in-the-loop extent and task difficulty, “performance” rankings are inherently limited.
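
The stars and unique-repos breakdowns are straightforward GROUP BYs once PRs are labeled; the less obvious suggestion is signing. GitHub's REST API already reports signature verification per commit, so a first approximation (a sketch, not the commenters' proposal; only the requests library is assumed) could require every commit on an agent's PR to verify:

```python
import requests

API = "https://api.github.com"

def pr_commits_verified(owner: str, repo: str, number: int, token: str) -> bool:
    """True if every commit on the PR carries a verified signature.

    Uses GET /repos/{owner}/{repo}/pulls/{number}/commits; each item
    exposes commit.verification.verified in the GitHub REST API.
    """
    resp = requests.get(
        f"{API}/repos/{owner}/{repo}/pulls/{number}/commits",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
        timeout=30,
    )
    resp.raise_for_status()
    return all(c["commit"]["verification"]["verified"] for c in resp.json())
```

This only proves that some key signed the commits, not which agent authored them; the per-tool signing keys proposed in the thread would make the attribution itself verifiable.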