Tracking Copilot vs. Codex vs. Cursor vs. Devin PR Performance
Data quality & interpretation
- Merge rate is seen as a very coarse metric:
- Users often don’t even create a PR when an agent’s output is nonsense.
- “Merged” PRs may be heavily edited, or only partially useful (ideas, scaffolding).
- Many agent PRs are tiny or documentation-only, inflating apparent success.
- Different tools create PRs at different points:
- Some (e.g., Codex) do most iteration privately and only open a PR when the user is happy, biasing merge rates upward.
- Others (e.g., Copilot agent) open Draft PRs immediately so failures are visible, making merge rates look worse.
- Commenters want richer dimensions: PR size, refactor vs dependency bump, test presence, language, complexity, repo popularity, unique repos/orgs.
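To make the counting-rule sensitivity concrete, here is a minimal sketch, using hypothetical data and field names rather than the dashboard's actual methodology, that computes merge rate both over all opened PRs and over non-draft PRs only, and breaks it down by test presence and PR size:

```python
from dataclasses import dataclass

@dataclass
class PR:
    agent: str          # tool that opened the PR
    draft: bool         # opened as a Draft PR, so failures are visible
    merged: bool
    has_tests: bool     # PR adds or changes test files
    changed_lines: int

# Hypothetical sample: one tool publishes only vetted work, the other opens
# drafts immediately, so the headline merge rate is not directly comparable.
prs = [
    PR("codex",   draft=False, merged=True,  has_tests=True,  changed_lines=120),
    PR("codex",   draft=False, merged=True,  has_tests=False, changed_lines=4),
    PR("copilot", draft=True,  merged=False, has_tests=False, changed_lines=300),
    PR("copilot", draft=True,  merged=True,  has_tests=True,  changed_lines=80),
]

def merge_rate(rows):
    return sum(r.merged for r in rows) / len(rows) if rows else 0.0

for agent in ("codex", "copilot"):
    mine = [r for r in prs if r.agent == agent]
    non_draft = [r for r in mine if not r.draft]
    tiny = [r for r in mine if r.changed_lines < 10]   # tiny/doc-only PRs inflate apparent success
    print(agent,
          f"all: {merge_rate(mine):.0%}",
          f"non-draft: {merge_rate(non_draft):.0%}",
          f"with tests: {merge_rate([r for r in mine if r.has_tests]):.0%}",
          f"tiny: {len(tiny)}/{len(mine)}")
```

Under the first counting rule, a draft-first workflow looks worse even if the underlying quality is similar, which is exactly the bias commenters describe.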
Coverage of tools and attribution
- Multiple people question the absence of Claude Code and Google Jules.
- It’s noted that Claude Code can:
- Run in the background and use the `gh` CLI and GitHub Actions to open PRs.
- Mark commits with “Generated with Claude Code” / “Co‑Authored‑By: Claude,” which could be used for search.
- However, Claude Code attribution is configurable and can be disabled, so statistics based on commit text/author may undercount it.
- Concern about false positives: branch names like `codex/my-branch` might be incorrectly attributed if the method is purely naming-based (a sketch of such a heuristic follows this list).
- Some argue the omission of Claude Code is serious enough to call the current data “wildly inaccurate.”
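A minimal sketch of what a purely signature-based attribution heuristic looks like, and why it can both undercount and over-count; the trailer patterns, branch prefixes, and helper below are illustrative assumptions, not the site's actual method:

```python
import re

# Hypothetical signals only; the dashboard's real attribution rules are not given
# in the thread. Claude Code's trailer can be disabled, so a missing trailer does
# not prove the tool wasn't used (undercounting), while a human branch named
# codex/my-branch would be counted as Codex (false positive).
COMMIT_TRAILERS = {
    "claude-code": re.compile(r"Co-Authored-By: Claude|Generated with Claude Code", re.I),
    "copilot":     re.compile(r"Co-authored-by: .*copilot", re.I),
}
BRANCH_PREFIXES = {
    "codex":   ("codex/",),
    "copilot": ("copilot/",),
}

def attribute(branch: str, commit_message: str) -> str | None:
    """Return a best-guess agent label, or None if no signal matches."""
    for agent, pattern in COMMIT_TRAILERS.items():
        if pattern.search(commit_message):
            return agent
    for agent, prefixes in BRANCH_PREFIXES.items():
        if branch.startswith(prefixes):
            return agent
    return None

# A human-named branch is misattributed; a trailer-stripped commit is missed.
print(attribute("codex/my-branch", "Refactor config loader"))  # -> "codex" (false positive)
print(attribute("feature/login", "Fix auth bug"))              # -> None (possible undercount)
print(attribute("main", "Add docs\n\nCo-Authored-By: Claude <noreply@anthropic.com>"))  # -> "claude-code"
```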
UX, workflows, and perceived quality
- Codex is praised as an “out‑of‑loop” background agent that:
- Works on its own branch, opens PRs, and is used for cleanup tasks, FIXMEs, docs, and exploration.
- Feels like an appliance for well-scoped tasks rather than an intrusive IDE integration.
- Cursor and Windsurf:
- Some find them more annoying than ChatGPT, saying they disrupt flow and add little beyond existing IDE autocomplete.
- Many users weren’t aware Cursor can create PRs; its main value is seen as hands-on in-editor assistance, not autonomous PRs.
- Copilot agent PRs are called “unusable” by at least one commenter, though others from the same ecosystem stress the value of visible Draft PRs.
- One taxonomy proposed:
- “Out of loop” autonomous agents (Codex).
- “In the loop” speed-typing assistants (Cursor/Windsurf), hampered by latency.
- “Coach mode” (ChatGPT-style), for learning and understanding code.
Experiences with Claude Code
- Power users describe:
- Running multiple Claude instances autonomously all day on personal projects.
- Detailed TASKS/PLAN docs, QUESTIONS.md workflows, and recursive todo lists that improve reliability.
- Using permissions to auto-approve actions in sandboxed environments.
- Disagreements on UX:
- Some complain about constant permission prompts and say it’s not truly autonomous.
- Others respond that Docker, `--dangerously-skip-permissions`, and “don’t ask again” options solve this, praising its permission model as best-in-class.
Legal and licensing concerns
- Substantial discussion on whether fully AI-generated commits are copyrightable:
- Cites a US stance that protection requires “sufficient human expressive elements.”
- Raises implications for GPL/copyleft: AI-generated patches might be effectively public domain but then combined with copyrighted code.
- Speculation about:
- Using agents plus comprehensive test suites for “clean room” reimplementation of GPL code.
- The mix of human, machine, and training-data creativity in AI-generated code.
- Vendors offering indemnity to enterprises in exchange for retaining logs and defending infringement claims.
Additional ideas and critiques
- Suggestions:
- Track PRs that include tests as a better quality signal.
- Analyze by repo stars and unique repos; a ClickHouse query is shared as an example (an illustrative query of this kind is sketched at the end of this section).
- Have agents cryptographically sign PRs to prevent faked attributions.
- Meta-critique:
- Some think the sheer Codex PR volume is “pollution”; others argue this is expected given its design goal.
- Several commenters stress that without understanding human-in-the-loop extent and task difficulty, “performance” rankings are inherently limited.
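Returning to the suggestion above about repo stars and unique repos, here is a sketch of the kind of aggregation involved, assuming a GH Archive-style `github_events` table like the one on the public ClickHouse playground; the connection details, column names, and branch-prefix heuristic are assumptions, and this is not the query shared in the thread:

```python
import clickhouse_connect  # pip install clickhouse-connect

# Assumed connection and schema: host/credentials are placeholders, and the
# columns (event_type, action, repo_name, head_ref, merged, created_at) mirror
# the public ClickHouse playground's github_events dataset but should be verified.
client = clickhouse_connect.get_client(
    host="play.clickhouse.com", username="explorer", password="",
    port=443, secure=True,
)

QUERY = """
SELECT
    multiIf(head_ref LIKE 'codex/%',   'codex',
            head_ref LIKE 'copilot/%', 'copilot',
            'other')              AS agent,     -- naive branch-prefix attribution
    uniqExact(repo_name)          AS unique_repos,
    countIf(merged)               AS merged_prs,
    count()                       AS closed_prs
FROM github_events
WHERE event_type = 'PullRequestEvent'
  AND action = 'closed'
  AND created_at >= now() - INTERVAL 90 DAY
GROUP BY agent
ORDER BY merged_prs DESC
"""

result = client.query(QUERY)
for row in result.result_rows:
    print(row)
```

Joining in a popularity proxy (for example, per-repo WatchEvent counts from the same dataset) would extend this toward the repo-stars breakdown commenters asked for.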