Tracking Copilot vs. Codex vs. Cursor vs. Devin PR Performance

Data quality & interpretation

  • Merge rate is seen as a very coarse metric:
    • Users often don’t even create a PR when an agent’s output is nonsense.
    • “Merged” PRs may be heavily edited, or only partially useful (ideas, scaffolding).
    • Many agent PRs are tiny or documentation-only, inflating apparent success.
  • Different tools create PRs at different points:
    • Some (e.g., Codex) do most iteration privately and only open a PR when the user is happy, biasing merge rates upward.
    • Others (e.g., Copilot agent) open Draft PRs immediately so failures are visible, making merge rates look worse.
  • Commenters want richer dimensions: PR size, refactor vs. dependency bump, test presence, language, complexity, repo popularity, and unique repos/orgs (a rough classification sketch follows this list).
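
As a concrete illustration of those dimensions, here is a minimal Python sketch. It assumes PR metadata already fetched from the GitHub REST API (additions, deletions, merged_at, and the per-PR file list are real fields there); the size thresholds, doc/test heuristics, and label names are invented for illustration.

```python
# Rough PR classifier along the dimensions commenters asked for.
# Assumes `pr` is the dict from GET /repos/{owner}/{repo}/pulls/{n}
# and `files` is the list from .../pulls/{n}/files.
# Thresholds and label names are illustrative, not from the post.

DOC_SUFFIXES = (".md", ".rst", ".txt")
TEST_HINTS = ("test", "spec")

def classify_pr(pr: dict, files: list[dict]) -> dict:
    churn = pr["additions"] + pr["deletions"]
    names = [f["filename"].lower() for f in files]
    return {
        "size": "tiny" if churn <= 10 else "small" if churn <= 100 else "large",
        "docs_only": all(n.endswith(DOC_SUFFIXES) for n in names),
        "has_tests": any(h in n for n in names for h in TEST_HINTS),
        "merged": pr["merged_at"] is not None,
    }

def merge_rate(rows: list[dict]) -> float:
    # The coarse headline metric; the per-PR labels above show why
    # identical merge rates can describe very different work.
    return sum(r["merged"] for r in rows) / max(len(rows), 1)
```

Even this crude labeling separates a merged docs-only typo fix from a merged refactor with tests, which is exactly the distinction commenters say a raw merge rate hides.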

Coverage of tools and attribution

  • Multiple people question the absence of Claude Code and Google Jules.
  • It’s noted that Claude Code can:
    • Run in the background and open PRs via the gh CLI or GitHub Actions.
    • Mark commits with “Generated with Claude Code” / “Co‑Authored‑By: Claude,” trailers that could serve as a search signal (see the sketch after this list).
  • However, Claude Code attribution is configurable and can be disabled, so statistics based on commit text/author may undercount it.
  • Concern about false positives: a human-created branch named codex/my-branch would be misattributed to Codex if the method is purely naming-based.
  • Some argue the omission of Claude Code is serious enough to call the current data “wildly inaccurate.”
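
A sketch of what trailer-based attribution and its failure modes might look like, in Python: the trailer strings and the codex/ branch example come from the thread; the matching rules and the copilot/ pattern are assumptions added for illustration.

```python
import re

# Commit-message trailers mentioned in the thread. Because Claude Code
# attribution is configurable and can be disabled, absence of these
# strings proves nothing (undercounting); presence is a strong signal,
# though not a cryptographic one.
CLAUDE_TRAILERS = (
    "generated with claude code",
    "co-authored-by: claude",
)

def attribute_commit(message: str) -> str | None:
    low = message.lower()
    if any(t in low for t in CLAUDE_TRAILERS):
        return "claude-code"
    return None

# Purely naming-based attribution is weaker: any human can push a
# branch called codex/my-branch, so a match is a hint, not proof.
BRANCH_HINTS = {
    re.compile(r"^codex/"): "codex",
    re.compile(r"^copilot/"): "copilot",  # assumed pattern, for symmetry
}

def attribute_branch(branch: str) -> str | None:
    for pattern, tool in BRANCH_HINTS.items():
        if pattern.match(branch):
            return tool  # possible false positive, per the thread
    return None
```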

UX, workflows, and perceived quality

  • Codex is praised as an “out‑of‑loop” background agent that:
    • Works on its own branch, opens PRs, and is used for cleanup tasks, FIXMEs, docs, and exploration.
    • Feels like an appliance for well-scoped tasks rather than an intrusive IDE integration.
  • Cursor and Windsurf:
    • Some find them more annoying than ChatGPT, saying they disrupt flow and add little beyond existing IDE autocomplete.
    • Many users weren’t aware Cursor can create PRs; its main value is seen as hands-on in-editor assistance, not autonomous PRs.
  • Copilot agent PRs are called “unusable” by at least one commenter, though others from the same ecosystem stress the value of visible Draft PRs.
  • One taxonomy proposed:
    • “Out of loop” autonomous agents (Codex).
    • “In the loop” speed-typing assistants (Cursor/Windsurf), hampered by latency.
    • “Coach mode” (ChatGPT-style), for learning and understanding code.

Experiences with Claude Code

  • Power users describe:
    • Running multiple Claude instances autonomously all day on personal projects.
    • Detailed TASKS/PLAN docs, QUESTIONS.md workflows, and recursive todo lists that improve reliability.
    • Using permissions to auto-approve actions in sandboxed environments.
  • Disagreements on UX:
    • Some complain about constant permission prompts and say it’s not truly autonomous.
    • Others respond that Docker, --dangerously-skip-permissions, and “don’t ask again” options solve this, praising its permission model as best-in-class (a sandboxing sketch follows this list).
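
A sketch of that sandboxed-autonomy pattern, assuming Docker and the Claude Code CLI: --dangerously-skip-permissions comes from the discussion, -p is Claude Code's documented headless/print mode, and the image name is hypothetical.

```python
import subprocess

def run_claude_sandboxed(workdir: str, prompt: str) -> int:
    """Run Claude Code non-interactively in a throwaway container.

    Only the project directory is mounted, so auto-approved actions
    are confined to files the container can see.
    """
    cmd = [
        "docker", "run", "--rm",
        "-e", "ANTHROPIC_API_KEY",   # pass the key through from the host
        "-v", f"{workdir}:/work",
        "-w", "/work",
        "claude-sandbox:latest",     # hypothetical image with the claude CLI
        "claude", "-p", prompt, "--dangerously-skip-permissions",
    ]
    return subprocess.run(cmd).returncode
```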

Legal and licensing concerns

  • Substantial discussion on whether fully AI-generated commits are copyrightable:
    • Cites the US position that protection requires “sufficient human expressive elements.”
    • Raises implications for GPL/copyleft: AI-generated patches might be effectively public domain, yet they end up combined with copyrighted, copyleft code.
  • Speculation about:
    • Using agents plus comprehensive test suites for “clean room” reimplementation of GPL code.
    • The mix of human, machine, and training-data creativity in AI-generated code.
    • Vendors offering indemnity to enterprises in exchange for retaining logs and defending infringement claims.

Additional ideas and critiques

  • Suggestions:
    • Track PRs that include tests as a better quality signal.
    • Analyze by repo stars and unique repos; a ClickHouse query is shared as an example.
    • Have agents cryptographically sign PRs to prevent faked attribution (a verification sketch follows this section).
  • Meta-critique:
    • Some think the sheer Codex PR volume is “pollution”; others argue this is expected given its design goal.
    • Several commenters stress that without understanding human-in-the-loop extent and task difficulty, “performance” rankings are inherently limited.
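
The stars and unique-repos breakdowns are straightforward GROUP BYs once PRs are labeled; the less obvious suggestion is signing. GitHub's REST API already reports signature verification per commit, so a first approximation (a sketch, not the commenters' proposal; only the requests library is assumed) could require every commit on an agent's PR to verify:

```python
import requests

API = "https://api.github.com"

def pr_commits_verified(owner: str, repo: str, number: int, token: str) -> bool:
    """True if every commit on the PR carries a verified signature.

    Uses GET /repos/{owner}/{repo}/pulls/{number}/commits; each item
    exposes commit.verification.verified in the GitHub REST API.
    """
    resp = requests.get(
        f"{API}/repos/{owner}/{repo}/pulls/{number}/commits",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
        timeout=30,
    )
    resp.raise_for_status()
    return all(c["commit"]["verification"]["verified"] for c in resp.json())
```

This only proves that some key signed the commits, not which agent authored them; the per-tool signing keys proposed in the thread would make the attribution itself verifiable.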