How we made our AI code review bot stop leaving nitpicky comments
Approach to reducing nitpicks (embeddings & KNN)
- Many commenters find the final solution (embedding comments and doing KNN-style similarity filtering) plausible, even if “hacky.”
- Some note this is effectively a simple classifier; suggest trying other ML models (random forest, XGBoost, small neural nets) on top of embeddings.
- Idea of a “universal nit” via averaging embeddings across customers is proposed; authors say they’ll try it and already combine upvoted/downvoted sets to reduce false positives.
- Concern raised that clustering might incorrectly suppress comments about specific modules/classes if many prior comments there were downvoted.
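The embed-and-filter approach discussed above can be sketched roughly as follows. This is a minimal illustration, not the article's implementation: the toy 2-D "embeddings", the vote history format, and the threshold are all hypothetical, and a real system would use an embedding model and a proper vector index.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def should_suppress(comment_vec, history, k=5, threshold=0.6):
    """Suppress a new comment if most of its k nearest past comments
    were downvoted. `history` is a list of (embedding, was_downvoted)
    pairs collected from prior reviews."""
    neighbors = sorted(history, key=lambda h: cosine(comment_vec, h[0]),
                       reverse=True)[:k]
    downvoted = sum(1 for _, bad in neighbors if bad)
    return downvoted / len(neighbors) >= threshold

# Toy 2-D "embeddings": nit-like comments cluster near (1, 0),
# substantive ones near (0, 1).
history = [
    ((0.9, 0.1), True), ((1.0, 0.0), True), ((0.8, 0.2), True),
    ((0.1, 0.9), False), ((0.0, 1.0), False),
]
print(should_suppress((0.95, 0.05), history, k=3))  # True: lands in the nit cluster
print(should_suppress((0.05, 0.95), history, k=3))  # False
```

The module-suppression concern in the last bullet falls out naturally here: if most prior comments on one module were downvoted for unrelated reasons, new comments embedding near them get filtered too.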
Prompting vs post-hoc filtering
- Several argue the problem “should” be solvable via better prompting, including:
  - Clearer definitions instead of “nits” (e.g., “stylistic/pedantic/trivial comments”).
  - Explicit severity labels at the end of responses.
  - Chain-of-thought plus tagging nitpicks for removal in a second pass.
- Others report similar experiments with severity scores and LLM-as-judge that still misclassified important issues as nitpicks.
- Discussion of known failure modes: action bias, long-context confusion, ambiguous wording, conflicting instructions.
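The severity-label-plus-second-pass idea from the bullets above might look like the sketch below. Everything here is a hypothetical shape for the experiment, not anyone's production code: the prompt wording, the tag format, and the comment strings are invented, and the first pass would come from whatever completion API is in use.

```python
import re

# First pass: ask the model to end every comment with an explicit
# severity tag. (Prompt text is illustrative only.)
REVIEW_PROMPT = """Review the diff below. End every comment with a severity
tag on its own line: [severity: blocker|major|minor|nit].
Do not leave stylistic or pedantic comments unless they affect correctness.

{diff}"""

SEVERITY_RE = re.compile(r"\[severity:\s*(blocker|major|minor|nit)\]", re.I)

def filter_nits(comments):
    """Second pass: drop comments the model itself tagged as a nit.
    Comments with no parseable tag are kept, so the filter fails open."""
    kept = []
    for c in comments:
        m = SEVERITY_RE.search(c)
        if m and m.group(1).lower() == "nit":
            continue
        kept.append(c)
    return kept

comments = [
    "Possible off-by-one in the loop bound.\n[severity: major]",
    "Prefer single quotes here.\n[severity: nit]",
    "Missing null check on user input.\n[severity: blocker]",
]
print(filter_nits(comments))  # keeps the major and blocker comments
```

As the commenters report, the weak point is the labeling itself: if the model misclassifies an important issue as `nit`, this filter silently discards it, which is exactly the failure mode described above.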
What counts as a nitpick?
- Strong disagreement on whether the article’s example is actually a nitpick; some see it as important for long-term maintainability.
- Many emphasize that nitpickiness is context- and company-dependent, and even the same comment can be trivial in one PR and crucial in another.
- Some suspect ego or “ship fast” culture may drive pressure to label valid criticism as nitpicking.
Usefulness of AI code review bots
- Mixed views:
  - Supporters see value as a first-pass “extra pair of eyes” that catches style, duplication, and obvious problems before human review.
  - Critics report high noise, hallucinated issues, and little real benefit compared to linters and human review; fear juniors over-trusting AI feedback.
- Several argue code review is precisely where human judgment, knowledge-sharing, and mentoring are most important.
Linters, alternatives, and metrics
- Debate whether AI review adds more than well-tuned linters/formatters; proponents point to more nuanced, context-aware rules.
- Others say overly complex rules are themselves the problem.
- Metric choice criticized: “percentage of comments addressed” may reward leaving fewer comments; suggestions include normalizing by files or lines changed.
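A toy calculation shows why "percentage of comments addressed" is gameable and how normalizing by change size alters the ranking. The field names and numbers are made up for illustration.

```python
# A bot that leaves one safe comment maxes out "percent addressed",
# while a bot that surfaces ten issues and gets seven fixed scores lower.
# Normalizing by lines changed rewards surfacing issues instead of
# staying quiet. (Field names are hypothetical.)

def pct_addressed(review):
    return review["addressed"] / review["comments"]

def addressed_per_kloc(review):
    # Addressed comments per 1,000 changed lines.
    return review["addressed"] / review["lines_changed"] * 1000

quiet_bot = {"comments": 1, "addressed": 1, "lines_changed": 500}
chatty_bot = {"comments": 10, "addressed": 7, "lines_changed": 500}

print(pct_addressed(quiet_bot), pct_addressed(chatty_bot))            # 1.0 0.7
print(addressed_per_kloc(quiet_bot), addressed_per_kloc(chatty_bot))  # 2.0 14.0
```

Under the first metric the quiet bot "wins"; under the normalized one, the bot that actually found addressable problems does.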
Pricing & incentives
- Some see the quoted per-file/per-dev pricing as expensive, especially for lower-wage markets or cash-strapped orgs; others note it’s a small fraction of developer salary.
- Commenters note that per-token billing may bias models toward verbosity, though competition and user instructions can push toward concision.