How we made our AI code review bot stop leaving nitpicky comments
Approach to reducing nitpicks (embeddings & KNN)
- Many commenters find the final solution (embedding comments and doing KNN-style similarity filtering) plausible, even if “hacky.”
- Some note this is effectively a simple classifier; suggest trying other ML models (random forest, XGBoost, small neural nets) on top of embeddings.
- Idea of a “universal nit” via averaging embeddings across customers is proposed; authors say they’ll try it and already combine upvoted/downvoted sets to reduce false positives.
- Concern raised that clustering might incorrectly suppress comments about specific modules/classes if many prior comments there were downvoted.
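The embed-and-filter approach discussed above can be sketched roughly as follows. This is a minimal illustration, not the article's implementation: the toy 2-D "embeddings", the vote history format, and the threshold are all hypothetical, and a real system would use an embedding model and a proper vector index.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def should_suppress(comment_vec, history, k=5, threshold=0.6):
    """Suppress a new comment if most of its k nearest past comments
    were downvoted. `history` is a list of (embedding, was_downvoted)
    pairs collected from prior reviews."""
    neighbors = sorted(history, key=lambda h: cosine(comment_vec, h[0]),
                       reverse=True)[:k]
    downvoted = sum(1 for _, bad in neighbors if bad)
    return downvoted / len(neighbors) >= threshold

# Toy 2-D "embeddings": nit-like comments cluster near (1, 0),
# substantive ones near (0, 1).
history = [
    ((0.9, 0.1), True), ((1.0, 0.0), True), ((0.8, 0.2), True),
    ((0.1, 0.9), False), ((0.0, 1.0), False),
]
print(should_suppress((0.95, 0.05), history, k=3))  # True: lands in the nit cluster
print(should_suppress((0.05, 0.95), history, k=3))  # False
```

The module-suppression concern in the last bullet falls out naturally here: if most prior comments on one module were downvoted for unrelated reasons, new comments embedding near them get filtered too.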
Prompting vs post-hoc filtering
- Several argue the problem “should” be solvable via better prompting, including:
  - Clearer definitions instead of “nits” (e.g., “stylistic/pedantic/trivial comments”).
  - Explicit severity labels at the end of responses.
  - Chain-of-thought plus tagging nitpicks for removal in a second pass.
- Others report similar experiments with severity scores and LLM-as-judge that still misclassified important issues as nitpicks.
- Discussion of known failure modes: action bias, long-context confusion, ambiguous wording, conflicting instructions.
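The severity-label-plus-second-pass idea from the bullets above might look like the sketch below. Everything here is a hypothetical shape for the experiment, not anyone's production code: the prompt wording, the tag format, and the comment strings are invented, and the first pass would come from whatever completion API is in use.

```python
import re

# First pass: ask the model to end every comment with an explicit
# severity tag. (Prompt text is illustrative only.)
REVIEW_PROMPT = """Review the diff below. End every comment with a severity
tag on its own line: [severity: blocker|major|minor|nit].
Do not leave stylistic or pedantic comments unless they affect correctness.

{diff}"""

SEVERITY_RE = re.compile(r"\[severity:\s*(blocker|major|minor|nit)\]", re.I)

def filter_nits(comments):
    """Second pass: drop comments the model itself tagged as a nit.
    Comments with no parseable tag are kept, so the filter fails open."""
    kept = []
    for c in comments:
        m = SEVERITY_RE.search(c)
        if m and m.group(1).lower() == "nit":
            continue
        kept.append(c)
    return kept

comments = [
    "Possible off-by-one in the loop bound.\n[severity: major]",
    "Prefer single quotes here.\n[severity: nit]",
    "Missing null check on user input.\n[severity: blocker]",
]
print(filter_nits(comments))  # keeps the major and blocker comments
```

As the commenters report, the weak point is the labeling itself: if the model misclassifies an important issue as `nit`, this filter silently discards it, which is exactly the failure mode described above.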
What counts as a nitpick?
- Strong disagreement on whether the article’s example is actually a nitpick; some see it as important for long-term maintainability.
- Many emphasize that nitpickiness is context- and company-dependent, and even the same comment can be trivial in one PR and crucial in another.
- Some suspect ego or “ship fast” culture may drive pressure to label valid criticism as nitpicking.
Usefulness of AI code review bots
- Mixed views:
  - Supporters see value as a first-pass “extra pair of eyes” that catches style, duplication, and obvious problems before human review.
  - Critics report high noise, hallucinated issues, and little real benefit compared to linters and human review; fear juniors over-trusting AI feedback.
- Several argue code review is precisely where human judgment, knowledge-sharing, and mentoring are most important.
Linters, alternatives, and metrics
- Debate whether AI review adds more than well-tuned linters/formatters; proponents point to more nuanced, context-aware rules.
- Others say overly complex rules are themselves the problem.
- Metric choice criticized: “percentage of comments addressed” may reward leaving fewer comments; suggestions include normalizing by files or lines changed.
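A toy calculation shows why "percentage of comments addressed" is gameable and how normalizing by change size alters the ranking. The field names and numbers are made up for illustration.

```python
# A bot that leaves one safe comment maxes out "percent addressed",
# while a bot that surfaces ten issues and gets seven fixed scores lower.
# Normalizing by lines changed rewards surfacing issues instead of
# staying quiet. (Field names are hypothetical.)

def pct_addressed(review):
    return review["addressed"] / review["comments"]

def addressed_per_kloc(review):
    # Addressed comments per 1,000 changed lines.
    return review["addressed"] / review["lines_changed"] * 1000

quiet_bot = {"comments": 1, "addressed": 1, "lines_changed": 500}
chatty_bot = {"comments": 10, "addressed": 7, "lines_changed": 500}

print(pct_addressed(quiet_bot), pct_addressed(chatty_bot))            # 1.0 0.7
print(addressed_per_kloc(quiet_bot), addressed_per_kloc(chatty_bot))  # 2.0 14.0
```

Under the first metric the quiet bot "wins"; under the normalized one, the bot that actually found addressable problems does.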
Pricing & incentives
- Some see the quoted per-file/per-dev pricing as expensive, especially for lower-wage markets or cash-strapped orgs; others note it’s a small fraction of developer salary.
- Commenters note that per-token billing may bias models toward verbosity, though competition and user instructions can push toward concision.