How we made our AI code review bot stop leaving nitpicky comments

Approach to reducing nitpicks (embeddings & KNN)

  • Many commenters find the final solution (embedding comments and doing KNN-style similarity filtering) plausible, even if “hacky.”
  • Some note this is effectively a simple classifier; suggest trying other ML models (random forest, XGBoost, small neural nets) on top of embeddings.
  • One commenter proposes deriving a “universal nit” embedding by averaging nit embeddings across customers; the authors say they will try it, noting they already combine upvoted and downvoted comment sets to reduce false positives.
  • Concern raised that clustering might incorrectly suppress comments about specific modules/classes if many prior comments there were downvoted.
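The filtering approach commenters are reacting to can be sketched roughly as follows. This is a minimal illustration, not the company's actual implementation: it assumes comments have already been embedded into vectors (by some unspecified embedding model), and suppresses a candidate comment when its nearest labeled neighbors are mostly downvoted past comments.

```python
from math import sqrt

def cosine_sim(a, b):
    """Cosine similarity between two equal-length float vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def should_suppress(candidate, downvoted, upvoted, k=3, threshold=0.8):
    """KNN-style vote over past comment embeddings: suppress the
    candidate if a majority of its k nearest neighbors are downvoted
    comments that are also sufficiently similar. Mixing in upvoted
    comments (per the authors' description) guards against suppressing
    comments that merely resemble a past nit superficially."""
    scored = [(cosine_sim(candidate, v), "down") for v in downvoted]
    scored += [(cosine_sim(candidate, v), "up") for v in upvoted]
    scored.sort(reverse=True)
    top_k = scored[:k]
    down_votes = sum(1 for sim, label in top_k
                     if label == "down" and sim >= threshold)
    return down_votes > k // 2
```

Note this is also where the module-clustering concern bites: if all the downvoted neighbors happen to come from one module, any new comment about that module may land inside the suppression region regardless of its merit.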

Prompting vs post-hoc filtering

  • Several argue the problem “should” be solvable via better prompting, including:
    • Clearer definitions instead of “nits” (e.g., “stylistic/pedantic/trivial comments”).
    • Explicit severity labels at the end of responses.
    • Chain-of-thought plus tagging nitpicks for removal in a second pass.
  • Others report similar experiments with severity scores and LLM-as-judge that still misclassified important issues as nitpicks.
  • Discussion of known failure modes: action bias, long-context confusion, ambiguous wording, conflicting instructions.
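The severity-label idea commenters describe amounts to a two-pass scheme: prompt the model to end each comment with an explicit severity tag, then strip low-severity comments mechanically. A minimal sketch of the second pass, assuming a hypothetical tag format like `[severity: blocking|important|nit]` (the exact labels and format are illustrative, not from the article):

```python
import re

# Assumed tag the review prompt asked the model to append to each
# comment, e.g. "Consider extracting this. [severity: nit]".
SEVERITY_RE = re.compile(
    r"\[severity:\s*(blocking|important|nit)\]\s*$", re.IGNORECASE
)

def filter_nits(comments):
    """Drop comments the model itself tagged as nits; keep everything
    else, including untagged comments (fail open rather than silently
    discarding feedback the model forgot to label)."""
    kept = []
    for comment in comments:
        match = SEVERITY_RE.search(comment.strip())
        severity = match.group(1).lower() if match else "important"
        if severity != "nit":
            kept.append(comment)
    return kept
```

As the thread notes, the weak point is upstream of this code: the same failure modes (action bias, ambiguous wording) that produce nitpicks can also produce wrong severity labels, so an important issue tagged `nit` is silently dropped.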

What counts as a nitpick?

  • Strong disagreement on whether the article’s example is actually a nitpick; some see it as important for long-term maintainability.
  • Many emphasize that nitpickiness is context- and company-dependent, and even the same comment can be trivial in one PR and crucial in another.
  • Some suspect ego or “ship fast” culture may drive pressure to label valid criticism as nitpicking.

Usefulness of AI code review bots

  • Mixed views:
    • Supporters see value as a first-pass “extra pair of eyes” that catches style, duplication, and obvious problems before human review.
    • Critics report high noise, hallucinated issues, and little real benefit over linters and human review, and fear juniors will over-trust AI feedback.
    • Several argue code review is precisely where human judgment, knowledge-sharing, and mentoring are most important.

Linters, alternatives, and metrics

  • Debate whether AI review adds more than well-tuned linters/formatters; proponents point to more nuanced, context-aware rules.
  • Others say overly complex rules are themselves the problem.
  • The article’s metric is criticized: “percentage of comments addressed” can reward a bot that leaves fewer, safer comments; suggestions include normalizing by files or lines changed.
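The metric objection is easy to make concrete. A toy comparison (hypothetical numbers, not from the article) shows how a raw addressed-rate flatters a bot that comments sparsely, while normalizing by diff size does not:

```python
def review_metrics(comments_left, comments_addressed, lines_changed):
    """Return (addressed_rate, addressed_per_kloc).

    addressed_rate is the article's metric and is gameable: leaving
    only a few safe comments drives it toward 1.0. Normalizing
    addressed comments by lines changed measures how much useful
    feedback the bot produced per unit of diff instead."""
    addressed_rate = (
        comments_addressed / comments_left if comments_left else 0.0
    )
    addressed_per_kloc = (
        1000 * comments_addressed / lines_changed if lines_changed else 0.0
    )
    return addressed_rate, addressed_per_kloc
```

For a 500-line diff, a timid bot leaving 2 comments (both addressed) scores a perfect 1.0 addressed-rate but only 4 addressed comments per 1,000 lines; a chattier bot with 15 of 20 addressed scores 0.75 but 30 per 1,000 lines.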

Pricing & incentives

  • Some see the quoted per-file/per-dev pricing as expensive, especially for lower-wage markets or cash-strapped orgs; others note it’s a small fraction of developer salary.
  • Commenters highlight that LLMs being billed per token may bias toward verbosity, though competition and user instructions can push toward concision.