The behavior of LLMs in hiring decisions: Systemic biases in candidate selection

Observed biases in LLM hiring experiments

  • Commenters focus on two main findings from the article:
    • A consistent preference for female candidates when CV quality is controlled and genders are swapped.
    • A strong positional bias toward the candidate listed first in the prompt.
  • Several note that these are statistically robust results (tens of thousands of trials), not random variation.
  • Grok and other major models reportedly show similar patterns; DeepSeek V3 is mentioned as somewhat less biased in the tests.

Debate on causes of gender bias

  • One camp argues the models may be reflecting real-world hiring trends (e.g., efforts to rebalance male-heavy fields), or underlying “left-leaning” cultural norms baked into text data.
  • Others think the effect is more likely from post-training alignment/RLHF, which aggressively avoids discrimination and may overcorrect toward favoring women and candidates who list pronouns.
  • There’s extended back-and-forth over empirical studies of gender bias in academia, with participants citing conflicting papers and pointing out publication bias and narrative-driven citation patterns.

Positional / context biases

  • The first-candidate preference is widely seen as the most alarming technical flaw: it suggests LLMs don’t evenly weight context and can base “decisions” on trivial ordering.
  • People link this to known “lost in the middle” issues and warn that RAG and classification systems may be quietly influenced by such artifacts.
  • Some propose simple mitigations (randomizing candidate order, aggregating multiple runs, blinding prompts; see the sketch below), but note that most real HR users won't bother.
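
The mitigation commenters describe can be sketched in a few lines: randomize which candidate appears first, query the model several times, and take a majority vote. This is a minimal illustration rather than anything from the article; `llm_call`, `pick_candidate`, and the prompt wording are hypothetical placeholders for whatever client and phrasing a real screening pipeline would use.

```python
import random
from collections import Counter

def pick_candidate(llm_call, candidate_a: str, candidate_b: str, n_runs: int = 10) -> str:
    """Compare two CVs repeatedly with randomized presentation order,
    then take a majority vote over the model's answers.

    `llm_call(prompt) -> str` is a placeholder for the caller's actual
    model client; it should return the model's raw reply.
    """
    votes = Counter()
    for _ in range(n_runs):
        # Shuffle which CV is shown first so positional bias washes out;
        # the labels A/B stay tied to the same candidate throughout.
        first, second = random.sample(
            [("A", candidate_a), ("B", candidate_b)], k=2
        )
        prompt = (
            "You are screening two anonymized CVs for the same role.\n\n"
            f"Candidate {first[0]}:\n{first[1]}\n\n"
            f"Candidate {second[0]}:\n{second[1]}\n\n"
            "Reply with exactly one letter, A or B, for the stronger CV."
        )
        answer = llm_call(prompt).strip().upper()
        if answer in ("A", "B"):
            votes[answer] += 1
    if not votes:
        return "no valid answers"
    return votes.most_common(1)[0][0]
```

Blinding (stripping names, pronouns, and other demographic cues before the prompt is built) would slot in ahead of the comparison, and a near 50/50 vote split is itself a useful signal that the model has no stable, grounded preference between the two CVs.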

Suitability of LLMs for hiring

  • Many argue LLMs should not be used to make hiring decisions at all, only to summarize CVs, translate jargon, or generate pro/con notes for human reviewers.
  • Others emphasize that LLM outputs are articulate but do not reflect grounded reasoning, and that any numeric scores or probabilities they produce are especially untrustworthy.
  • There’s concern that companies may use biased AI as a “liability shield” for discriminatory outcomes.

Training data, politics, and gaming the system

  • Multiple comments discuss a systemic leftward bias stemming from the overrepresentation of certain professions and platforms in training data, which alignment then further amplifies.
  • Some suggest synthetic data and filtering might reduce the extremes, but worry that self-generated data could reinforce existing bias.
  • Recruiters report that LLMs tend to favor LLM-written “optimized” resumes, and people speculate about adversarial tricks (e.g., invisible text in PDFs) to manipulate AI screening.