The behavior of LLMs in hiring decisions: Systemic biases in candidate selection
Observed biases in LLM hiring experiments
- Commenters focus on two main findings from the article:
  - A consistent preference for female candidates when CV quality is controlled and genders are swapped.
  - A strong positional bias toward the candidate listed first in the prompt.
- Several commenters note that, with tens of thousands of trials, these are statistically robust results rather than random variation (a rough significance check is sketched after this list).
- Grok and other major models reportedly show similar patterns; DeepSeek V3 is mentioned as somewhat less biased in the tests.
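A minimal sketch of what a paired gender-swap trial and its significance check could look like. This is not the article's methodology; `ask_model()`, the name placeholders, and the `{NAME}` template token are assumptions for illustration only.

```python
# Minimal sketch of a paired gender-swap trial plus a significance check.
# Not the article's code: ask_model(), the candidate names, and the "{NAME}"
# template token are placeholders, not real APIs or data.
import math
import random

def ask_model(cv_a: str, cv_b: str) -> str:
    """Hypothetical LLM call: returns 'A' or 'B' for the preferred candidate."""
    raise NotImplementedError("wire up a real model/API here")

def gender_swap_trial(cv_template: str) -> bool:
    """Present the same CV with a female and a male name in randomized order;
    return True if the female-named version is chosen."""
    female_cv = cv_template.replace("{NAME}", "Emily Carter")
    male_cv = cv_template.replace("{NAME}", "James Carter")
    if random.random() < 0.5:          # randomize position as well as gender
        return ask_model(female_cv, male_cv) == "A"
    return ask_model(male_cv, female_cv) == "B"

def binomial_z(successes: int, trials: int, p0: float = 0.5) -> float:
    """Normal-approximation z-score against the 'no preference' null."""
    se = math.sqrt(p0 * (1 - p0) / trials)
    return (successes / trials - p0) / se

# With tens of thousands of trials, even a 52/48 split is far outside chance
# (e.g. n = 20,000 gives z ~ 5.7), which is why commenters treat the reported
# skew as signal rather than noise.
```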
Debate on causes of gender bias
- One camp argues the models may be reflecting real-world hiring trends (e.g., efforts to rebalance male-heavy fields) or “left-leaning” cultural norms baked into the training text.
- Others attribute the effect to post-training alignment/RLHF, which penalizes anything resembling discrimination so aggressively that it may overcorrect toward favoring women and candidates who display pronouns.
- There’s extended back-and-forth over empirical studies of gender bias in academia, with participants citing conflicting papers and pointing out publication bias and narrative-driven citation patterns.
Positional / context biases
- The first-candidate preference is widely seen as the most alarming technical flaw: it suggests LLMs don’t evenly weight context and can base “decisions” on trivial ordering.
- People link this to known “lost in the middle” issues and warn that RAG and classification systems may be quietly influenced by such artifacts.
- Some propose simple mitigations (randomizing candidate order, multiple runs, blind prompts), sketched below, but note that most real HR users won't bother.
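A minimal sketch of the order-randomization mitigation, again using a hypothetical `ask_model()` stub rather than any particular API: shuffle which candidate appears first on each run, repeat, and aggregate the votes so positional bias averages out instead of silently deciding the outcome.

```python
# Minimal sketch of the order-randomization mitigation discussed in the thread.
# ask_model() is the same hypothetical stub as in the earlier sketch, not a real API.
import random
from collections import Counter

def ask_model(cv_a: str, cv_b: str) -> str:
    """Hypothetical LLM call: returns 'A' or 'B' for the preferred candidate."""
    raise NotImplementedError("wire up a real model/API here")

def compare_pair(cv_1: str, cv_2: str, runs: int = 10) -> Counter:
    """Compare two anonymized CVs over several order-randomized runs and
    return the vote tally per candidate."""
    votes: Counter = Counter()
    for _ in range(runs):
        if random.random() < 0.5:
            winner = "cv_1" if ask_model(cv_1, cv_2) == "A" else "cv_2"
        else:
            winner = "cv_2" if ask_model(cv_2, cv_1) == "A" else "cv_1"
        votes[winner] += 1
    return votes

# A candidate who only "wins" when listed first will not dominate the tally;
# the thread's caveat stands, though: most real HR users won't bother with this.
```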
Suitability of LLMs for hiring
- Many argue LLMs should not be used to make hiring decisions at all, only to summarize CVs, translate jargon, or generate pro/con notes for human reviewers.
- Others emphasize that LLM outputs read as articulate but are not grounded reasoning, and that the numeric scores or probabilities they produce are especially untrustworthy.
- There’s concern that companies may use biased AI as a “liability shield” for discriminatory outcomes.
Training data, politics, and gaming the system
- Multiple comments discuss a systemic leftward bias stemming from the overrepresentation of certain professions and platforms in training data, which alignment then amplifies further.
- Some suggest synthetic data and filtering might reduce the extremes, but worry that self-generated data could reinforce existing biases.
- Recruiters report that LLMs tend to favor LLM-written “optimized” resumes, and people speculate about adversarial tricks (invisible text in PDFs) to manipulate AI screening.