The behavior of LLMs in hiring decisions: Systemic biases in candidate selection
Observed biases in LLM hiring experiments
- Commenters focus on two main findings from the article:
  - A consistent preference for female candidates when CV quality is controlled and genders are swapped.
  - A strong positional bias toward the candidate listed first in the prompt.
- Several commenters note that, with tens of thousands of trials, these are statistically robust results rather than random variation (a rough significance check is sketched after this list).
- Grok and other major models reportedly show similar patterns; DeepSeek V3 is mentioned as somewhat less biased in the tests.
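A minimal sketch of what a paired gender-swap trial and its significance check could look like. This is not the article's methodology; `ask_model()`, the name placeholders, and the `{NAME}` template token are assumptions for illustration only.

```python
# Minimal sketch of a paired gender-swap trial plus a significance check.
# Not the article's code: ask_model(), the candidate names, and the "{NAME}"
# template token are placeholders, not real APIs or data.
import math
import random

def ask_model(cv_a: str, cv_b: str) -> str:
    """Hypothetical LLM call: returns 'A' or 'B' for the preferred candidate."""
    raise NotImplementedError("wire up a real model/API here")

def gender_swap_trial(cv_template: str) -> bool:
    """Present the same CV with a female and a male name in randomized order;
    return True if the female-named version is chosen."""
    female_cv = cv_template.replace("{NAME}", "Emily Carter")
    male_cv = cv_template.replace("{NAME}", "James Carter")
    if random.random() < 0.5:          # randomize position as well as gender
        return ask_model(female_cv, male_cv) == "A"
    return ask_model(male_cv, female_cv) == "B"

def binomial_z(successes: int, trials: int, p0: float = 0.5) -> float:
    """Normal-approximation z-score against the 'no preference' null."""
    se = math.sqrt(p0 * (1 - p0) / trials)
    return (successes / trials - p0) / se

# With tens of thousands of trials, even a 52/48 split is far outside chance
# (e.g. n = 20,000 gives z ~ 5.7), which is why commenters treat the reported
# skew as signal rather than noise.
```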
Debate on causes of gender bias
- One camp argues the models may be reflecting real-world hiring trends (e.g., efforts to rebalance male-heavy fields) or “left-leaning” cultural norms baked into the training text.
- Others attribute the effect to post-training alignment/RLHF, which penalizes anything resembling discrimination so aggressively that it may overcorrect toward favoring women and candidates who display pronouns.
- There’s extended back-and-forth over empirical studies of gender bias in academia, with participants citing conflicting papers and pointing out publication bias and narrative-driven citation patterns.
Positional / context biases
- The first-candidate preference is widely seen as the most alarming technical flaw: it suggests LLMs don’t evenly weight context and can base “decisions” on trivial ordering.
- People link this to known “lost in the middle” issues and warn that RAG and classification systems may be quietly influenced by such artifacts.
- Some propose simple mitigations (randomizing candidate order, multiple runs, blind prompts), sketched below, but note that most real HR users won't bother.
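A minimal sketch of the order-randomization mitigation, again using a hypothetical `ask_model()` stub rather than any particular API: shuffle which candidate appears first on each run, repeat, and aggregate the votes so positional bias averages out instead of silently deciding the outcome.

```python
# Minimal sketch of the order-randomization mitigation discussed in the thread.
# ask_model() is the same hypothetical stub as in the earlier sketch, not a real API.
import random
from collections import Counter

def ask_model(cv_a: str, cv_b: str) -> str:
    """Hypothetical LLM call: returns 'A' or 'B' for the preferred candidate."""
    raise NotImplementedError("wire up a real model/API here")

def compare_pair(cv_1: str, cv_2: str, runs: int = 10) -> Counter:
    """Compare two anonymized CVs over several order-randomized runs and
    return the vote tally per candidate."""
    votes: Counter = Counter()
    for _ in range(runs):
        if random.random() < 0.5:
            winner = "cv_1" if ask_model(cv_1, cv_2) == "A" else "cv_2"
        else:
            winner = "cv_2" if ask_model(cv_2, cv_1) == "A" else "cv_1"
        votes[winner] += 1
    return votes

# A candidate who only "wins" when listed first will not dominate the tally;
# the thread's caveat stands, though: most real HR users won't bother with this.
```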
Suitability of LLMs for hiring
- Many argue LLMs should not be used to make hiring decisions at all, only to summarize CVs, translate jargon, or generate pro/con notes for human reviewers.
- Others emphasize that LLM outputs read as articulate but are not grounded reasoning, and that the numeric scores or probabilities they produce are especially untrustworthy.
- There’s concern that companies may use biased AI as a “liability shield” for discriminatory outcomes.
Training data, politics, and gaming the system
- Multiple comments discuss a systemic leftward bias stemming from the overrepresentation of certain professions and platforms in training data, which alignment then amplifies further.
- Some suggest synthetic data and filtering might reduce the extremes, but worry that self-generated data could reinforce existing biases.
- Recruiters report that LLMs tend to favor LLM-written “optimized” resumes, and people speculate about adversarial tricks (invisible text in PDFs) to manipulate AI screening.