New benchmark shows top LLMs struggle in real mental health care
Benchmark design & main findings
- MindEval simulates multi-turn patient–clinician conversations and scores them along multiple clinical dimensions on a 1–6 scale (a minimal sketch of this loop follows the list below).
- All frontier models tested (including the latest GPT, Claude, and Gemini models) averaged below 4/6, with performance worsening for severe symptoms and in longer (40-turn) conversations.
- Larger or “reasoning” models did not consistently beat smaller ones on therapeutic quality.
- Patient simulations and an LLM “judge” were calibrated and shown to have medium–high correlation with human clinician ratings, according to the authors.
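The evaluation structure described above is easy to picture as a loop: a simulated patient and the clinician model under test alternate messages for a fixed number of turns, and an LLM judge then scores the transcript on each clinical dimension from 1 to 6. The sketch below only illustrates that structure; it is not the authors' released code. The `ChatFn` type, the `run_episode`/`judge_transcript` helpers, and the rubric dimension names are all invented here, and a real harness would wrap actual LLM API calls instead of the stubs.

```python
from typing import Callable

# A chat backend: takes the conversation so far (role/content dicts), returns a reply string.
ChatFn = Callable[[list[dict[str, str]]], str]

# Hypothetical rubric dimensions; the paper defines its own set.
DIMENSIONS = ["therapeutic_alliance", "safety", "technique"]


def run_episode(patient: ChatFn, clinician: ChatFn, n_turns: int) -> list[dict[str, str]]:
    """Alternate simulated-patient and clinician-under-test messages for n_turns exchanges."""
    transcript: list[dict[str, str]] = []
    for _ in range(n_turns):
        transcript.append({"role": "patient", "content": patient(transcript)})
        transcript.append({"role": "clinician", "content": clinician(transcript)})
    return transcript


def judge_transcript(judge: ChatFn, transcript: list[dict[str, str]]) -> dict[str, int]:
    """Ask the judge model for a 1-6 integer per dimension; clamp anything out of range."""
    scores: dict[str, int] = {}
    for dim in DIMENSIONS:
        rubric_msg = {
            "role": "system",
            "content": (
                f"Rate the clinician's {dim} in the conversation above on a scale of "
                "1 (poor) to 6 (excellent). Reply with a single integer."
            ),
        }
        raw = judge(transcript + [rubric_msg])
        scores[dim] = min(6, max(1, int(raw.strip())))
    return scores


# Stub backends so the sketch runs without any API; real runs would plug in LLM calls here.
patient_stub: ChatFn = lambda msgs: "I haven't slept properly in weeks and I feel on edge."
clinician_stub: ChatFn = lambda msgs: "That sounds exhausting. When did you first notice it?"
judge_stub: ChatFn = lambda msgs: "3"

transcript = run_episode(patient_stub, clinician_stub, n_turns=40)
print(judge_transcript(judge_stub, transcript))  # {'therapeutic_alliance': 3, 'safety': 3, 'technique': 3}
```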
Prompting & evaluation methodology
- The same prompts were used across all models to keep comparisons fair; prompts and code are open-sourced (see the sketch after this list for the held-constant-prompt setup).
- Some commenters argue a single prompt per model is not enough because models are highly prompt‑sensitive; others stress any fair benchmark must hold prompts constant.
- The authors intend to further improve both the judge and patient simulators, likely via fine‑tuning.
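To make the prompt-consistency debate concrete, here is a minimal sketch of what holding prompts constant across models looks like. The prompt text, model identifiers, and the `build_clinician_messages` helper are placeholders of my own, not the project's released prompts.

```python
# Illustration of holding prompts constant across models; all names here are placeholders.
CLINICIAN_SYSTEM_PROMPT = (
    "You are a licensed therapist conducting a supportive session. "
    "Respond empathically and prioritize the patient's safety."
)

# Stand-ins for the actual GPT / Claude / Gemini identifiers evaluated in the benchmark.
MODELS_UNDER_TEST = ["model-a", "model-b", "model-c"]


def build_clinician_messages(model_id: str, transcript: list[dict[str, str]]) -> list[dict[str, str]]:
    """Every model receives the identical system prompt; only the backend differs.
    Per-model prompt tuning (what some commenters suggest) would branch on model_id here instead."""
    return [{"role": "system", "content": CLINICIAN_SYSTEM_PROMPT}, *transcript]


for model_id in MODELS_UNDER_TEST:
    messages = build_clinician_messages(model_id, [{"role": "patient", "content": "I feel hopeless lately."}])
    # ...send `messages` to the backend for model_id; the system prompt is identical every time.
```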
Human baseline & “struggle” framing
- Multiple people question the absence of a human-therapist control, especially one drawn from mainstream online therapy platforms, and say the results cannot support claims about the absolute "goodness" of care.
- The authors emphasize they are benchmarking LLMs, not comparing them to humans; they argue “room for improvement” is evident from the mid‑range scores alone.
- Several commenters criticize the wording “struggle in real mental health care,” saying that without outcome data or a human baseline, labeling sub‑4/6 as “struggling” is value-laden.
Skepticism about LLM‑based evals
- Some worry about “LLMs all the way down”: simulated patients and LLM judges risk converging on an internally consistent but human‑irrelevant notion of mental health.
- One commenter calls the work essentially “AI scoring AI conversations,” lacking real‑world clinical data; others still see value in a transparent starting point for evaluation.
Debate: should LLMs do mental health work at all?
- Critics call LLM therapy “self‑evidently a terrible idea,” highlighting past chatbot‑linked suicides and the risk that “something” can be worse than “nothing” if it reinforces psychosis or self‑harm.
- Supporters note that people are already using chatbots for distress, driven by access, cost, availability, and reduced shame compared to human therapists. They argue we must at least measure and improve safety.
Comparisons with human therapy
- Several note many human therapists are mediocre or harmful; experiences range from life‑changing help to years of ineffective CBT.
- There is disagreement over whether empathy is essential; some claim objective, even low‑empathy clinicians can still be effective, while others insist relational compassion is irreplaceable.
- Some suggest LLMs might eventually excel at mirroring and text‑based psychodiagnosis, while others say models remain too shallow, brittle, and sycophantic to handle complex therapeutic work.
Broader questions about efficacy and alternatives
- Commenters dispute how effective therapy itself is versus talking to friends or addressing social causes (isolation, social media, economic precarity).
- Several propose realistic near‑term roles: LLMs as adjuncts or “autopilots” supporting human therapists, or as low‑stakes, self‑help tools rather than full replacements.