HealthBench – An evaluation for AI systems and human health
Model performance, visibility, and access
- Commenters note that Grok scores surprisingly well and argue its lower mindshare versus Gemini/Llama owes more to its lack of API access until recently than to open‑weights concerns.
- Some point out that open weights are mostly irrelevant here since only one of the ten benchmarked models is open anyway.
- Gemini’s performance is seen as better than expected, with speculation that its tendency to refuse health topics (“censorship”) likely kept its scores lower than they could have been. Med‑PaLM is mentioned as obsolete, superseded by Gemini.
Trust, bias, and conflict of interest
- Many see an inherent conflict when a model vendor designs its own benchmark, especially one where its model narrowly beats competitors.
- Others argue the benchmark is still useful, but should be read skeptically given no company would publish a study that makes its product look bad.
- Some suggest such benchmarks should come from neutral or nonprofit entities.
Real‑world behavior: successes and failures
- Multiple anecdotes:
  - Serious hallucinations (invented cancer on a lab report, misdiagnosed anemia vs thalassemia) and generic, outdated advice (e.g., low‑fat diets).
  - Strong positive cases where o3/o3‑deep‑research gave plausible diagnoses, timelines, and rehab plans that matched or surpassed prior human input.
- Users highlight confusion over which model ChatGPT is actually using (4o vs 4o‑mini), noting that “normies” can’t be expected to understand model quality differences.
Use cases, benchmarks, and system design
- Some want a benchmark focused narrowly on diagnosis (symptoms + history → ground‑truth diagnosis); a minimal sketch of such an item and scorer follows this list.
- Others question the benchmark’s realism, since real deployments often wrap base models with RAG, guardrails, and workflows; the counterpoint is that testing the bare models accurately reflects “people chatting to ChatGPT.”
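To make the diagnosis‑only idea concrete, here is a minimal sketch of what such a benchmark item and scorer could look like. All names are hypothetical, and a real grader would need fuzzier matching (synonyms, ICD codes) than exact string equality:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class DiagnosisCase:
    symptoms: str       # free-text presenting complaint
    history: str        # relevant patient history
    ground_truth: str   # reference diagnosis, e.g. "iron-deficiency anemia"

def diagnosis_accuracy(cases: list[DiagnosisCase],
                       predict: Callable[[str, str], str]) -> float:
    """Fraction of cases where the model's diagnosis matches the reference label."""
    hits = sum(
        predict(c.symptoms, c.history).strip().lower() == c.ground_truth.strip().lower()
        for c in cases
    )
    return hits / len(cases)
```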
Healthcare economics, access, and substitution
- Strong sentiment that many simple cases (e.g., cough medicine prescriptions) could be safely handled by AI, reducing unnecessary visits and costs—especially in systems with severe doctor shortages.
- Others respond that expertise matters precisely in non‑obvious cases, and that patients can’t reliably tell “simple” from dangerous.
- There is concern that AI will be used to justify shifting more responsibility to less‑qualified staff while maintaining prices, exacerbating profit extraction rather than lowering costs.
Safety, liability, and regulation
- One side: LLMs are pseudo‑random text machines that hallucinate and should not be trusted for health advice; this “insanity” must be tightly regulated.
- The other side: human clinicians are also biased, overworked, and fallible; a careful human–AI synthesis could outperform either alone if properly regulated and benchmarked.
- Debate centers on acceptable tradeoffs: saving time and money for many vs the risk of missed serious diagnoses, and how to quantify those tradeoffs.
Doctors using AI vs replacement fears
- Some report doctors already using ChatGPT to look up guidelines and organize thinking, seeing it as an extension of their judgment, not a replacement.
- Others worry that institutions will treat AI outputs as authoritative, degrading human judgment and using the tech to justify staff downgrading.
Miscellaneous
- Several nitpick the “worst‑case at k samples” chart as visually confusing because its colors are nearly identical (a sketch of the metric follows below).
- One commenter laments the lack of Greek‑language support despite the Greek roots of much medical terminology.
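For context on that chart, a plausible reading of “worst‑case at k samples” — assuming it means taking the lowest rubric score among k responses sampled per prompt and averaging that minimum over prompts — could be computed like this (illustrative only):

```python
# Assumed reading of "worst-case at k samples": per prompt, sample k responses,
# keep the lowest rubric score, then average those minima across prompts.
def worst_at_k(scores_per_prompt: list[list[float]], k: int) -> float:
    """scores_per_prompt[p] holds at least k rubric scores for prompt p."""
    return sum(min(scores[:k]) for scores in scores_per_prompt) / len(scores_per_prompt)
```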