OpenAI’s o1 correctly diagnosed 67% of ER patients vs. 50-55% by triage doctors

Study design & limitations

  • Many argue the trial is closer to a “paper quiz” than real ER work: AI and doctors saw text from electronic records and nurse notes, not real patients.
  • Doctors were forced to diagnose from notes alone, which they rarely do in practice; physical exam, conversation, and observation were excluded.
  • When both AI and humans had fuller case details, the performance gap shrank and was no longer statistically significant, weakening “AI beats doctors” claims.
  • Some note the study uses older models and vignette-style cases, which are useful early steps but far from real-world validation.
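
The point about statistical significance can be made concrete with a rough sketch. The snippet below runs a standard two-proportion z-test on the headline figures (67% vs. 55%). The case counts are hypothetical, since the summary above does not give the study's sample size, and the function name is illustrative. It also treats the two groups as independent samples; if the AI and the doctors were scored on the same cases, a paired test would be more appropriate.

```python
import math

def two_proportion_z_test(hits_a, n_a, hits_b, n_b):
    """Return (z, two-sided p-value) for H0: both accuracy rates are equal."""
    p_a = hits_a / n_a
    p_b = hits_b / n_b
    # Pooled accuracy under the null hypothesis of no difference.
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# HYPOTHETICAL sample sizes: the same 67% vs. 55% gap at two scales.
z_small, p_small = two_proportion_z_test(67, 100, 55, 100)
z_large, p_large = two_proportion_z_test(268, 400, 220, 400)

print(f"n=100 per arm: z={z_small:.2f}, p={p_small:.3f}")
print(f"n=400 per arm: z={z_large:.2f}, p={p_large:.4f}")
```

With these assumed counts, the same 12-point gap gives a p-value above 0.05 at 100 cases per arm and well below 0.05 at 400, so a headline percentage says little without the sample size and the test behind it.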

Comparisons with other AI-medical studies

  • Commenters cite a recent chest x‑ray benchmark on which an AI outperformed radiologists without even seeing the images, an illustration of how flawed benchmarks can be.
  • In another study, “ChatGPT Health” reportedly mis-triaged about half of emergency cases, showing how inconsistent results are across setups and models.

Human vs AI capabilities

  • Supporters think diagnosis is largely pattern recognition over vast knowledge, so specialized medical models will likely surpass most doctors over time.
  • Skeptics emphasize:
    • Physical exam, nuanced history-taking, and detecting deceit or missing info.
    • Judgment under uncertainty, and knowing when to say “I don’t know, we need more tests.”
    • Emotional presence during crises (e.g., cancer diagnoses) as fundamentally human.

Bias, trust, and patient experiences

  • Multiple anecdotes:
    • Some describe missed or delayed diagnoses by human doctors, especially for women and for complex or rare conditions.
    • Others report LLMs helping identify conditions (e.g., long Covid, MCAS) or interpret labs better than rushed clinicians.
    • Still others had AI completely miss serious issues (e.g., hip problems on x‑ray), reinforcing caution.

System-level incentives and risks

  • Concerns that:
    • AI may be optimized for liability or billing, not patient welfare.
    • Metric-driven use (e.g., ranking doctors against AI) could lead to gaming, overreliance, and eventual de-skilling.
    • Insurers and private equity firms may use AI to cut costs or deny care.

Proposed roles for AI

  • A common middle-ground view casts AI as:
    • Triage aid, second opinion, and guideline-following checker.
    • Research assistant and note-taker.
    • Tool that should augment, not replace, accountable human clinicians.