GPT-5o-mini hallucinates medical residency applicant grades

Context and real‑world impact

  • A residency management vendor used an LLM-based system (“GPT‑5o‑mini” in their docs) to auto-extract clerkship grades from unstandardized PDF transcripts.
  • Programs detected discrepancies, including fabricated “fail” grades, directly affecting applicants’ perceived competitiveness in a very high‑stakes process.
  • The company corrected the specific errors that were reported but appears to be keeping the tool in service, positioning it as “for reference” with manual verification recommended.

Why they used LLMs instead of structured data

  • Many argue this problem exists only because schools send heterogeneous PDFs rather than providing structured data or a shared API.
  • Others counter that getting thousands of institutions to adopt a standard or API is extremely hard; PDFs and even fax/FTP‑style flows remain the de facto inter‑org medium.
  • Suggestions such as having applicants self‑enter their grades run into real complexity: nonstandard grading schemes, grade distributions, narrative rankings, and students who never see their full official letters.

Technical debate: PDFs, OCR, and LLM suitability

  • Some say this should have been solved with traditional OCR + parsing, and that reaching for LLMs was an overkill, marketing‑driven choice.
  • Others, with experience in insurance/finance/document processing, counter that arbitrary PDFs (especially tables, multi‑column layouts, and scans) are not a solved problem, and that vision‑LLMs are in fact the state of the art.
  • There is disagreement over whether the vendor ran classic OCR followed by an LLM, or used a vision‑LLM for the OCR‑like extraction itself. Either way, critics stress that trusting a single LLM pass as ground truth is irresponsible; a cross‑checking sketch follows this list.
  • Using a small/“mini” model for such a critical task is widely seen as especially reckless.
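
One way to avoid trusting a single pass, in the spirit of the criticism above rather than anything the vendor describes, is to run two independent extractions (different models, prompts, or OCR pipelines) and only accept fields where they agree. The sketch below is a minimal, hypothetical example: the two input dicts are assumed to be the outputs of whatever extraction passes already exist.

```python
from dataclasses import dataclass


@dataclass
class GradeCheck:
    course: str
    pass_a: str | None   # grade from extraction pass A (e.g. a vision-LLM)
    pass_b: str | None   # grade from extraction pass B (e.g. OCR + parser)
    agreed: bool


def cross_check(pass_a: dict[str, str], pass_b: dict[str, str]) -> list[GradeCheck]:
    """Compare two independent extraction passes; anything that does not
    match exactly is flagged for manual review instead of being trusted."""
    results = []
    for course in sorted(set(pass_a) | set(pass_b)):
        a = pass_a.get(course)
        b = pass_b.get(course)
        agreed = a is not None and b is not None and a.strip().lower() == b.strip().lower()
        results.append(GradeCheck(course, a, b, agreed))
    return results


if __name__ == "__main__":
    # Hypothetical outputs from two separate extraction passes.
    llm_pass = {"Internal Medicine": "Honors", "Surgery": "Fail"}
    ocr_pass = {"Internal Medicine": "Honors", "Surgery": "High Pass"}
    for check in cross_check(llm_pass, ocr_pass):
        status = "OK" if check.agreed else "NEEDS MANUAL REVIEW"
        print(f"{check.course}: {check.pass_a!r} vs {check.pass_b!r} -> {status}")
```

Agreement between two passes is not proof of correctness, only a cheap filter; disagreements still need a human looking at the original transcript.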

Hallucinations, terminology, and model limits

  • Multiple comments debate the word “hallucination”:
    • Some dislike it as anthropomorphic; the model is just generating plausible text by design, not “seeing things.”
    • Others defend it as an effective shorthand for “confidently wrong outputs” for nontechnical users.
  • Several note that adding RAG/search does not eliminate errors; models can still confidently invent content “in the language of” the retrieved documents.

Responsibility, validation, and product design

  • Many see “verify manually” disclaimers as unrealistic: in practice, busy reviewers will treat AI output as authoritative, especially when sold as efficiency‑boosting.
  • Commenters call this a software/quality problem more than an AI problem: there is no evident benchmarking, error‑rate measurement, multi‑model cross‑checking, or automated validation against the original text (a minimal grounding check is sketched after this list).
  • Broader concern: strong business pressure to deploy LLMs in consequential decision flows despite well‑known, persistent failure modes.
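
As an illustration of what “automated validation against the original text” could look like (my sketch, not the vendor’s pipeline): require that every extracted course/grade pair can be found, after normalization, in the text layer of the source transcript, and route anything that cannot be grounded to a human.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so minor layout differences don't matter."""
    return re.sub(r"\s+", " ", text).lower().strip()


def grounded(extracted: dict[str, str], transcript_text: str) -> dict[str, bool]:
    """Return, for each course, whether both the course name and the extracted
    grade literally appear in the transcript text. A failed check does not mean
    the value is wrong, only that it cannot be verified automatically."""
    haystack = normalize(transcript_text)
    return {
        course: normalize(course) in haystack and normalize(grade) in haystack
        for course, grade in extracted.items()
    }


if __name__ == "__main__":
    # Hypothetical transcript text (e.g. from a PDF text layer or OCR output).
    transcript = "Internal Medicine Clerkship ... Final Grade: Honors"
    extracted = {"Internal Medicine": "Honors", "Surgery": "Fail"}
    for course, ok in grounded(extracted, transcript).items():
        print(f"{course}: {'grounded' if ok else 'UNVERIFIED - send to manual review'}")
```

This is deliberately crude: it does not tie a grade to the correct row, so a stray “Fail” elsewhere in the document would pass. Even so, a check of this kind would have flagged a fabricated grade that never appears anywhere in the transcript.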