Ontario auditors find doctors' AI note takers routinely blow basic facts

Scope of AI Note-Taker Problems

  • Multiple anecdotes of LLM note-takers fabricating or distorting key details in meetings and medical visits.
  • Examples include: a vendor “promising” something they did not; Zoom summaries misattributing statements; a runner’s knee visit turned into an osteoporosis diagnosis with invented symptoms.
  • Users report that the tools can “get the gist” of simple, linear interactions but fail badly on nuanced, technical, or emotionally charged conversations.

Transcripts vs Summaries & Provenance

  • Several commenters argue transcripts should be the legal/clinical ground truth, with optional human-written summaries.
  • Others note speech-to-text itself is probabilistic and can also mislead if treated as authoritative.
  • Strong support for timestamped links from summaries back to recordings (“provenance”) in non-medical settings, along with concern that this is harder to do in HIPAA-like environments.
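The provenance idea can be sketched as a small data structure in which every summary claim carries pointers back to the transcript segments that support it. This is only an illustrative sketch, not any vendor's format: the names `TranscriptSegment`, `SummaryClaim`, and `render_with_provenance` are hypothetical, and the sample visit is invented to echo the knee/osteoporosis anecdote above.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class TranscriptSegment:
    """One timestamped utterance from the recording (hypothetical schema)."""
    start_s: float
    end_s: float
    speaker: str
    text: str


@dataclass(frozen=True)
class SummaryClaim:
    """One sentence of the summary plus the transcript segments backing it."""
    text: str
    source_segments: Tuple[int, ...]  # indexes into the transcript list


def format_timestamp(seconds: float) -> str:
    """Render whole seconds as MM:SS (fractions are truncated)."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def unsupported_claims(claims: List[SummaryClaim]) -> List[SummaryClaim]:
    """Claims with no linked segment are the ones a human must check first."""
    return [c for c in claims if not c.source_segments]


def render_with_provenance(
    claims: List[SummaryClaim], transcript: List[TranscriptSegment]
) -> List[str]:
    """Attach a time range to each claim, or flag it when none exists."""
    lines = []
    for claim in claims:
        if claim.source_segments:
            refs = ", ".join(
                f"{format_timestamp(transcript[i].start_s)}-"
                f"{format_timestamp(transcript[i].end_s)}"
                for i in claim.source_segments
            )
            lines.append(f"{claim.text} [source: {refs}]")
        else:
            lines.append(f"{claim.text} [NO SOURCE: needs human review]")
    return lines


# Invented sample data, for illustration only.
transcript = [
    TranscriptSegment(12.0, 19.5, "patient", "My knee hurts after long runs."),
    TranscriptSegment(20.0, 31.0, "doctor", "Let's rest it for two weeks."),
]
claims = [
    SummaryClaim("Patient reports knee pain after long runs.", (0,)),
    SummaryClaim("Patient diagnosed with osteoporosis.", ()),
]
rendered = render_with_provenance(claims, transcript)
```

A link of this kind only shows where a claim is supposed to come from. It does not prove the claim is faithful to the recording, and as noted above, the transcript itself may contain recognition errors.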

Medical Context, Risk, and Responsibility

  • Many see AI scribes in healthcare as especially dangerous: mixing up drugs or diagnoses is unacceptable.
  • Some argue human documentation is already error-prone, but others stress:
    • Machines must be better than humans to be worth using.
    • AI errors are qualitatively different (confident hallucinations of things never said).
  • Patients are urged to check visit summaries and request corrections; some already do this routinely.
  • Doctors report having to spend extra time correcting AI notes, sometimes feeling the tech is being forced on them.

Procurement, Incentives, and Data Exploitation

  • Ontario’s vendor scoring is criticized: domestic presence heavily weighted, note accuracy only a small part of the score.
  • Commenters worry less about short-term accuracy than about long-term incentives: real-time data flowing to insurers, pharma, and hospital billing, with little alignment to patient interests.

Capabilities vs Reliability and “Knowing What It Doesn’t Know”

  • Ongoing debate over whether model accuracy will naturally improve enough for critical use.
  • Distinction raised between capability (benchmarks, impressive demos) and reliability (consistent, low-risk behavior in production).
  • Extended side-discussion on confidence estimation and calibration:
    • Models can output probability distributions over tokens, but turning this into trustworthy “I don’t know” behavior remains unsolved in practice.
    • Some believe models could be trained to refuse answers more often; others think business incentives discourage visible uncertainty.
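To make the calibration point concrete, here is a naive sketch of turning per-token log-probabilities into an abstention rule. It assumes the serving system exposes log-probabilities for the generated tokens; the function names and the 0.8 threshold are hypothetical choices for illustration, not a recommended setting.

```python
import math
from typing import List


def mean_token_probability(token_logprobs: List[float]) -> float:
    """Geometric mean of the token probabilities, i.e. exp(mean log-prob)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def answer_or_abstain(
    text: str, token_logprobs: List[float], threshold: float = 0.8
) -> str:
    """Return the generated text, or abstain when the score is below threshold.

    The threshold is an arbitrary illustrative value; in practice it would
    have to be tuned against labeled data.
    """
    if mean_token_probability(token_logprobs) < threshold:
        return "I don't know"
    return text


# Two made-up generations: one where every token had probability 0.9,
# and one where the tokens had probabilities 0.5 and 0.4.
confident = answer_or_abstain("Rest the knee.", [math.log(0.9), math.log(0.9)])
uncertain = answer_or_abstain("Osteoporosis.", [math.log(0.5), math.log(0.4)])
```

This heuristic shows why the problem is described as unsolved in practice: token probability measures how fluent or expected the wording is, not whether the statement is true, so a fabricated detail can still score highly.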

Privacy, Recording, and Appropriate Use

  • Split views on recording whole doctor–patient conversations:
    • One side sees comprehensive recording as aligned with the idea of medical records.
    • The other stresses the historic role of physicians as filters, privacy risks over a lifetime, and chilling effects on honest disclosure.
  • Several argue that AI is being misapplied to accuracy-critical back-end tasks like clinical documentation, and might be more appropriate for front-end intake, triage help, or form-filling, always with human verification.