I let ChatGPT analyze a decade of my Apple Watch data, then I called my doctor

Apple Watch & VO2 Max Accuracy

  • Debate over who is to blame: some argue Apple misrepresents the Apple Watch VO2 max estimate as “validated,” while others note that Apple’s own studies show systematic underestimation and wide individual error, so it was never clinical grade to begin with.
  • Several commenters report Apple Watch (and similar devices) giving implausibly low VO2 max or alarming fitness warnings that doctors later dismissed.
  • Others say wearables (especially Garmin and Oura) can be quite accurate for trends and useful when used correctly, but they require controlled conditions and are sensitive to confounders like pace, altitude, and whether workouts are recorded at all.

What an LLM Can (and Can’t) Do With Health Data

  • Strong view that LLMs are the wrong tool for raw multi‑year time series: they produce plausible text, not validated numerical analysis, and will “simulate” analysis rather than perform it.
  • Some suggest the right pattern is to have the LLM generate code or notebooks that analyze the data, then review the results with a doctor (see the sketch after this list).
  • Others counter that specialized models for wearable data exist and could, in theory, be aligned with LLMs, but this isn’t what generic chatbots are doing now.
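  One way to act on that suggestion concretely: instead of pasting a decade of readings into a chat, have the model emit a small script whose outputs (counts, yearly means, flagged outliers) can be checked and brought to a clinician. Below is a minimal sketch in Python, assuming the VO2 max records have already been exported to a CSV named vo2max_export.csv with date and vo2_max columns; the file name, column names, and the 15-70 ml/kg/min plausibility range are illustrative assumptions, not the article's method.

      import pandas as pd

      # Hypothetical export of Apple Health VO2 max readings, one row per reading.
      df = pd.read_csv("vo2max_export.csv", parse_dates=["date"])
      df = df.sort_values("date").set_index("date")

      # Year-by-year summary (count, mean, spread) so a clinician sees a trend,
      # not a single compressed "grade".
      yearly = df["vo2_max"].groupby(df.index.year).agg(["count", "mean", "std"])
      print(yearly.round(1))

      # A 90-day rolling mean smooths out individual noisy readings.
      df["rolling_90d"] = df["vo2_max"].rolling("90D").mean()

      # Flag implausible values for human review instead of auto-interpreting them.
      implausible = df[(df["vo2_max"] < 15) | (df["vo2_max"] > 70)]
      print(f"{len(implausible)} readings fall outside a 15-70 ml/kg/min range")

  The point of the sketch is that every number it prints is reproducible and inspectable, which a chat transcript of “simulated” analysis is not.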

Responsibility, Risk, and Regulation

  • Split between “users should know it can be wrong; warnings exist” and “marketing and product design explicitly portray ChatGPT as trustworthy for health, so the burden is on the company.”
  • Some want stricter guardrails: health Q&A only at a general level, explicit refusal to interpret personal data, stronger disclaimers or gating.
  • Others argue society routinely uses imperfect tools; banning access until models are “perfect” is unrealistic.

False Positives, Anxiety, and Healthcare Costs

  • Multiple anecdotes of frightening but wrong AI “diagnoses” leading to traumatic worry and unnecessary medical workups.
  • Others share cases where ChatGPT suggested overlooked possibilities (e.g., gallbladder issue) that ultimately proved correct after specialist testing.
  • Several note that in medicine, false positives are costly (money, time, radiation, procedures, anxiety), so a model that “sees red flags everywhere” is harmful.

Doctors vs. AI, and How to Use These Tools

  • Many emphasize that doctors view metrics in context (symptoms, risk factors, exam), whereas the article asked an LLM to compress heterogeneous metrics into a single “grade,” something doctors don’t do.
  • Some feel doctors under-address “small problems” and subtle fitness issues, leaving a vacuum filled by wearables, forums, and now AI.
  • Others stress doctors vary widely in quality and staying current; in the best case, AI can help patients ask better questions and surface research, not replace clinical judgment.

Health Metrics Need Context

  • Commenters highlight that VO2 max, BMI, HRV, resting heart rate, and similar metrics are population‑level tools, not absolute individual health scores.
  • Fitness vs. health distinction: someone can be “healthy enough” by medical standards yet unfit by athletic standards; an internet‑trained model may adopt the fitness‑culture framing and grade harshly.
  • Overemphasis on a single metric (VO2 max, BMI) without clinical context is seen as a core flaw in both the article’s setup and AI‑driven “health grades.”

Privacy and Data Use

  • Some find the very act of uploading detailed health data to a commercial AI service “alarming,” given data‑sale incentives and unclear secondary uses.
  • Others are more focused on potential future benefits (long‑term baseline data for better models) and ask for minimally obtrusive trackers with local, exportable data.

Overall Sentiment

  • Broad consensus: current general‑purpose LLMs are not ready to interpret personal medical data or issue health grades.
  • Many see potential value in specialized, clinically validated models paired with human clinicians, and in using AI as a pattern‑spotter and explainer—not as an oracle.