Kids who use ChatGPT as a study assistant do worse on tests

Study design & core findings

  • Study had three groups during math practice:
    • Control: no GPT.
    • “GPT Base”: standard GPT‑4, which gives full worked answers.
    • “GPT Tutor”: GPT‑4 prompted to give hints rather than answers, and supplied with the correct solutions to the problem set.
  • With GPT access during practice:
    • GPT Base students solved ~48% more practice problems correctly.
    • GPT Tutor students solved ~127% more practice problems correctly.
  • On a later closed-book exam with no GPT:
    • GPT Base group scored ~17% worse than control.
    • GPT Tutor group was statistically indistinguishable from control (slightly lower but not significant).
  • GPT‑4 had a high error rate overall, especially on multi-step reasoning; the tutor version was supplied with the correct answers to mitigate this.
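The tutor condition described above can be sketched as prompt construction. This is a hypothetical illustration only: the function name and prompt wording are assumptions, not the study's actual prompt. The key idea it shows is that the correct answer is given to the model for grounding, while the system prompt forbids revealing it.

```python
def build_tutor_messages(problem: str, correct_answer: str) -> list[dict]:
    """Build chat messages for a hints-only tutor (illustrative sketch).

    As in the study's tutor condition, the model is fed the correct answer
    (to reduce its own error rate) but is instructed to guide the student
    with hints and never state the final answer.
    """
    system = (
        "You are a math tutor. Guide the student with questions and hints. "
        "Never state the final answer, even if asked directly. "
        f"For your reference only, the correct answer is: {correct_answer}. "
        "Do not reveal it."
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": problem},
    ]

# The resulting messages list can be passed to any chat-completions API:
msgs = build_tutor_messages("Solve 3x + 5 = 20 for x.", "x = 5")
```

The design point is that constraining behavior happens entirely in the system prompt here; the study's actual implementation details (prompt text, tuning) are not given in this summary.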

How students used GPT & overconfidence

  • Many commenters infer that students, especially in the Base condition, offloaded their thinking to GPT rather than working through the problems.
  • Both GPT groups were more confident they’d done well, despite equal or worse exam scores, suggesting inflated self-assessment.

Struggle, learning, and “crutch” concerns

  • Repeated theme: productive struggle (trying, failing, correcting) is where learning happens; instant answers short-circuit this.
  • Several developers report similar effects with Copilot/LLMs: their “thinking turns off” when an easy button is available.
  • Others say they’ve learned more (e.g., bash, AI, deep learning) via ChatGPT than via traditional resources, when used as an explainer/tutor after personal effort.

Comparisons to other tools

  • Analogies to:
    • Parents doing homework.
    • Calculators in early math.
    • Stack Overflow / Google search.
  • Consensus: tools can either be force multipliers or crutches; experts benefit more because they can verify and direct the tool.

Assessment relevance & future skills

  • Some argue that a no-GPT exam is like banning cars and then testing horse riding; “real world” performance with AI available may matter more.
  • Others counter that:
    • Many tasks still can’t safely rely on LLMs.
    • You must already understand the domain to catch hallucinations.
    • Foundational reasoning skills are still essential.

Study limitations & policy implications

  • Conducted in a single Turkish high school; generalizability is questioned, though many doubt the results would differ radically elsewhere.
  • It’s a preprint, not yet peer-reviewed; some see the media title as misleading or overly strong.
  • Broad takeaway in the thread: unmanaged “answer-giving” GPT harms learning; a carefully constrained “tutor mode” at least avoids that harm but, in this study, didn’t clearly improve learning either.