Kids who use ChatGPT as a study assistant do worse on tests

Study design & core findings

  • Study had three groups during math practice:
    • Control: no GPT.
    • “GPT Base”: standard GPT‑4, which gives full worked answers.
    • “GPT Tutor”: GPT‑4 prompted to give hints rather than answers, and supplied with the correct solutions to the problem set.
  • With GPT access during practice:
    • GPT Base students solved ~48% more practice problems correctly.
    • GPT Tutor students solved ~127% more practice problems correctly.
  • On a later closed-book exam with no GPT:
    • GPT Base group scored ~17% worse than control.
    • GPT Tutor group was statistically indistinguishable from control (slightly lower but not significant).
  • GPT‑4 had a high error rate overall, especially on multi-step reasoning; the tutor version was supplied with the correct answers to mitigate this.
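The tutor condition described above can be sketched as prompt construction. This is a hypothetical illustration only: the function name and prompt wording are assumptions, not the study's actual prompt. The key idea it shows is that the correct answer is given to the model for grounding, while the system prompt forbids revealing it.

```python
def build_tutor_messages(problem: str, correct_answer: str) -> list[dict]:
    """Build chat messages for a hints-only tutor (illustrative sketch).

    As in the study's tutor condition, the model is fed the correct answer
    (to reduce its own error rate) but is instructed to guide the student
    with hints and never state the final answer.
    """
    system = (
        "You are a math tutor. Guide the student with questions and hints. "
        "Never state the final answer, even if asked directly. "
        f"For your reference only, the correct answer is: {correct_answer}. "
        "Do not reveal it."
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": problem},
    ]

# The resulting messages list can be passed to any chat-completions API:
msgs = build_tutor_messages("Solve 3x + 5 = 20 for x.", "x = 5")
```

The design point is that constraining behavior happens entirely in the system prompt here; the study's actual implementation details (prompt text, tuning) are not given in this summary.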

How students used GPT & overconfidence

  • Many commenters infer that students, especially in the Base condition, offloaded their thinking to GPT rather than working through the problems.
  • Both GPT groups were more confident they’d done well, despite equal or worse exam scores, suggesting inflated self-assessment.

Struggle, learning, and “crutch” concerns

  • Repeated theme: productive struggle (trying, failing, correcting) is where learning happens; instant answers short-circuit this.
  • Several developers report similar effects with Copilot/LLMs: their “thinking turns off” when an easy button is available.
  • Others say they’ve learned more (e.g., bash, AI, deep learning) via ChatGPT than via traditional resources, when used as an explainer/tutor after personal effort.

Comparisons to other tools

  • Analogies to:
    • Parents doing homework.
    • Calculators in early math.
    • Stack Overflow / Google search.
  • Consensus: tools can either be force multipliers or crutches; experts benefit more because they can verify and direct the tool.

Assessment relevance & future skills

  • Some argue that a no-GPT exam is like banning cars and then testing horse riding; “real world” performance with AI available may matter more.
  • Others counter that:
    • Many tasks still can’t safely rely on LLMs.
    • You must already understand the domain to catch hallucinations.
    • Foundational reasoning skills are still essential.

Study limitations & policy implications

  • Conducted in a single Turkish high school; generalizability is questioned, though many doubt the results would differ radically elsewhere.
  • It’s a preprint, not yet peer-reviewed; some see the media title as misleading or overly strong.
  • Broad takeaway in the thread: unmanaged “answer-giving” GPT harms learning; a carefully constrained “tutor mode” at least avoids that harm but, in this study, didn’t clearly improve learning either.