2024-05-16

ChatGPT-4o vs. Math

Math performance on the tape problem

GPT‑4o often gets the tape-roll thickness problem wrong, especially when given the image; the main recurring error is treating the labeled diameters as radii.
With text-only input and an explicit chain-of-thought prompt, it solves the problem more reliably, but still not perfectly.
Some commenters think the model’s first attempt is “impressive but wrong”; others argue partial credit is irrelevant because the model is a product, not a student.
People compare its performance to average humans; opinion is split on whether “better than a random person on the street” is a meaningful bar.
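
The diameter/radius mix-up is easy to quantify. The sketch below assumes the usual setup for this kind of problem: the side-view area of the roll (an annulus) equals the unrolled length times the thickness. The numbers are illustrative assumptions, not values quoted in the discussion, and the function names are made up for this example.

```python
import math


def tape_thickness(outer_diameter_cm, inner_diameter_cm, length_cm):
    """Thickness from: annulus area = unrolled length * thickness."""
    outer_radius = outer_diameter_cm / 2
    inner_radius = inner_diameter_cm / 2
    annulus_area = math.pi * (outer_radius**2 - inner_radius**2)
    return annulus_area / length_cm


def tape_thickness_diameters_as_radii(outer_diameter_cm, inner_diameter_cm, length_cm):
    """The recurring mistake: use the labeled diameters as if they were radii."""
    annulus_area = math.pi * (outer_diameter_cm**2 - inner_diameter_cm**2)
    return annulus_area / length_cm


# Illustrative values only (assumed for this sketch).
OUTER_D_CM, INNER_D_CM, LENGTH_CM = 10.0, 5.0, 10_000.0

correct = tape_thickness(OUTER_D_CM, INNER_D_CM, LENGTH_CM)
wrong = tape_thickness_diameters_as_radii(OUTER_D_CM, INNER_D_CM, LENGTH_CM)
print(f"correct:            {correct:.5f} cm")
print(f"diameters as radii: {wrong:.5f} cm ({wrong / correct:.0f}x too large)")
```

Because both radii are doubled, the mistaken result is always four times the correct one, whatever the inputs, which makes this particular error easy to spot when checking an answer.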

Images, multimodality, and prompt engineering

Multimodal input often introduces extra failure modes: misreading labels and leaning on visual biases (e.g., assuming the “after” image is the better one in UI comparisons).
Several commenters find text-only + structured equations (LaTeX, symbolic form) more reliable than mixing in images.
Image-specific chain-of-thought (“extract all measurements first, make no assumptions”) improves accuracy somewhat but remains inconsistent.
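
One way to phrase such an image-specific prompt is sketched below. The wording of the steps and the function name `build_image_prompt` are assumptions made for illustration; commenters did not share an exact template.

```python
def build_image_prompt(question):
    """Build a prompt that forces measurement extraction before any solving."""
    steps = [
        "List every measurement visible in the image, with its unit and "
        "what it labels (for example, diameter vs. radius).",
        "Make no assumptions; if a value is not shown or stated, say so.",
        "Only then solve the problem step by step using the listed values.",
    ]
    numbered = "\n".join(f"{i}. {step}" for i, step in enumerate(steps, start=1))
    return f"{numbered}\n\nQuestion: {question}"


print(build_image_prompt("How thick is the tape?"))
```

Separating extraction from solving makes a misread label visible in the output, although, as noted above, it does not make the result consistent.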

Statistical vs. logical reasoning

Many commenters emphasize that LLM reasoning is fundamentally statistical, not logical; the model tends to choose probable continuations rather than enforce correctness.
This leads to confident but wrong math, and to changing answers when challenged rather than defending a correct result.
Some suggest bolting on formal tools (SAT solvers, theorem provers, Python, Wolfram) and using LLMs mainly to translate natural language into formal specs.
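
A minimal sketch of that division of labor, assuming the model is asked to emit only an arithmetic expression plus variable values, while deterministic code does the computing. The evaluator below is a hypothetical illustration built on Python's standard `ast` module, not a tool named in the discussion, and it is not hardened for untrusted input (for example, it does not bound exponent sizes).

```python
import ast
import math
import operator

_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
}
_UNARY_OPS = {ast.USub: operator.neg, ast.UAdd: operator.pos}
_CONSTANTS = {"pi": math.pi}


def evaluate(expression, variables=None):
    """Evaluate a plain arithmetic expression; reject anything else."""
    names = {**_CONSTANTS, **(variables or {})}

    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if (
            isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)
        ):
            return node.value
        if isinstance(node, ast.Name) and node.id in names:
            return names[node.id]
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](walk(node.operand))
        # Function calls, attribute access, unknown names, etc. end up here.
        raise ValueError(f"unsupported expression element: {ast.dump(node)}")

    return walk(ast.parse(expression, mode="eval"))


# Hand-written stand-in for what a model might emit as its "formal spec".
model_output = {"expression": "(a + b) * c ** 2", "variables": {"a": 3, "b": 4, "c": 2}}
print(evaluate(model_output["expression"], model_output["variables"]))
```

The model can still produce the wrong expression, but the arithmetic itself is no longer subject to a probable-continuation guess, and the expression is short enough for a human to review.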

Reliability, determinism, and prompting tricks

Non-determinism is a concern: the same question can yield different answers across runs, so a model that is right once may be wrong the next time.
With temperature 0 and fixed seeds, API calls can be made deterministic, which also means a wrong answer would be repeatable.
Users report success with “double check” / multi-pass prompts and re-asking the same query to reduce errors, but this increases cost and remains heuristic.
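
The re-asking approach amounts to a majority vote over repeated runs. The sketch below assumes a caller-supplied `ask` function that wraps whatever model API is in use; here it is replaced by a scripted stand-in with made-up replies so the example runs offline.

```python
from collections import Counter


def majority_answer(ask, question, runs=5):
    """Ask the same question several times; return (top answer, agreement ratio)."""
    answers = [ask(question).strip() for _ in range(runs)]
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / runs


def make_scripted_model(replies):
    """Hypothetical stand-in for a model call: returns canned replies in order."""
    pending = iter(replies)
    return lambda question: next(pending)


# Made-up replies, used only to exercise the voting logic.
ask = make_scripted_model(
    ["0.0059 cm", "0.0236 cm", "0.0059 cm", "0.0059 cm", "0.0236 cm"]
)
answer, agreement = majority_answer(ask, "How thick is the tape?", runs=5)
print(answer, agreement)
```

Cost grows linearly with `runs`, and a model that is consistently wrong will still win the vote, which is why commenters describe the technique as a heuristic rather than a guarantee.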

Usefulness vs. limitations for math and code

Several commenters distrust LLMs for precise math, advanced topics, or production-grade code; the effort of verifying the output often matches that of doing the work yourself.
Others find them genuinely helpful for:
- High-level overviews, prerequisites, and orientation in unfamiliar fields.
- Drafting code / Wolfram Language snippets that are then verified and run by the user.
- Inspiration and informal “conversation” about mathematical ideas.

Broader reflections on LLMs and AGI

Commenters debate whether weak math skills disqualify LLMs from being “intelligent,” or whether their general language ability is already transformative.
Some argue math is “low-hanging fruit” for logical AI; others point to complexity (NP-complete reasoning, undecidability) and limited training data for step-by-step math.
There is skepticism about calling current systems “AGI,” especially given their lack of stable memory and robust logical reasoning, though a few see them as already “general” in a practical sense.