ChatGPT-4o vs. Math
Math performance on the tape problem
- GPT-4o often gets the tape-roll thickness problem wrong, especially when working from the image; the main recurring error is treating the labeled diameters as radii.
- With text-only plus an explicit chain-of-thought prompt, it solves the problem more reliably, but still not perfectly.
- Some viewers think the model’s first attempt is “impressive but wrong”; others argue partial credit is irrelevant because it’s a product, not a student.
- People compare its performance to average humans; opinion is split on whether “better than a random person on the street” is a meaningful bar.
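To make the diameter-versus-radius slip concrete, here is a minimal sketch. It assumes the problem gives the roll's outer diameter, inner diameter, and total tape length, and asks for the thickness of one layer. The function names and measurements are hypothetical and are not taken from the original problem.

```python
import math


def tape_thickness(outer_diameter, inner_diameter, length):
    """Thickness of one tape layer, from the roll's cross-section area.

    Seen side-on, the wound tape fills an annulus of area pi * (R^2 - r^2),
    which must equal length * thickness. R and r are radii, i.e. half the
    labeled diameters.
    """
    outer_radius = outer_diameter / 2
    inner_radius = inner_diameter / 2
    return math.pi * (outer_radius**2 - inner_radius**2) / length


def tape_thickness_diameters_as_radii(outer_diameter, inner_diameter, length):
    """The recurring mistake: plugging diameters in where radii belong."""
    return math.pi * (outer_diameter**2 - inner_diameter**2) / length


# Hypothetical measurements in centimetres (assumed for illustration only).
outer_d, inner_d, tape_length = 10.0, 5.0, 10_000.0

correct = tape_thickness(outer_d, inner_d, tape_length)
wrong = tape_thickness_diameters_as_radii(outer_d, inner_d, tape_length)
print(f"correct: {correct:.5f} cm, wrong: {wrong:.5f} cm, ratio: {wrong / correct:.1f}")
```

Because both radii are squared, the mistake inflates the answer by a factor of four whatever the measurements are, so the wrong result can still look plausible at a glance.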
Images, multimodality, and prompt engineering
- Multimodal input often introduces extra failure modes: misreading labels, and leaning on visual biases (e.g., assuming the “after” image is the better one in UI comparisons).
- Several commenters find text-only + structured equations (LaTeX, symbolic form) more reliable than mixing in images.
- Image-specific chain-of-thought (“extract all measurements first, make no assumptions”) improves accuracy somewhat but remains inconsistent.
Statistical vs logical reasoning
- Many emphasize that LLM reasoning is fundamentally statistical, not logical; it tends to choose probable continuations rather than enforce correctness.
- This leads to confident but wrong math, and to changing answers when challenged rather than defending a correct result.
- Some suggest bolting on formal tools (SAT solvers, theorem provers, Python, Wolfram) and using LLMs mainly to translate natural language into formal specs.
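One way to read the "bolt on a tool" suggestion is to have the model emit only a formal expression and leave the arithmetic to an evaluator. The sketch below assumes the model's output is a plain arithmetic string; `evaluate_arithmetic` is a hypothetical helper, not part of any LLM API, and it accepts only numbers, `pi`, and basic operators.

```python
import ast
import math
import operator

# Whitelisted operations; anything else is rejected.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
}
_UNARY_OPS = {ast.USub: operator.neg, ast.UAdd: operator.pos}
_CONSTANTS = {"pi": math.pi}


def evaluate_arithmetic(expression):
    """Evaluate an arithmetic expression by walking its syntax tree.

    Avoids eval(): only numeric literals, the names in _CONSTANTS, and the
    whitelisted operators are allowed. Raises ValueError otherwise.
    """

    def visit(node):
        if isinstance(node, ast.Expression):
            return visit(node.body)
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.Name) and node.id in _CONSTANTS:
            return _CONSTANTS[node.id]
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](visit(node.left), visit(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](visit(node.operand))
        raise ValueError(f"unsupported expression element: {type(node).__name__}")

    return visit(ast.parse(expression, mode="eval"))


# Hypothetical model output for a tape-roll style question.
model_expression = "pi * ((10 / 2)**2 - (5 / 2)**2) / 10000"
print(evaluate_arithmetic(model_expression))
```

The division of labour is the point: the evaluator guarantees the arithmetic, but it cannot tell whether the model set up the right expression in the first place.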
Reliability, determinism, and prompting tricks
- Non-determinism is a concern: same question, different runs, different answers; even when right once, it may later be wrong.
- With temperature 0 and a fixed seed, API calls can be made largely deterministic, which means a wrong answer would be just as repeatable.
- Users report success with “double check” / multi-pass prompts and re-asking the same query to reduce errors, but this increases cost and remains heuristic.
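The "re-ask the same query" trick can be sketched as a simple majority vote over several runs. Here `ask_model` is a hypothetical callable standing in for whatever client is used; the canned answers below are made up for illustration and are not real model output.

```python
from collections import Counter


def majority_answer(ask_model, question, runs=5):
    """Ask the same question several times and keep the most common answer.

    Returns (answer, agreement), where agreement is the fraction of runs
    that produced the winning answer. Cost grows linearly with `runs`.
    """
    answers = [ask_model(question) for _ in range(runs)]
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / runs


# Stand-in for a non-deterministic model: replays made-up answers in order.
canned_answers = iter(["0.0059", "0.0236", "0.0059", "0.0059", "0.0236"])


def fake_model(question):
    return next(canned_answers)


answer, agreement = majority_answer(fake_model, "How thick is the tape?", runs=5)
print(answer, agreement)
```

Agreement is only a heuristic signal: if the model makes the same systematic mistake on most runs, the vote will confidently return the wrong answer.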
Usefulness vs limitations for math and code
- Several commenters distrust LLMs for precise math, advanced topics, or production-grade code; verification effort often matches doing the work yourself.
- Others find them genuinely helpful for:
  - High-level overviews, prerequisites, and orientation in unfamiliar fields.
  - Drafting code / Wolfram Language snippets that are then verified and run by the user.
  - Inspiration and informal “conversation” about mathematical ideas.
Broader reflections on LLMs and AGI
- Commenters debate whether weak math skills disqualify LLMs from being “intelligent,” or whether their general language ability is already transformative.
- Some argue math is “low-hanging fruit” for logical AI; others point to complexity (NP-complete reasoning, undecidability) and limited training data for step-by-step math.
- There is skepticism about calling current systems “AGI,” especially given their lack of stable memory and robust logical reasoning, though a few see them as already “general” in a practical sense.