ChatGPT-4o vs. Math

Math performance on the tape problem

  • GPT‑4o often gets the tape-roll thickness problem wrong, especially when working from the image; the main recurring error is treating the labeled diameters as radii.
  • With text-only plus an explicit chain-of-thought prompt, it solves the problem more reliably, but still not perfectly.
  • Some viewers call the model’s first attempt “impressive but wrong”; others argue that partial credit is irrelevant because the model is a product, not a student.
  • People compare its performance to average humans; opinion is split on whether “better than a random person on the street” is a meaningful bar.
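
To make the diameter/radius mistake concrete, here is a minimal sketch. It assumes the usual setup for this kind of problem: the side of the roll is an annulus whose area equals the unrolled length times the thickness, so thickness = π(R² − r²) / length. The measurements are hypothetical, not the values from the original problem, and the function names are illustrative.

```python
import math

def tape_thickness(outer_diameter, inner_diameter, length):
    """Thickness from the annulus area: pi * (R^2 - r^2) = length * thickness."""
    outer_radius = outer_diameter / 2
    inner_radius = inner_diameter / 2
    side_area = math.pi * (outer_radius**2 - inner_radius**2)
    return side_area / length

def tape_thickness_diameters_as_radii(outer_diameter, inner_diameter, length):
    """The recurring mistake: plugging diameters in where radii belong."""
    side_area = math.pi * (outer_diameter**2 - inner_diameter**2)
    return side_area / length

# Hypothetical measurements in centimetres (assumed for illustration only).
outer_d, inner_d, length = 10.0, 5.0, 10_000.0

correct = tape_thickness(outer_d, inner_d, length)
wrong = tape_thickness_diameters_as_radii(outer_d, inner_d, length)

print(f"correct thickness:   {correct:.5f} cm")
print(f"diameters as radii:  {wrong:.5f} cm")
print(f"error factor:        {wrong / correct:.1f}")
```

Because both radii are doubled and then squared, the mistaken version overestimates the thickness by a factor of four whatever the measurements are.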

Images, multimodality, and prompt engineering

  • Multimodal input often introduces extra failure modes: misreading labels and falling for visual biases (e.g., assuming the “after” image is the better one in UI comparisons).
  • Several commenters find text-only + structured equations (LaTeX, symbolic form) more reliable than mixing in images.
  • Image-specific chain-of-thought (“extract all measurements first, make no assumptions”) improves accuracy somewhat but remains inconsistent.

Statistical vs logical reasoning

  • Many commenters emphasize that LLM reasoning is fundamentally statistical, not logical; it tends to choose probable continuations rather than enforce correctness.
  • This leads to confident but wrong math, and to changing answers when challenged rather than defending a correct result.
  • Some suggest bolting on formal tools (SAT solvers, theorem provers, Python, Wolfram) and using LLMs mainly to translate natural language into formal specs.
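
One way to picture the "bolt on a tool" idea: the model only writes down an arithmetic expression, and ordinary code computes the value. The sketch below assumes the model's output arrives as a plain expression string. The `evaluate` helper and the example string are hypothetical, and it handles only numbers and basic operators, not a real formal specification language.

```python
import ast
import operator

# Operators the evaluator is willing to execute; everything else is rejected.
_BINARY = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
}
_UNARY = {ast.USub: operator.neg, ast.UAdd: operator.pos}

def _eval(node):
    """Recursively evaluate a parsed arithmetic expression."""
    if isinstance(node, ast.Constant):
        value = node.value
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            return value
    elif isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
        return _BINARY[type(node.op)](_eval(node.left), _eval(node.right))
    elif isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY:
        return _UNARY[type(node.op)](_eval(node.operand))
    raise ValueError(f"unsupported expression element: {type(node).__name__}")

def evaluate(expression):
    """Evaluate a plain arithmetic expression without calling eval()."""
    tree = ast.parse(expression, mode="eval")
    return _eval(tree.body)

# Hypothetical string standing in for what a model might emit for the tape problem.
model_expression = "3.141592653589793 * (10**2 - 5**2) / (4 * 10000)"
print(evaluate(model_expression))
```

The arithmetic is then exact and repeatable, but the translation step is still the model's job: if it writes the wrong expression, the tool will faithfully compute the wrong number.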

Reliability, determinism, and prompting tricks

  • Non-determinism is a concern: the same question can yield different answers on different runs, so a model that is right once may be wrong later.
  • With temperature 0 and a fixed seed, API calls can be deterministic, which means a wrong answer would be just as repeatable.
  • Users report success with “double check” / multi-pass prompts and re-asking the same query to reduce errors, but this increases cost and remains heuristic.
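
The re-asking trick can be sketched as a simple majority vote over several runs. Everything here is an assumption for illustration: `ask` stands for whatever function sends a question to a model and returns its final answer as a string, and `make_flaky_model` is a stand-in that simulates run-to-run variation instead of calling a real API.

```python
import random
from collections import Counter

def majority_answer(ask, question, runs=5):
    """Ask the same question several times and return the most common answer.

    Returns the winning answer and the fraction of runs that agreed with it.
    """
    answers = [ask(question) for _ in range(runs)]
    top, votes = Counter(answers).most_common(1)[0]
    return top, votes / runs

def make_flaky_model(seed, p_correct=0.7):
    """Hypothetical stand-in for a model that is right only some of the time."""
    rng = random.Random(seed)

    def ask(question):
        # The two strings are placeholder answers, not real model output.
        return "0.0059 cm" if rng.random() < p_correct else "0.0236 cm"

    return ask

ask = make_flaky_model(seed=0)
answer, agreement = majority_answer(ask, "How thick is the tape?", runs=7)
print(f"majority answer: {answer} (agreement {agreement:.0%})")
```

Each extra run is another paid call, and voting only helps when the correct answer is the most frequent one; a consistently repeated mistake wins the vote just as easily.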

Usefulness vs limitations for math and code

  • Several commenters distrust LLMs for precise math, advanced topics, or production-grade code; verification effort often matches doing the work yourself.
  • Others find them genuinely helpful for:
    • High-level overviews, prerequisites, and orientation in unfamiliar fields.
    • Drafting code / Wolfram Language snippets that are then verified and run by the user.
    • Inspiration and informal “conversation” about mathematical ideas.

Broader reflections on LLMs and AGI

  • Commenters debate whether weak math skills disqualify LLMs from being “intelligent,” or whether their general language ability is already transformative.
  • Some argue math is “low-hanging fruit” for logical AI; others point to complexity (NP-complete reasoning, undecidability) and limited training data for step-by-step math.
  • There is skepticism about calling current systems “AGI,” especially given their lack of stable memory and robust logical reasoning, though a few see them as already “general” in a practical sense.