Extending the context length to 1M tokens

Three-Body Problem demo & long-context claims

  • Several commenters note that the page’s Three-Body Problem summaries contain factual errors and hallucinations, undercutting the long-context “read three books” demo.
  • Some argue this doesn’t really test 1M-token context, since smaller models already know the books from training data.
  • Others highlight that disentangling “using only the provided context” from prior training is ill-defined, since both humans and models inevitably mix prior knowledge with new input.

Do LLMs surpass human intelligence?

  • One side claims modern LLMs already “far surpass” average humans on many cognitive tasks, especially fast reading, summarization, retrieval, cross-domain synthesis, and “napkin math.”
  • Others strongly disagree: speed and scale are not intelligence; models lack robust reasoning, long-term learning, planning, and grounding in the physical world.
  • Comparisons are made to calculators and chainsaws: extremely capable tools, but not “intelligent” in a human sense.

Creativity, novelty, and failure modes

  • Supporters cite personal use cases (trip planning, recipes, branding, lyrics, code sketches, theoretical explorations) as evidence of everyday creativity and cross-domain competence beyond most people.
  • Skeptics respond that:
    • These outputs are recombinations of training data, not evidence of deep understanding.
    • Quality often collapses on non-routine tasks (e.g., implementing a real S3 backend, precise drawing constraints, complex reasoning).
    • Human and LLM failure modes differ: humans may be slow or unskilled, but models confidently hallucinate and struggle to follow exact specifications.
  • There’s recurring debate over whether “hallucination” in LLMs is analogous to human rationalization and memory errors.

Singularity, self-improvement, and sentience

  • Some argue we may already have passed a “technological singularity” in practical capability; others insist the true singularity requires autonomous systems that can design and train significantly better successors.
  • On sentience, commenters note that:
    • We lack clear criteria even for humans.
    • A genuinely sentient AI might be dismissed as “just mimicking patterns,” especially given current safety fine-tuning that suppresses such claims.

Qwen2.5 long-context practicality & availability

  • Local users praise Qwen2.5 Coder but struggle with hardware limits and long-context use in GGUF-based setups.
  • There’s confusion around enabling 128k+ context via configuration flags and YaRN scaling; some report degradation and hallucinations on very long inputs.
  • Needle-in-a-haystack benchmarks showing 100% retrieval at 1M tokens are met with skepticism; individual tests at 32k context already show misses.
  • Commenters note that the 1M-token long-context model does not appear to be downloadable; weights are not clearly released.
  • One nitpick corrects the blog’s “1M tokens ≈ 1M words,” arguing that typical English tokenization yields closer to ~750k words.
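For context on the configuration confusion: in Hugging Face transformers, YaRN scaling is typically enabled through the model’s `config.json`. A minimal sketch, assuming Qwen2.5’s documented `rope_scaling` convention (the keys and values here are illustrative; check the model card for your exact checkpoint):

```json
{
  "rope_scaling": {
    "type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 32768
  }
}
```

In GGUF/llama.cpp setups, the equivalent is passed as runtime flags instead (e.g., `--rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768 -c 131072`), which is one reason commenters report confusion between setups.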
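The needle-in-a-haystack tests commenters reference follow a simple recipe: hide a “needle” fact at a chosen depth in filler text and check whether the model retrieves it. A minimal sketch of the prompt-construction side (the filler text, needle, and scoring here are illustrative, not the benchmark’s actual corpus):

```python
def build_niah_prompt(needle: str, filler: str, total_chars: int, depth: float) -> str:
    """Insert `needle` at a relative depth (0.0 = start, 1.0 = end) into filler text."""
    haystack = (filler * (total_chars // len(filler) + 1))[:total_chars]
    pos = int(len(haystack) * depth)
    return haystack[:pos] + " " + needle + " " + haystack[pos:]

def retrieved(answer: str, secret: str) -> bool:
    """Naive exact-substring check on the model's answer."""
    return secret in answer

# Example: bury a fact halfway through ~2k characters of filler.
prompt = build_niah_prompt(
    needle="The secret code is 7423.",
    filler="The grass is green. ",
    total_chars=2000,
    depth=0.5,
)
```

Real runs sweep context length and depth (e.g., 32k to 1M tokens, ten depths each) and report a retrieval-rate grid, which is where the “100% at 1M vs. misses at 32k” dispute arises.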
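The ~750k-word figure follows from the common rule of thumb of roughly 0.75 English words per token (i.e., about 4/3 tokens per word). A quick sketch of the arithmetic (the ratio is an assumed average; it varies by tokenizer and text):

```python
# Rule-of-thumb ratio for English text: ~1.33 tokens per word (ASSUMED average;
# actual values depend on the tokenizer and the text's vocabulary).
TOKENS_PER_WORD = 4 / 3

def approx_words(n_tokens: int) -> int:
    """Estimate the English word count corresponding to a token count."""
    return round(n_tokens / TOKENS_PER_WORD)

print(approx_words(1_000_000))  # -> 750000, not 1,000,000
```

So “1M tokens” is closer to three long novels’ worth of words than to a literal million.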