Extending the context length to 1M tokens
Three-Body Problem demo & long-context claims
- Several commenters note that the page’s Three-Body Problem summaries contain factual errors and hallucinations, undercutting the long-context “read three books” demo.
- Some argue this doesn’t really test 1M-token context, since smaller models already know the books from training data.
- Others highlight that disentangling “using only the provided context” from prior training is ill-defined, since both humans and models inevitably mix prior knowledge with new input.
Do LLMs surpass human intelligence?
- One side claims modern LLMs already “far surpass” average humans on many cognitive tasks, especially fast reading, summarization, retrieval, cross-domain synthesis, and “napkin math.”
- Others strongly disagree: speed and scale are not intelligence; models lack robust reasoning, long-term learning, planning, and grounding in the physical world.
- Comparisons are made to calculators and chainsaws: extremely capable tools, but not “intelligent” in a human sense.
Creativity, novelty, and failure modes
- Supporters cite personal use cases (trip planning, recipes, branding, lyrics, code sketches, theoretical explorations) as evidence of everyday creativity and cross-domain competence beyond most people.
- Skeptics respond that:
  - These outputs are recombinations of training data, not evidence of deep understanding.
  - Quality often collapses on non-routine tasks (e.g., implementing a real S3 backend, precise drawing constraints, complex reasoning).
  - Human and LLM failure modes differ: humans may be slow or unskilled, but models confidently hallucinate and struggle to follow exact specifications.
- There’s recurring debate over whether “hallucination” in LLMs is analogous to human rationalization and memory errors.
Singularity, self-improvement, and sentience
- Some argue we may already have passed a “technological singularity” in practical capability; others insist the true singularity requires autonomous systems that can design and train significantly better successors.
- On sentience, commenters note that:
  - We lack clear criteria even for humans.
  - A genuinely sentient AI might be dismissed as “just mimicking patterns,” especially given current safety fine-tuning that suppresses such claims.
Qwen 2.5 long-context practicality & availability
- Local users praise Qwen2.5 Coder but struggle with hardware limits and long-context use in GGUF-based setups.
- There’s confusion around enabling 128k+ context via configuration flags and YaRN scaling; some report degradation and hallucinations on very long inputs.
- Needle-in-a-haystack benchmarks showing 100% retrieval at 1M tokens are met with skepticism; individual tests at 32k context already show misses.
- Commenters note that the 1M-token long-context model does not appear to be downloadable; weights are not clearly released.
- One nitpick corrects the blog’s “1M tokens ≈ 1M words”: at the roughly 0.75 English words per token typical of common tokenizers, 1M tokens works out to closer to ~750k words.
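For context on the configuration-flag confusion above: the Qwen2.5 release notes describe extending context beyond the default 32k by adding a `rope_scaling` block to the model’s `config.json`. A minimal sketch, assuming the YaRN approach the commenters mention (the `factor` value here is illustrative and should be checked against the official model card):

```json
{
  "rope_scaling": {
    "type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 32768
  }
}
```

One caveat consistent with the degradation reports in the thread: this style of static YaRN scaling is applied uniformly regardless of input length, so it can affect quality on shorter inputs as well, which is why some setups only enable it when long context is actually needed.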