Extending the context length to 1M tokens

Three-Body Problem demo & long-context claims

  • Several commenters note that the page’s Three-Body Problem summaries contain factual errors and hallucinations, undercutting the long-context “read three books” demo.
  • Some argue this doesn’t really test 1M-token context, since smaller models already know the books from training data.
  • Others highlight that disentangling “using only the provided context” from prior training is ill-defined, since both humans and models inevitably mix prior knowledge with new input.

Do LLMs surpass human intelligence?

  • One side claims modern LLMs already “far surpass” average humans on many cognitive tasks, especially fast reading, summarization, retrieval, cross-domain synthesis, and “napkin math.”
  • Others strongly disagree: speed and scale are not intelligence; models lack robust reasoning, long-term learning, planning, and grounding in the physical world.
  • Comparisons are made to calculators and chainsaws: extremely capable tools, but not “intelligent” in a human sense.

Creativity, novelty, and failure modes

  • Supporters cite personal use cases (trip planning, recipes, branding, lyrics, code sketches, theoretical explorations) as evidence of everyday creativity and cross-domain competence beyond most people.
  • Skeptics respond that:
    • These outputs are recombinations of training data, not evidence of deep understanding.
    • Quality often collapses on non-routine tasks (e.g., implementing a real S3 backend, precise drawing constraints, complex reasoning).
    • Human and LLM failure modes differ: humans may be slow or unskilled, but models confidently hallucinate and struggle to follow exact specifications.
  • There’s recurring debate over whether “hallucination” in LLMs is analogous to human rationalization and memory errors.

Singularity, self-improvement, and sentience

  • Some argue we may already have passed a “technological singularity” in practical capability; others insist the true singularity requires autonomous systems that can design and train significantly better successors.
  • On sentience, commenters note that:
    • We lack clear criteria even for humans.
    • A genuinely sentient AI might be dismissed as “just mimicking patterns,” especially given current safety fine-tuning that suppresses such claims.

Qwen2.5 long-context practicality & availability

  • Local users praise Qwen2.5 Coder but struggle with hardware limits and long-context use in GGUF-based setups.
  • There’s confusion around enabling 128k+ context via configuration flags and YaRN scaling; some report degradation and hallucinations on very long inputs.
  • Needle-in-a-haystack benchmarks showing 100% retrieval at 1M tokens are met with skepticism; individual tests at 32k context already show misses.
  • Commenters note that the 1M-token long-context model does not appear to be downloadable; weights are not clearly released.
  • One nitpick corrects the blog’s “1M tokens ≈ 1M words,” arguing that typical English tokenization yields closer to ~750k words.
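For context on the configuration confusion: in Hugging Face transformers, YaRN scaling is typically enabled through the model’s `config.json`. A minimal sketch, assuming Qwen2.5’s documented `rope_scaling` convention (the keys and values here are illustrative; check the model card for your exact checkpoint):

```json
{
  "rope_scaling": {
    "type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 32768
  }
}
```

In GGUF/llama.cpp setups, the equivalent is passed as runtime flags instead (e.g., `--rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768 -c 131072`), which is one reason commenters report confusion between setups.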
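The needle-in-a-haystack tests commenters reference follow a simple recipe: hide a “needle” fact at a chosen depth in filler text and check whether the model retrieves it. A minimal sketch of the prompt-construction side (the filler text, needle, and scoring here are illustrative, not the benchmark’s actual corpus):

```python
def build_niah_prompt(needle: str, filler: str, total_chars: int, depth: float) -> str:
    """Insert `needle` at a relative depth (0.0 = start, 1.0 = end) into filler text."""
    haystack = (filler * (total_chars // len(filler) + 1))[:total_chars]
    pos = int(len(haystack) * depth)
    return haystack[:pos] + " " + needle + " " + haystack[pos:]

def retrieved(answer: str, secret: str) -> bool:
    """Naive exact-substring check on the model's answer."""
    return secret in answer

# Example: bury a fact halfway through ~2k characters of filler.
prompt = build_niah_prompt(
    needle="The secret code is 7423.",
    filler="The grass is green. ",
    total_chars=2000,
    depth=0.5,
)
```

Real runs sweep context length and depth (e.g., 32k to 1M tokens, ten depths each) and report a retrieval-rate grid, which is where the “100% at 1M vs. misses at 32k” dispute arises.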
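The ~750k-word figure follows from the common rule of thumb of roughly 0.75 English words per token (i.e., about 4/3 tokens per word). A quick sketch of the arithmetic (the ratio is an assumed average; it varies by tokenizer and text):

```python
# Rule-of-thumb ratio for English text: ~1.33 tokens per word (ASSUMED average;
# actual values depend on the tokenizer and the text's vocabulary).
TOKENS_PER_WORD = 4 / 3

def approx_words(n_tokens: int) -> int:
    """Estimate the English word count corresponding to a token count."""
    return round(n_tokens / TOKENS_PER_WORD)

print(approx_words(1_000_000))  # -> 750000, not 1,000,000
```

So “1M tokens” is closer to three long novels’ worth of words than to a literal million.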