Qwen2.5-1M: Deploy your own Qwen with context length up to 1M tokens
Running Qwen2.5-1M Locally (Mac, MLX, GGUF, CPU)
- People are experimenting with running very long prompts (hundreds of thousands of tokens) on Macs, especially M3/M4 Max with 64–128GB unified memory.
- One report: a ~446k-token query over a Rust/TypeScript codebase on an M4 Max took ~4 hours and returned a seemingly reasonable answer.
- MLX 4-bit variants exist for macOS, but current MLX doesn’t yet support the dual-chunk attention mechanism used for full 1M-token context.
- Some consider trying 1M-token prompts on large-RAM CPU servers, but expect it to be extremely slow.
Memory, Context Length, and KV Cache
- Long context is dominated by KV cache memory, which scales with sequence length; 1M tokens requires “obscene” amounts of RAM/VRAM.
- Official guidance:
  - Qwen2.5-7B-Instruct-1M: ~120GB VRAM for full 1M context.
  - Qwen2.5-14B-Instruct-1M: ~320GB VRAM.
- KV cache quantization (e.g., 4-bit) can cut cache memory to ~¼, at the cost of quality.
- A comparison table for another model shows KV cache memory scaling with precision as well as context length: at a 200k context, ~27.5GB at 4-bit vs ~109.8GB at 16-bit (the expected ~4× ratio).
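The KV-cache-dominated scaling above can be sketched with simple arithmetic. The estimator below computes KV cache size only (the official VRAM figures also include model weights and activations), using the published Qwen2.5-7B attention shape (28 layers, 4 KV heads under grouped-query attention, head dimension 128) as an assumed configuration:

```python
def kv_cache_bytes(tokens, layers=28, kv_heads=4, head_dim=128, bits=16):
    """Estimate KV cache size: one K and one V tensor per layer per token."""
    return 2 * layers * kv_heads * head_dim * tokens * bits // 8

if __name__ == "__main__":
    gib = 1024 ** 3
    # Quantizing the cache from 16-bit to 4-bit cuts memory to ~1/4,
    # matching the discussion's claim.
    for bits in (16, 8, 4):
        size = kv_cache_bytes(1_000_000, bits=bits)
        print(f"{bits:2d}-bit KV cache @ 1M tokens: {size / gib:.1f} GiB")
```

Note that the KV cache alone is well under the official ~120GB figure for the 7B model; the remainder goes to weights, activations, and framework overhead.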
Ollama Defaults and Context Configuration
- Ollama’s `num_ctx` defaults to 2k and is widely seen as a “foot-gun”: when the prompt exceeds it, leading tokens are silently discarded.
- Users must explicitly set `num_ctx` higher or save a model variant with increased context. Documentation and behavior are criticized as confusing.
- Plugins and tools (e.g., files-to-prompt integrations) support passing `num_ctx`, but users often misinterpret it as output length.
Hardware Choices and Cost/Accessibility
- Macs with large unified memory are attractive for big-context local inference, but RAM configurations are expensive and tightly coupled to CPU tiers.
- Some argue multi-GPU x86 builds (e.g., multiple 3090s) offer better raw compute per dollar, but they lack the large unified memory of Apple Silicon.
- Debate over class/access: high-RAM Macs and large GPU rigs are seen as increasingly out of reach for many, shifting experimentation back toward the well-funded.
Actual Usefulness of Huge Context Windows
- Multiple reports say models often degrade beyond ~25–32k tokens for coding and other precise tasks: loss of instruction-following, missed files already in context, poor recall.
- Others counter that 1M–2M context in some services works well for high-level overviews or summarization of large codebases or corpora.
- Overall sentiment: large context is promising but unreliable for complex, fine-grained tasks; retrieval quality and “lost in the middle” remain major issues.
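The recall complaints above are what “needle in a haystack” probes try to measure. Below is a minimal sketch of building such a probe: it buries a known sentence at a chosen relative depth in filler text, so one can ask the model for the buried fact and score recall at different depths and lengths. The filler sentence and needle are hypothetical, and word count is used as a crude stand-in for token count (a real harness would use the model's tokenizer):

```python
def build_haystack(needle, n_words=100_000, depth=0.5):
    """Bury a 'needle' sentence at a relative depth inside filler text.

    depth=0.0 places it at the start, 0.5 in the middle (where
    'lost in the middle' degradation is reported), 1.0 at the end.
    """
    filler = "The sky was clear that day and nothing notable happened."
    words = []
    while len(words) < n_words:
        words.extend(filler.split())
    insert_at = int(len(words) * depth)
    words[insert_at:insert_at] = needle.split()
    return " ".join(words)

prompt = build_haystack("The secret number is 7481.", n_words=1000, depth=0.5)
# A harness would now ask: "What is the secret number?" and score the answer.
```

Commenters' point stands that passing such probes is necessary but not sufficient: retrieving one planted fact is far easier than the multi-file reasoning coding tasks where degradation is reported.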
Benchmarks, Long-Context Hype, and Limits
- Skepticism about “nearly perfect” long-context claims: detailed tables show significantly less than 100% on complex tasks and often only up to 128k, not full 1M.
- Long context is distinguished from generation length; several commenters note that output length across turns is still a hard, unresolved problem.