Qwen2.5-1M: Deploy your own Qwen with context length up to 1M tokens
Running Qwen2.5-1M Locally (Mac, MLX, GGUF, CPU)
- People are experimenting with running very long prompts (hundreds of thousands of tokens) on Macs, especially M3/M4 Max with 64–128GB unified memory.
- One report: a ~446k-token query over a Rust/TypeScript codebase on an M4 Max took ~4 hours and returned a seemingly reasonable answer.
- MLX 4-bit variants exist for macOS, but current MLX doesn’t yet support the dual-chunk attention mechanism used for full 1M-token context.
- Some consider trying 1M-token prompts on large-RAM CPU servers, but expect it to be extremely slow.
Memory, Context Length, and KV Cache
- Long context is dominated by KV cache memory, which scales with sequence length; 1M tokens requires “obscene” amounts of RAM/VRAM.
- Official guidance:
  - Qwen2.5-7B-Instruct-1M: ~120GB VRAM for full 1M context.
  - Qwen2.5-14B-Instruct-1M: ~320GB VRAM.
- KV cache quantization (e.g., 4-bit) can cut cache memory to ~¼, at the cost of quality.
- A comparison table for another model shows KV cache memory scaling with precision as well as context length: at a 200k context, ~27.5GB at 4-bit vs ~109.8GB at 16-bit (the expected ~4× ratio).
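The KV-cache-dominated scaling above can be sketched with simple arithmetic. The estimator below computes KV cache size only (the official VRAM figures also include model weights and activations), using the published Qwen2.5-7B attention shape (28 layers, 4 KV heads under grouped-query attention, head dimension 128) as an assumed configuration:

```python
def kv_cache_bytes(tokens, layers=28, kv_heads=4, head_dim=128, bits=16):
    """Estimate KV cache size: one K and one V tensor per layer per token."""
    return 2 * layers * kv_heads * head_dim * tokens * bits // 8

if __name__ == "__main__":
    gib = 1024 ** 3
    # Quantizing the cache from 16-bit to 4-bit cuts memory to ~1/4,
    # matching the discussion's claim.
    for bits in (16, 8, 4):
        size = kv_cache_bytes(1_000_000, bits=bits)
        print(f"{bits:2d}-bit KV cache @ 1M tokens: {size / gib:.1f} GiB")
```

Note that the KV cache alone is well under the official ~120GB figure for the 7B model; the remainder goes to weights, activations, and framework overhead.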
Ollama Defaults and Context Configuration
- Ollama’s `num_ctx` defaults to 2k and is widely seen as a “foot-gun”: when the prompt exceeds it, leading tokens are silently discarded.
- Users must explicitly set `num_ctx` higher or save a model variant with increased context. Documentation and behavior are criticized as confusing.
- Plugins and tools (e.g., files-to-prompt integrations) support passing `num_ctx`, but users often misinterpret it as output length.
Hardware Choices and Cost/Accessibility
- Macs with large unified memory are attractive for big-context local inference, but RAM configurations are expensive and tightly coupled to CPU tiers.
- Some argue multi-GPU x86 builds (e.g., multiple 3090s) offer better raw compute per dollar, but they lack the large unified memory of Apple Silicon.
- Debate over class/access: high-RAM Macs and large GPU rigs are seen as increasingly out of reach for many, shifting experimentation back toward the well-funded.
Actual Usefulness of Huge Context Windows
- Multiple reports say models often degrade beyond ~25–32k tokens for coding and other precise tasks: loss of instruction-following, missed files already in context, poor recall.
- Others counter that 1M–2M context in some services works well for high-level overviews or summarization of large codebases or corpora.
- Overall sentiment: large context is promising but unreliable for complex, fine-grained tasks; retrieval quality and “lost in the middle” remain major issues.
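The recall complaints above are what “needle in a haystack” probes try to measure. Below is a minimal sketch of building such a probe: it buries a known sentence at a chosen relative depth in filler text, so one can ask the model for the buried fact and score recall at different depths and lengths. The filler sentence and needle are hypothetical, and word count is used as a crude stand-in for token count (a real harness would use the model's tokenizer):

```python
def build_haystack(needle, n_words=100_000, depth=0.5):
    """Bury a 'needle' sentence at a relative depth inside filler text.

    depth=0.0 places it at the start, 0.5 in the middle (where
    'lost in the middle' degradation is reported), 1.0 at the end.
    """
    filler = "The sky was clear that day and nothing notable happened."
    words = []
    while len(words) < n_words:
        words.extend(filler.split())
    insert_at = int(len(words) * depth)
    words[insert_at:insert_at] = needle.split()
    return " ".join(words)

prompt = build_haystack("The secret number is 7481.", n_words=1000, depth=0.5)
# A harness would now ask: "What is the secret number?" and score the answer.
```

Commenters' point stands that passing such probes is necessary but not sufficient: retrieving one planted fact is far easier than the multi-file reasoning coding tasks where degradation is reported.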
Benchmarks, Long-Context Hype, and Limits
- Skepticism about “nearly perfect” long-context claims: detailed tables show significantly less than 100% on complex tasks and often only up to 128k, not full 1M.
- Long context is distinguished from generation length; several commenters note that output length across turns is still a hard, unresolved problem.