Qwen2.5-1M: Deploy your own Qwen with context length up to 1M tokens

Running Qwen2.5-1M Locally (Mac, MLX, GGUF, CPU)

  • People are experimenting with running very long prompts (hundreds of thousands of tokens) on Macs, especially M3/M4 Max with 64–128GB unified memory.
  • One report: querying a ~446k-token Rust/TypeScript codebase on an M4 Max took ~4 hours and returned a seemingly reasonable answer.
  • MLX 4-bit variants exist for macOS, but current MLX doesn’t yet support the dual-chunk attention mechanism used for full 1M-token context.
  • Some consider trying 1M-token prompts on large-RAM CPU servers, but expect it to be extremely slow.

Memory, Context Length, and KV Cache

  • Long context is dominated by KV cache memory, which scales with sequence length; 1M tokens requires “obscene” amounts of RAM/VRAM.
  • Official guidance:
    • Qwen2.5-7B-Instruct-1M: ~120GB VRAM for full 1M context.
    • Qwen2.5-14B-Instruct-1M: ~320GB VRAM.
  • KV cache quantization (e.g., 4-bit) can cut cache memory to ~¼, at the cost of quality.
  • A comparison table for another model shows KV-cache memory rising steeply: e.g., at 200k tokens of context, ~27.5GB with a 4-bit cache vs. ~109.8GB at 16-bit.
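The scaling above follows from the KV cache being linear in sequence length: each token stores a key and a value vector per layer. A back-of-the-envelope estimator, where the model dimensions used in the example are illustrative assumptions rather than official Qwen2.5-1M config values:

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    """Per-sequence KV cache size: a K and a V tensor per layer,
    each of shape (n_kv_heads, seq_len, head_dim)."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len

GIB = 1024 ** 3

# Hypothetical GQA model: 28 layers, 4 KV heads, head_dim 128 (assumed values)
fp16 = kv_cache_bytes(1_000_000, n_layers=28, n_kv_heads=4,
                      head_dim=128, bytes_per_elem=2)
q4 = kv_cache_bytes(1_000_000, n_layers=28, n_kv_heads=4,
                    head_dim=128, bytes_per_elem=0.5)  # ~4-bit quantized cache

print(f"fp16 cache @ 1M tokens: {fp16 / GIB:.1f} GiB")   # ~53 GiB
print(f"4-bit cache @ 1M tokens: {q4 / GIB:.1f} GiB")    # ~1/4 of fp16
```

The cache is strictly linear in `seq_len`, so halving context halves cache memory; actual VRAM figures (like the official 120GB/320GB guidance) additionally include model weights, activations, and framework overhead.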

Ollama Defaults and Context Configuration

  • Ollama’s num_ctx defaults to 2k and is widely seen as a foot-gun: when a prompt exceeds it, Ollama silently truncates, dropping the leading tokens.
  • Users must explicitly set num_ctx higher or save a model variant with increased context. Documentation and behavior are criticized as confusing.
  • Plugins and tools (e.g., files-to-prompt integrations) support passing num_ctx, but users often confuse it with the maximum output length (num_predict in Ollama’s options).
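The "save a model variant" workaround mentioned above is typically done with a Modelfile; a minimal sketch (the base model tag, variant name, and context value here are illustrative):

```
# Modelfile — build the variant with: ollama create qwen2.5-64k -f Modelfile
FROM qwen2.5:14b
PARAMETER num_ctx 65536
```

Alternatively, num_ctx can be passed per-request in the `options` object of Ollama's REST API, which avoids baking a value into a saved variant; either way, the setting must cover the prompt plus the expected output.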

Hardware Choices and Cost/Accessibility

  • Macs with large unified memory are attractive for big-context local inference, but RAM configurations are expensive and tightly coupled to CPU tiers.
  • Some argue multi-GPU x86 builds (e.g., multiple 3090s) offer better raw compute per dollar, but they lack the large unified memory of Apple Silicon.
  • Debate over cost and access: high-RAM Macs and large GPU rigs are seen as increasingly out of reach for many, shifting experimentation back toward the well-funded.

Actual Usefulness of Huge Context Windows

  • Multiple reports say models often degrade beyond ~25–32k tokens for coding and other precise tasks: loss of instruction-following, missed files already in context, poor recall.
  • Others counter that 1M–2M context in some services works well for high-level overviews or summarization of large codebases or corpora.
  • Overall sentiment: large context is promising but unreliable for complex, fine-grained tasks; retrieval quality and “lost in the middle” remain major issues.

Benchmarks, Long-Context Hype, and Limits

  • Skepticism about “nearly perfect” long-context claims: detailed tables show scores well below 100% on complex tasks, and often report results only up to 128k, not the full 1M.
  • Long context is distinguished from generation length; several commenters note that output length across turns is still a hard, unresolved problem.