Qwen2 LLM Released

Tiny and Small Models (0.5B–3.8B)

  • The 0.5B Qwen2 model (32k context) is seen as interesting mainly as a base for fine-tuning or embeddings, not as a strong out-of-the-box chat model.
  • Opinions diverge: some call sub-500M models “pretty much useless” for summarization; others report they work well when fine-tuned on classic NLP tasks (classification, labeling), potentially replacing BERT/RoBERTa/BART-style models.
  • Suggested uses: speculative decoding to speed up larger models; predictive keyboards; text completion; compression; OCR/speech disambiguation, where imperfect “hinting” is acceptable.
  • Several note that summarization, especially over long context, is hard even for larger models.
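
To make the speculative-decoding suggestion concrete, here is a minimal sketch of the greedy variant: a small draft model proposes a few tokens, and the large target model checks them and keeps the longest agreeing prefix. The two "models" (`toy_target`, `toy_draft`) are hypothetical stand-ins written for this illustration, not Qwen2 APIs. A real system would verify the whole proposal in one batched forward pass and, when sampling, use a probabilistic accept/reject rule instead of exact matching.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy next-token function


def greedy_decode(model: NextToken, prompt: List[Token], max_new: int) -> List[Token]:
    """Baseline: one model call per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(model(out))
    return out


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[Token],
                       max_new: int, k: int = 4) -> Tuple[List[Token], int]:
    """Greedy speculative decoding. Returns (tokens, number of verification rounds)."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < max_new:
        # 1. The cheap draft model proposes k tokens autoregressively.
        proposal: List[Token] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2. The target checks every proposed position (one round).
        rounds += 1
        accepted: List[Token] = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # first mismatch: keep the target's token
                break
        else:
            # Every proposed token matched, so the target contributes one extra token.
            accepted.append(target(out + accepted))
        out.extend(accepted)
    return out[:len(prompt) + max_new], rounds


# Hypothetical toy models: the target counts upward mod 10;
# the draft agrees except when the last token is 4.
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] % 5 == 4 else (ctx[-1] + 1) % 10


baseline = greedy_decode(toy_target, [0], 12)
spec_tokens, spec_rounds = speculative_decode(toy_target, toy_draft, [0], 12, k=3)
```

In this greedy form the output is identical to what the target would produce alone; the saving comes from needing fewer target rounds than generated tokens, which is why an imperfect small model is acceptable as the draft.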

Practical Use Cases for Small LLMs

  • Emphasis on on-device, background automation rather than chat:
    • Meeting transcription → summaries, key topics, action items, speaker attribution.
    • Notification and note summarization, auto-titles, tag suggestions, context-aware quick replies.
    • In-browser data extraction (e.g., job postings into structured fields) with larger models orchestrating smaller ones.

Performance, Benchmarks, and Comparisons

  • Qwen2-72B is reported (by its authors) to outperform Llama 3 70B on many benchmarks; some call this plausible, others distrust self-reported numbers and prefer community leaderboards (e.g., LMSYS Chatbot Arena).
  • The thread references newer benchmarks (MMLU-Pro, MixEval, Arena Hard, LiveCodeBench) intended to address saturation and overfitting in older tests.
  • Debate over whether progress is plateauing: some say compute is the limiting factor; others point to unreleased larger models and continuing gains.
  • Qwen2 MoE (57B total parameters, ~14B active per token) is seen as a strong “middle-size” option; comparisons drawn to Mixtral and Yi.

Licensing and “Open Source” Debate

  • Praise for Apache 2.0 licensing on most Qwen2 models; 72B uses an older, more restrictive license but is still considered relatively permissive.
  • Heated debate over calling such models “open source”:
    • One side: models with Apache 2.0 weights are “open source” even if training data is closed.
    • Other side: without open training data/recipe, these are “open weights” or “freeware,” not true open source.
  • Some argue that open weights are still highly valuable for fine-tuning, interpretability, and model merging, even without full data transparency.

Censorship, Alignment, and Safety

  • Users report errors or dropped responses when asking about Tiananmen Square and Chinese politics in hosted demos.
  • Others note that local runs of the 7B model can answer these topics, suggesting censorship or instability in the online service rather than in the raw weights.
  • Alignment around political topics appears inconsistent: sometimes refusals, sometimes partial or contradictory answers.

Training Infrastructure and Data Practices

  • Curiosity about how Chinese companies train large models under GPU export restrictions; speculation includes legacy Nvidia GPUs, domestic accelerators (e.g., Huawei Ascend), and foreign data centers.
  • Commenters note that training pipelines often upweight certain data sources (e.g., internal emails, Wikipedia) by sampling them more frequently during training, rather than by giving them any “priority” at inference time.
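
The upweighting point can be illustrated with a small sketch. It assumes each source has a document count and a multiplier, and that a source's sampling probability is proportional to the product of the two. The source names, sizes, and multipliers are hypothetical, chosen only to keep the arithmetic easy to follow; they do not describe any real training recipe.

```python
import random
from typing import Dict, List


def mixture_probs(sizes: Dict[str, int], multipliers: Dict[str, float]) -> Dict[str, float]:
    """Sampling probability per source, proportional to size * multiplier."""
    raw = {name: sizes[name] * multipliers.get(name, 1.0) for name in sizes}
    total = sum(raw.values())
    return {name: value / total for name, value in raw.items()}


def sample_batch(probs: Dict[str, float], batch_size: int, seed: int = 0) -> List[str]:
    """Draw the source of each training example in a batch."""
    rng = random.Random(seed)
    names = list(probs)
    return rng.choices(names, weights=[probs[n] for n in names], k=batch_size)


# Hypothetical corpus: Wikipedia is 10% of the documents but is upweighted 3x.
sizes = {"web_crawl": 900, "wikipedia": 100}
multipliers = {"web_crawl": 1.0, "wikipedia": 3.0}

probs = mixture_probs(sizes, multipliers)
batch_sources = sample_batch(probs, batch_size=8)
```

With these numbers Wikipedia's share of sampled examples rises from its natural 10% to 25%. The weighting only changes how often the model sees each source during training; nothing in the trained model looks up or prefers a source at inference time.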

Model Proliferation and Architecture

  • Some complain that many new LLMs are “the same thing” without architectural novelty, likening the situation to Linux distro fragmentation.
  • Others counter that differences in architecture (e.g., GQA, MoE, context length) and licensing meaningfully expand options and are part of normal scientific/engineering iteration.