2024-06-06

Qwen2 LLM Released

Tiny and Small Models (0.5B–3.8B)

The 0.5B Qwen2 model with 32k context is seen as interesting mainly as a finetuning / embedding base, not as a strong out-of-the-box chat model.
Opinions diverge: some call sub-500M models “pretty much useless” for summarization; others report they work well when fine-tuned on classic NLP tasks (classification, labeling), potentially replacing BERT/RoBERTa/BART-style models.
Suggested uses: speculative decoding to speed larger models; predictive keyboards; text completion; compression; OCR/speech disambiguation, where imperfect “hinting” is acceptable.
Several note that summarization, especially over long context, is hard even for larger models.
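
As a rough illustration of the speculative decoding use mentioned above, here is a minimal sketch of the greedy variant: a cheap draft model proposes a few tokens, the larger target model checks them, and the longest agreeing run is kept. The toy "models" and all function names are hypothetical stand-ins, not Qwen2 APIs, and real implementations verify all proposed positions in one batched forward pass and usually apply a probabilistic acceptance rule rather than exact matching.

```python
from typing import Callable, List

# A "model" here is just a greedy next-token function (an assumption for the sketch).
NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft: NextToken,
                     target: NextToken, k: int = 4) -> List[int]:
    """Run one round: draft proposes k tokens, target verifies them."""
    # 1. The small draft model proposes k tokens autoregressively.
    proposal: List[int] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft(ctx)
        proposal.append(tok)
        ctx.append(tok)

    # 2. The target checks each position (a loop here; batched in practice).
    accepted: List[int] = []
    ctx = list(prefix)
    for tok in proposal:
        expected = target(ctx)
        if expected != tok:
            # First mismatch: keep the target's token and stop this round.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3. Every proposal matched, so the target adds one extra token.
    accepted.append(target(ctx))
    return accepted


def generate(prefix: List[int], draft: NextToken, target: NextToken,
             max_new: int, k: int = 4) -> List[int]:
    """Decode max_new tokens; output equals plain greedy target decoding."""
    out = list(prefix)
    while len(out) - len(prefix) < max_new:
        out.extend(speculative_step(out, draft, target, k))
    return out[:len(prefix) + max_new]


# Toy models: the target counts upward mod 10; the draft errs after a 4.
def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


print(generate([0], toy_draft, toy_target, max_new=12, k=3))
```

Because a mismatch is always replaced by the target's own token, an imperfect draft only costs speed, not output quality, which is why "hinting" from a weak model is acceptable in this setting.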

Practical Use Cases for Small LLMs

Emphasis on on-device, background automation rather than chat:
- Meeting transcription → summaries, key topics, action items, speaker attribution.
- Notification and note summarization, auto-titles, tag suggestions, context-aware quick replies.
- In-browser data extraction (e.g., job postings into structured fields) with larger models orchestrating smaller ones.

Performance, Benchmarks, and Comparisons

Qwen2-72B is reported (by its authors) to outperform Llama 3 70B on many benchmarks; some call this plausible, others distrust self-reported numbers and prefer community leaderboards (e.g., LMSYS Chatbot Arena).
Thread references newer benchmarks (MMLU-Pro, MixEval, Arena Hard, LiveCodeBench) to address saturation/overfitting in older tests.
Debate over whether progress is plateauing: some say compute is the limiting factor; others point to unreleased larger models and continuing gains.
Qwen2 MoE (57B total parameters, ~14B active per token) is seen as a strong “middle-size” option; comparisons drawn to Mixtral and Yi.
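
The gap between total and active parameters in an MoE model comes from routing: each token is sent to only a few experts. The sketch below shows generic top-k routing and the resulting parameter arithmetic. The function names and the numbers used are illustrative assumptions, not Qwen2's actual configuration.

```python
import math
from typing import List, Sequence, Tuple


def top_k_route(scores: Sequence[float], k: int) -> List[Tuple[int, float]]:
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    peak = max(scores[i] for i in chosen)  # subtract the max for numerical stability
    exps = [math.exp(scores[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


def param_counts(shared: float, per_expert: float,
                 n_experts: int, k: int) -> Tuple[float, float]:
    """Return (total, active-per-token) parameters, in any consistent unit."""
    total = shared + n_experts * per_expert
    active = shared + k * per_expert
    return total, active


# Hypothetical router scores for one token over four experts.
print(top_k_route([0.1, 2.0, -1.0, 2.0], k=2))

# Hypothetical sizes (in billions): 2B shared, 8 experts of 1B each, 2 used per token.
print(param_counts(shared=2.0, per_expert=1.0, n_experts=8, k=2))
```

All experts must still be held in memory, so the total count drives memory needs while the active count drives per-token compute.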

Licensing and “Open Source” Debate

Praise for Apache 2.0 licensing on most Qwen2 models; 72B uses an older, more restrictive license but is still considered relatively permissive.
Heated debate over calling such models “open source”:
- One side: models with Apache 2.0 weights are “open source” even if training data is closed.
- Other side: without open training data/recipe, these are “open weights” or “freeware,” not true open source.
Some argue that open weights are still highly valuable for fine-tuning, interpretability, and model merging, even without full data transparency.

Censorship, Alignment, and Safety

Users report errors or dropped responses when asking about Tiananmen Square and Chinese politics in hosted demos.
Others note that local runs of the 7B model can answer these topics, suggesting censorship or instability in the online service rather than in the raw weights.
Alignment around political topics appears inconsistent: sometimes refusals, sometimes partial or contradictory answers.

Training Infrastructure and Data Practices

Curiosity about how Chinese companies train large models under GPU export restrictions; speculation includes legacy Nvidia GPUs, domestic accelerators (e.g., Huawei Ascend), and foreign data centers.
It is noted that training pipelines often upweight certain data sources (e.g., internal emails, Wikipedia) by sampling them more frequently during training, rather than by assigning them any “priority” at inference time.
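
Upweighting by sampling frequency can be sketched as follows: each source's chance of being drawn is its size times a multiplier, normalized over all sources. The source names, sizes, and multipliers are made up for illustration and do not describe any real training mixture.

```python
import random
from collections import Counter
from typing import Dict, List


def mixture_probs(sizes: Dict[str, float],
                  multipliers: Dict[str, float]) -> Dict[str, float]:
    """Probability of drawing from each source: size * multiplier, normalized."""
    raw = {s: sizes[s] * multipliers.get(s, 1.0) for s in sizes}
    total = sum(raw.values())
    return {s: v / total for s, v in raw.items()}


def sample_sources(probs: Dict[str, float], n: int, seed: int = 0) -> List[str]:
    """Draw n source names according to the mixture probabilities."""
    rng = random.Random(seed)
    names = list(probs)
    return rng.choices(names, weights=[probs[s] for s in names], k=n)


# Hypothetical corpus: Wikipedia is 10% of the raw data but sampled 3x as often.
sizes = {"web_crawl": 900.0, "wikipedia": 100.0}
probs = mixture_probs(sizes, {"wikipedia": 3.0})
print(probs)  # wikipedia rises from a 0.10 raw share to 0.25

counts = Counter(sample_sources(probs, n=10_000))
print(counts["wikipedia"] / 10_000)
```

The weighting only changes how often the model sees each source while training; nothing in the trained weights marks one source as preferred when the model answers.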

Model Proliferation and Architecture

Some complain that many new LLMs are “the same thing” without architectural novelty, likening the situation to Linux distro fragmentation.
Others counter that differences in architecture (e.g., GQA, MoE, context length) and licensing meaningfully expand options and are part of normal scientific/engineering iteration.
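
One concrete way an architectural choice like GQA matters is KV-cache memory at long context: the cache scales with the number of key/value heads, which GQA reduces by sharing them across query heads. The arithmetic below is a sketch with a hypothetical model shape, not the configuration of any Qwen2 model.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2, batch: int = 1) -> int:
    """Bytes needed to cache keys and values for one batch of sequences."""
    # The leading 2 accounts for storing both a key and a value per position.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem * batch


GIB = 2 ** 30

# Hypothetical shape: 32 layers, 32 query heads of dim 128, 32k tokens, fp16.
mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=32768)
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=32768)

print(f"MHA (32 KV heads): {mha / GIB:.0f} GiB")
print(f"GQA (8 KV heads):  {gqa / GIB:.0f} GiB")
```

With these assumed numbers, sharing each KV head across four query heads cuts the cache to a quarter, which is the kind of difference that decides whether a long-context model fits on a given GPU.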