Phi-4: Microsoft's Newest Small Language Model Specializing in Complex Reasoning
Local LLMs on Consumer Hardware
- Several commenters say we are already at or near GPT‑3.5 / early GPT‑4 capability on consumer machines.
- 8–14B models (e.g., Llama 3.1 8B, Qwen 2.5 7B, Gemma 2 9B) reportedly run comfortably on Apple Silicon and feel clearly better than old GPT‑3.5.
- 70B‑class models (Llama 3.3 70B, Qwen 2.5 72B) can run on 64–128GB MacBook Pros but are described as “sluggish.”
- Smaller RAM systems (16–24GB) can run 8–14B models, often at the cost of not doing much else simultaneously.
- Some prefer building cheap desktops with consumer GPUs (e.g., 3090, 3060) for much better tokens/sec than Macs at similar or lower price.
- Debate over what counts as “consumer hardware”: price vs availability vs typical specs (e.g., Steam survey).
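The RAM figures quoted above follow from simple arithmetic: weight memory is roughly parameter count times bits per weight. A hypothetical back-of-the-envelope helper (function name and numbers are illustrative; real usage adds KV cache, activations, and runtime overhead):

```python
# Rough memory needed just to hold model weights at a given quantization.
# This is a sketch, not an exact figure: KV cache and runtime overhead
# add several GB on top in practice.

def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight memory in (decimal) GB."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# 14B at 4-bit quantization fits in ~7 GB, hence the 16-24GB machines;
# 70B at 4-bit needs ~35 GB, hence the 64-128GB Macs cited above.
for params, bits in [(14, 4), (14, 16), (70, 4), (70, 8)]:
    print(f"{params}B @ {bits}-bit: ~{weight_memory_gb(params, bits):.0f} GB")
```

This is also why a 24GB consumer GPU like the 3090 can serve quantized 14B models with room to spare but cannot hold a 70B model in VRAM alone.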
Phi‑4 Capabilities, Benchmarks, and Real‑World Use
- Phi‑4 (14B) is praised for unusually strong benchmarks and “punching above its weight,” with some early tests showing surprisingly good answers and reasoning.
- Others are skeptical, citing prior Phi models: excellent benchmarks but disappointing in real use, especially prompt adherence and instruction following.
- Phi‑4’s own paper acknowledges weaker performance on strict formatting and detailed instructions.
- Some see the main value as a research proof that small models can approach large‑model performance, not necessarily as a top practical model.
Synthetic Data & Model Collapse
- Training heavily on synthetic data is a central point of interest.
- Multiple replies argue “model collapse” only arises when models are repeatedly retrained on their own unfiltered, low‑quality outputs, which they distinguish from carefully constructed synthetic datasets mixed with real data.
- Others note image models can collapse under pure synthetic retraining, and that validation signals (compilers, human scoring) help avoid this.
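The distinction commenters draw can be sketched in a few lines: synthetic samples enter the training pool only if an external validator accepts them, rather than raw model output being fed straight back in. A minimal illustration (hypothetical names), using "does it compile as Python" as the stand-in validation signal:

```python
# Sketch of validated synthetic data: keep a candidate only if an
# external signal (here, the Python compiler) accepts it. Commenters
# cite compilers and human scoring as exactly this kind of filter.

def validate_python(snippet: str) -> bool:
    """External signal: does the candidate at least compile as Python?"""
    try:
        compile(snippet, "<candidate>", "exec")
        return True
    except SyntaxError:
        return False

def filter_synthetic(candidates: list[str]) -> list[str]:
    """Keep only candidates that pass the validator."""
    return [c for c in candidates if validate_python(c)]

candidates = [
    "def add(a, b): return a + b",  # valid -> kept
    "def broken(:",                 # invalid -> dropped
]
training_pool = filter_synthetic(candidates)
print(len(training_pool))  # 1
```

The argument is that this filtering step, absent in naive self-retraining, is what keeps quality from degrading generation over generation.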
Reasoning, SVG Benchmarks, and Multimodal Tests
- A popular informal benchmark is “draw a pelican riding a bicycle” as SVG; Phi‑4’s output here is valid SVG but visually poor.
- Commenters use such text‑to‑code‑to‑image tasks as a stress test for compositional reasoning, though some think claims about “SVG as a window to the physical world” are overhyped.
- Phi‑4 is described as showing more explicit chain‑of‑thought–style reasoning (considering alternatives) thanks to its training data design.
Prompt Adherence, Multilinguality, and Benchmarks
- Phi‑4 is criticized for weaker prompt adherence compared to models like Gemma 2 27B and 9B; Gemma and some others are preferred where strict formatting matters.
- 8B‑class LLMs marketed as multilingual are said to be far weaker outside English; non‑English outputs are described as often awkward or unusable compared to GPT‑3.5.
- Some accuse Phi models of being “trained on benchmarks” and see current hype as “benchmark inflation,” though others point to the detailed technical report for broader evaluation.
“Small” Model Definition and Ecosystem
- “Small” generally refers to parameter count (around 7–14B) and the ability to run on consumer hardware, contrasted with 70B+ and 100B+ LLMs.
- Inspiration is tied to prior work (e.g., TinyStories) showing that high‑quality, domain‑targeted data can let tiny models outperform much larger, more generic ones.
- Tools mentioned for running Phi‑4 and other models locally include GGUF variants via llama.cpp, LM Studio, Ollama, and various prompt templates baked into model files.
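The "prompt templates baked into model files" point refers to the chat template stored in GGUF metadata, which runners like llama.cpp and Ollama apply automatically. The sketch below hand-builds a ChatML-style prompt to show what such a template expands to; the exact special tokens are an assumption here and vary per model, so check the model card or GGUF metadata for the real ones:

```python
# Illustrative ChatML-style prompt rendering. Real runners read the
# correct template from the GGUF file; the <|im_start|>/<|im_end|>
# markers below are assumed for illustration.

def chatml_prompt(messages: list[dict]) -> str:
    """Render [{'role': ..., 'content': ...}] as a ChatML-style string."""
    parts = []
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    parts.append("<|im_start|>assistant\n")  # cue the model to respond
    return "".join(parts)

prompt = chatml_prompt([
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Draw a pelican riding a bicycle as SVG."},
])
print(prompt)
```

Sending a raw string formatted with the wrong template is a common cause of the "disappointing in real use" experiences reported above, which is why tools that read the embedded template tend to give better results than hand-rolled prompts.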