Phi-4 available on Ollama

Availability, Formats, and Bug Fixes

  • Phi-4 is now an official Ollama model; community ports existed earlier, including versions with Unsloth’s bug fixes.
  • Some GGUF builds on Hugging Face produced inference errors because Phi-4’s architecture diverges from Phi-3.5 while reusing the “phi3” identifier; Ollama’s build adjusts the hyperparameters to avoid this.
  • Users can pull GGUFs directly from Hugging Face into Ollama (e.g., specifying quantization like :Q8_0), but nontrivial models (vision, special schemas) may need custom Modelfiles.
  • Future Ollama releases are expected to resolve the GGUF hyperparameter error generally.

Quality, Benchmarks, and Evaluation Methods

  • Several users say earlier Phi models underperformed relative to benchmarks, but report Phi-4 (14B) as a major step up, “GPT‑4-class” for many tasks and strong in languages like Japanese.
  • One benchmark on the top 1,000 StackOverflow questions ranked Phi-4 3rd, above GPT‑4 and Claude 3.5 Sonnet, but it used Mixtral 8x7B as an automated judge, which is controversial.
  • Critics argue LLM-as-judge tends to favor its own lineage and insist human evaluation is the only solid standard; others counter that LLM grading plus user votes is “good enough” for relative model ranking.
  • Phi-4 scores relatively poorly on IFEval (instruction-following with strict constraints), flagged as a concern for constrained outputs.
  • A separate case study shows Phi-4 can match GPT‑4o’s decisions ~97% of the time on a complex task when given high-quality few-shot examples, versus ~37% agreement without them.
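The few-shot result above comes down to prompt construction: worked examples are interleaved as user/assistant turns before the real query. A minimal sketch, with the payload shape following Ollama's /api/chat; the classification task and labels here are invented for illustration:

```python
# Sketch: assemble a few-shot chat payload for a local model. The case study
# above suggests high-quality exemplars, not model size, closed most of the
# gap to GPT-4o; the structure below (system rule + worked examples + the
# real query) is a common way to supply them.

def build_few_shot_messages(system: str,
                            examples: list[tuple[str, str]],
                            query: str) -> list[dict]:
    messages = [{"role": "system", "content": system}]
    for question, answer in examples:
        # Each exemplar becomes a user turn plus the "ideal" assistant reply.
        messages.append({"role": "user", "content": question})
        messages.append({"role": "assistant", "content": answer})
    messages.append({"role": "user", "content": query})
    return messages

payload = {
    "model": "phi4",
    "messages": build_few_shot_messages(
        system="Classify each support ticket as BUG, FEATURE, or QUESTION.",
        examples=[
            ("App crashes when I tap save.", "BUG"),
            ("Please add dark mode.", "FEATURE"),
        ],
        query="How do I export my data?",
    ),
    "stream": False,
}
# POST payload to http://localhost:11434/api/chat to run it against Phi-4.
```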

Local Performance and Ecosystem

  • Multiple users are “blown away” that GPT‑4-like models now run locally (e.g., on M1/M2/M3 Macs with ≥16 GB RAM), though speeds vary and some report issues (e.g., blank outputs on certain setups).
  • Phi-4’s 14B size plus strong reasoning is seen as a turning point for practical local NLP, RAG, and coding assistance; compared favorably to Qwen 2/2.5 and Llama 3.3 70B.
  • Some express dissatisfaction with Ollama/llama.cpp (limited multimodal support, no Vulkan in Ollama) and are exploring vLLM as an alternative.

Business, Strategy, and Licensing

  • Phi-4 is MIT-licensed and available via OpenRouter, enabling cheap hosted access and easy self-hosting.
  • Discussion suggests major cloud providers see models as increasingly commoditized and focus on infra and integrated products, contrasting with OpenAI’s more closed approach.
  • Some view Microsoft’s open releases as a hedge against OpenAI and evidence that proprietary model moats are weak; others note these are “non-SOTA” but still strategically useful.

Technical Design, Training Data, and Legality

  • Phi-4’s strong performance despite its size is attributed (per its technical report) to highly curated, largely synthetic data (textbooks, problem sets) instead of massive web dumps.
  • This raises the question of whether training avoided copyright infringement; responses note that legality is unclear and may hinge on “fair use,” regardless of user perception.

Structured Outputs and Practical Use

  • Ollama recently added structured output support; users report it works reasonably well when schemas are simple, though it is not as robust as OpenAI-style constrained decoding.
  • Third-party tools (e.g., BAML) are cited as improving JSON reliability across providers.
  • Some minor quirks are noted (e.g., Markdown code fencing styles), possibly reflecting training data habits.
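Ollama's structured-output feature takes a JSON Schema in the request's `format` field and constrains decoding to match it. A minimal sketch, keeping the schema flat and simple per the reports above (the extraction task and fields are an invented example):

```python
# Sketch: request a structured (schema-constrained) reply from Ollama.
# The "format" field carries a JSON Schema; simple, flat schemas work best.

import json

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
    },
    "required": ["name", "age"],
}

payload = {
    "model": "phi4",
    "messages": [{"role": "user", "content": "Extract: 'Ada Lovelace, 36.'"}],
    "format": schema,   # constrained decoding target
    "stream": False,
}

# POST to http://localhost:11434/api/chat; the reply's message.content should
# then parse cleanly with json.loads(response["message"]["content"]).
```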

Broader Societal and Future Concerns

  • Several comments marvel at the pace: powerful local models, high-quality image/video generation, and imminent voice-to-voice assistants.
  • There is sharp disagreement on long-term impacts: some expect tools that augment humans; others foresee severe job displacement, social instability, and AI-enabled weapons development.
  • Many anticipate AI becoming a generic “feature” in all products rather than a standalone destination, which may challenge API-centric businesses.