Phi-4 available on Ollama

Availability, Formats, and Bug Fixes

  • Phi-4 is now an official Ollama model; community ports existed earlier, including versions with Unsloth’s bug fixes.
  • Some GGUF builds on Hugging Face produced inference errors because Phi-4’s architecture diverges from Phi-3.5 while reusing the “phi3” identifier; Ollama’s build adjusts the hyperparameters to avoid this.
  • Users can pull GGUFs directly from Hugging Face into Ollama (e.g., specifying quantization like :Q8_0), but nontrivial models (vision, special schemas) may need custom Modelfiles.
  • Future Ollama releases are expected to resolve the GGUF hyperparameter error generally.

Quality, Benchmarks, and Evaluation Methods

  • Several users say earlier Phi models underperformed relative to benchmarks, but report Phi-4 (14B) as a major step up, “GPT‑4-class” for many tasks and strong in languages like Japanese.
  • One benchmark on the top 1,000 StackOverflow questions ranked Phi-4 3rd, above GPT‑4 and Claude 3.5 Sonnet, but it used Mixtral 8x7B as an automated judge, which is controversial.
  • Critics argue LLM-as-judge tends to favor its own lineage and insist human evaluation is the only solid standard; others counter that LLM grading plus user votes is “good enough” for relative model ranking.
  • Phi-4 scores relatively poorly on IFEval (instruction-following with strict constraints), flagged as a concern for constrained outputs.
  • A separate case study shows Phi-4 can match GPT‑4o’s decisions ~97% of the time on a complex task when given high-quality few-shot examples, versus ~37% agreement without them.
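The few-shot result above comes down to prompt construction: worked examples are interleaved as user/assistant turns before the real query. A minimal sketch, with the payload shape following Ollama's /api/chat; the classification task and labels here are invented for illustration:

```python
# Sketch: assemble a few-shot chat payload for a local model. The case study
# above suggests high-quality exemplars, not model size, closed most of the
# gap to GPT-4o; the structure below (system rule + worked examples + the
# real query) is a common way to supply them.

def build_few_shot_messages(system: str,
                            examples: list[tuple[str, str]],
                            query: str) -> list[dict]:
    messages = [{"role": "system", "content": system}]
    for question, answer in examples:
        # Each exemplar becomes a user turn plus the "ideal" assistant reply.
        messages.append({"role": "user", "content": question})
        messages.append({"role": "assistant", "content": answer})
    messages.append({"role": "user", "content": query})
    return messages

payload = {
    "model": "phi4",
    "messages": build_few_shot_messages(
        system="Classify each support ticket as BUG, FEATURE, or QUESTION.",
        examples=[
            ("App crashes when I tap save.", "BUG"),
            ("Please add dark mode.", "FEATURE"),
        ],
        query="How do I export my data?",
    ),
    "stream": False,
}
# POST payload to http://localhost:11434/api/chat to run it against Phi-4.
```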

Local Performance and Ecosystem

  • Multiple users are “blown away” that GPT‑4-like models now run locally (e.g., on M1/M2/M3 Macs with ≥16 GB RAM), though speeds vary and some report issues (e.g., blank outputs on certain setups).
  • Phi-4’s 14B size plus strong reasoning is seen as a turning point for practical local NLP, RAG, and coding assistance; compared favorably to Qwen 2/2.5 and Llama 3.3 70B.
  • Some express dissatisfaction with Ollama/llama.cpp (limited multimodal support, no Vulkan in Ollama) and are exploring vLLM as an alternative.

Business, Strategy, and Licensing

  • Phi-4 is MIT-licensed and available via OpenRouter, enabling cheap hosted access and easy self-hosting.
  • Discussion suggests major cloud providers see models as increasingly commoditized and focus on infra and integrated products, contrasting with OpenAI’s more closed approach.
  • Some view Microsoft’s open releases as a hedge against OpenAI and evidence that proprietary model moats are weak; others note these are “non-SOTA” but still strategically useful.

Technical Design, Training Data, and Legality

  • Phi-4’s strong performance despite its size is attributed (per its technical report) to highly curated, largely synthetic data (textbooks, problem sets) instead of massive web dumps.
  • This raises the question of whether training avoided copyright infringement; responses note that legality is unclear and may hinge on “fair use,” regardless of user perception.

Structured Outputs and Practical Use

  • Ollama recently added structured output support; users report it works reasonably well when schemas are simple, though it is not as robust as OpenAI-style constrained decoding.
  • Third-party tools (e.g., BAML) are cited as improving JSON reliability across providers.
  • Some minor quirks are noted (e.g., Markdown code fencing styles), possibly reflecting training data habits.
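Ollama's structured-output feature takes a JSON Schema in the request's `format` field and constrains decoding to match it. A minimal sketch, keeping the schema flat and simple per the reports above (the extraction task and fields are an invented example):

```python
# Sketch: request a structured (schema-constrained) reply from Ollama.
# The "format" field carries a JSON Schema; simple, flat schemas work best.

import json

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
    },
    "required": ["name", "age"],
}

payload = {
    "model": "phi4",
    "messages": [{"role": "user", "content": "Extract: 'Ada Lovelace, 36.'"}],
    "format": schema,   # constrained decoding target
    "stream": False,
}

# POST to http://localhost:11434/api/chat; the reply's message.content should
# then parse cleanly with json.loads(response["message"]["content"]).
```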

Broader Societal and Future Concerns

  • Several comments marvel at the pace: powerful local models, high-quality image/video generation, and imminent voice-to-voice assistants.
  • There is sharp disagreement on long-term impacts: some expect tools that augment humans; others foresee severe job displacement, social instability, and AI-enabled weapons development.
  • Many anticipate AI becoming a generic “feature” in all products rather than a standalone destination, which may challenge API-centric businesses.