2025-07-10

Grok 4 Launch [video]

Benchmarks, Capabilities & Architecture

Many commenters note Grok 4’s very strong benchmark results (Humanity’s Last Exam, ARC-AGI 1/2, GPQA, AIME, USAMO, LiveCodeBench, NYT Connections) and suggest it may be short‑term SOTA.
Others are skeptical, citing benchmark contamination, training specifically on benchmark‑style data, multiple‑choice bias, and the questionable scientific validity of some tests (especially HLE).
The ARC-AGI-2 score (15.9%) is seen as especially interesting, but some still suspect targeted training rather than “general reasoning”.
Grok 4 Heavy’s main advertised trick is running multiple agents in parallel and aggregating their outputs; commenters compare this to mixture‑of‑experts, o1/o3‑style “thinking”, and agentic tool loops. Some see it as clever; others call it brute‑force scaling and a sign that core model quality is plateauing.
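
To make the "multiple agents in parallel, then aggregate" idea concrete, here is a minimal Python sketch. It is not Grok 4 Heavy's actual mechanism, which the discussion does not describe: majority voting is only one assumed aggregation rule, and stub_agent, run_parallel and majority_vote are hypothetical names, with stub_agent standing in for a real model call.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def stub_agent(agent_id: int, question: str) -> str:
    """Hypothetical stand-in for one model call; returns a canned answer."""
    # Assumption for illustration: agent 2 dissents so the vote is non-trivial.
    return "41" if agent_id == 2 else "42"


def run_parallel(question: str, n_agents: int,
                 agent: Callable[[int, str], str] = stub_agent) -> List[str]:
    """Query n_agents concurrently and collect their answers in agent order."""
    with ThreadPoolExecutor(max_workers=n_agents) as pool:
        futures = [pool.submit(agent, i, question) for i in range(n_agents)]
        return [f.result() for f in futures]


def majority_vote(answers: List[str]) -> str:
    """Return the most common answer; ties go to the earliest-seen answer."""
    return Counter(answers).most_common(1)[0][0]


if __name__ == "__main__":
    answers = run_parallel("What is 6 * 7?", n_agents=4)
    print(answers, "->", majority_vote(answers))
```

Cost grows linearly with the number of agents while the underlying model stays the same, which is the basis of the "brute‑force scaling" criticism above.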

Real-World Use, Integrations & Pricing

Early users report that Grok 4 is strong for reasoning, research, and some coding; it is available via grok.com, the X app, OpenRouter, Azure, and Cursor, and through agents like Aider/Cline.
Coding experience is mixed: some praise deep, technical responses; others find it slow and still prefer Claude/Gemini in IDE-integrated agents.
Voice mode impresses with its multilingual support, but turn detection and UX need work.
Heavy costs ~$300/month, and regular Grok 4 requires a paid tier; this fuels a broader discussion that frontier “thinking” models are getting more expensive despite earlier expectations of falling prices.

Trust, Safety & Politics

A major thread is distrust due to recent “MechaHitler” / Nazi output and earlier racist/antisemitic behavior; several see Grok as a professional or reputational liability.
Some argue this was mainly a system‑prompt issue; others point to right‑leaning fine‑tuning and Musk’s public politics as deeper concerns.
There is debate over “censorship vs freedom”: some value fewer safety rails; others view Musk’s intervention as just a different, ideologically driven form of censorship.
Several say they will not adopt Grok regardless of quality; others separate technical merit from politics and are eager to use the model.

Meta & Industry Context

Many view Musk’s claims about “new physics/technologies” as typical overpromising and discount the launch rhetoric.
Some feel HN discussions of negative Grok incidents get suppressed compared to technical launches.
There’s broad agreement that competition (Grok vs OpenAI/Anthropic/Google) is accelerating model progress, even if no one trusts any single benchmark or vendor completely.
