2024-08-14

Grok-2 Beta Release

Model quality & benchmarks

Many see Grok‑2 as a big step up from Grok‑1, now “near the top” alongside OpenAI, Anthropic, Google, and Meta.
The thread cites LMSYS rankings and benchmark tables: Grok‑2 appears to beat most models except Claude 3.5 Sonnet, GPT‑4o, and Gemini 1.5 Pro, though different models win different tests.
Some are skeptical it’s more than benchmark‑tuning; others say it’s clearly “good enough to be worth using today.”
Several people compare it directly to Claude 3.5 Sonnet and still rank Claude ahead for coding and general quality.

API, lock‑in & business risk

Excitement about “one more top model via API,” with hope this drives prices down.
Counterpoint: many distrust X as a platform after Twitter’s API history and Musk’s erratic management, and see it as risky to build a business on.
Others argue LLM APIs are relatively swappable (string in/out, use abstraction layers) so lock‑in is limited, though system prompts and task‑specific performance complicate migrations.
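
The "abstraction layer" argument can be illustrated with a minimal sketch. Everything here is hypothetical: `ChatClient`, `fake_provider_a`, and `fake_provider_b` are illustrative stand-ins, not any vendor's SDK, and a real backend would wrap an actual HTTP call. The sketch assumes a provider can be reduced to a function from (system prompt, user prompt) to reply text, and it keeps system prompts per provider because, as the thread notes, those tend not to transfer cleanly between models.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

# Assumption: a backend is any function (system_prompt, user_prompt) -> reply text.
Backend = Callable[[str, str], str]


@dataclass
class ChatClient:
    """Hypothetical provider-agnostic wrapper; application code calls only complete()."""
    backends: Dict[str, Backend]
    active: str
    # System prompts are stored per provider, since they often need re-tuning.
    system_prompts: Dict[str, str] = field(default_factory=dict)

    def complete(self, user_prompt: str) -> str:
        backend = self.backends[self.active]
        system = self.system_prompts.get(self.active, "")
        return backend(system, user_prompt)

    def switch(self, name: str) -> None:
        # Swapping providers is a one-line change for callers.
        if name not in self.backends:
            raise KeyError(f"unknown backend: {name}")
        self.active = name


# Fake providers standing in for real API calls (purely illustrative).
def fake_provider_a(system: str, user: str) -> str:
    return f"[a:{system}] {user.upper()}"


def fake_provider_b(system: str, user: str) -> str:
    return f"[b:{system}] {user}"


if __name__ == "__main__":
    client = ChatClient(
        backends={"a": fake_provider_a, "b": fake_provider_b},
        active="a",
        system_prompts={"a": "be terse"},
    )
    print(client.complete("hi"))
    client.switch("b")  # no system prompt registered for "b" yet
    print(client.complete("hi"))
```

The switch itself is trivial, which is the point made in favor of limited lock‑in. The per‑provider prompt table and any task‑specific evaluation are where the migration cost the thread mentions would actually show up.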

Data use, tweets & GDPR

Strong criticism that Twitter/X irreversibly feeds user data into Grok without clear consent, especially under EU law.
Explanation: once data is baked into model weights it can’t be selectively removed without retraining.
Debate over whether X’s ToS grants a broad enough license for AI training and how GDPR’s purpose‑limitation and consent rules apply. Some insist the training is unlawful in the EU; others say the data is in US datacenters and X’s lawyers seem unconcerned.

Safety, censorship & political bias

Users ask how “censored” Grok is compared with other LLMs; some want fewer refusals.
Reports: image generation blocks nudity but freely creates some shocking or political content, while seeming to over‑sanitize or distort some LGBT prompts.
Disagreement over whether reduced “safety” boosts benchmarks, and whether Grok is meaningfully less censored than Claude or “uncensored” open‑source models.
Political alignment and “alt‑right AI” fears are raised; others push back on assuming bias without testing.

EU regulation & regional availability

Frustration from Europeans about staggered launches and delayed features; some say they’ll just use VPNs.
Others defend EU privacy and consumer protections, arguing US companies must obey EU law if they operate there.
Debate over whether EU regulation is “dumb” in implementation (e.g., cookie popups) vs. necessary to constrain data‑mining.

Ethics, Musk, and consistency

Repeated criticism that xAI’s actions contradict earlier complaints about OpenAI: Grok is not open‑source, still a frontier model, and uses tweets for training.
Some attempt to rationalize: Grok‑1 was open‑sourced after a lag and no one else paused development, so xAI followed suit.
Long sub‑thread disputes Musk’s ethics, truth‑seeking claims, and “free speech” positioning, citing past behavior and moderation choices on X.

Open‑source & local models

A few hope for an open release of Grok‑2 similar to Grok‑1, but optimism is low.
Some argue the real moat is high‑quality data and massive compute, not code alone.
Others say they only care once a small Grok model is downloadable and quantizable for fast local use; until then they’ll stick with Meta/Mistral and other open options.

Benchmarks & evaluation limits

Chatbot Arena rankings are viewed with increasing skepticism: possible tuning, sample bias (English‑heavy, “vibes”‑driven), and ease of gaming.
Alternative niche or task‑specific benchmarks are suggested (coding, search, LiveBench), but the consensus is that standardized, robust LLM evaluation remains unsolved.
