2024-06-12

How Meta trains large language models at scale

Hardware Choices: GPUs vs TPUs and Custom Silicon

A large part of the thread debates Nvidia GPUs vs Google TPUs and other custom accelerators.
Some argue Google’s multi‑generation TPU effort plus internal ML research should make it the long‑term winner.
Others counter that Nvidia’s chips are faster for LLM training and that CUDA and its massive software ecosystem make Nvidia hard to displace.
There’s disagreement on whether TPUs are mainly for inference or also competitive for training; benchmark papers and vendor claims are cited on both sides.
Multiple companies (Google, Meta, Microsoft, Apple, AWS) are noted as building their own AI chips primarily to cut internal costs, not necessarily to sell widely.

Consumer GPUs vs Data Center GPUs

One subthread disputes the idea that big players are “racking 4090s.”
Some insist it happens out of necessity due to H100 scarcity; others call it technically and economically “stupid” for large‑scale training (VRAM limits, no NVLink, PCIe bottlenecks, cooling, power, lack of ECC, licensing).
Smaller GPU clouds are cited as offering multi‑4090 servers anyway.
Anecdote: a company has ~400 H100s sitting idle, prompting shock and many offers to rent or use them.
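
The VRAM objection in this subthread can be made concrete with rough arithmetic. The sketch below assumes mixed-precision Adam training at about 16 bytes of model state per parameter (fp16 weights and gradients plus fp32 master weights, momentum, and variance), ignores activations entirely, and assumes the states shard perfectly evenly across GPUs. The function name and the 24 GB (RTX 4090) and 80 GB (H100) capacities are supplied for illustration; none of these numbers come from the thread.

```python
# Rough lower bound on GPU count for holding training state.
# Assumption: ~16 bytes per parameter with mixed-precision Adam
# (2 fp16 weights + 2 fp16 grads + 4 fp32 master weights
#  + 4 fp32 momentum + 4 fp32 variance). Activations are ignored.
BYTES_PER_PARAM = 2 + 2 + 4 + 4 + 4

def min_gpus_for_states(n_params: int, vram_gb: int) -> int:
    """Smallest GPU count whose combined VRAM holds the model states,
    assuming ideal even sharding (hypothetical helper)."""
    total_bytes = n_params * BYTES_PER_PARAM
    per_gpu_bytes = vram_gb * 10**9
    # Integer ceiling division.
    return -(-total_bytes // per_gpu_bytes)

if __name__ == "__main__":
    for name, n_params in [("7B", 7 * 10**9), ("70B", 70 * 10**9)]:
        for gpu, vram_gb in [("24 GB card", 24), ("80 GB card", 80)]:
            count = min_gpus_for_states(n_params, vram_gb)
            print(f"{name} on {gpu}: at least {count} GPUs")
```

Under these assumptions the smaller card needs roughly three times as many devices just to hold the state, and every extra device adds traffic over PCIe, which is where the no-NVLink objection comes in.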

Networking and Cluster Architecture

Meta reportedly built two 24k‑GPU clusters: one InfiniBand, one RoCE/Ethernet, partly to compare them.
Commenters highlight that Ethernet + RoCE with commodity switches can be much cheaper than Mellanox InfiniBand and may be “good enough” at scale.
Some argue the true moat is full‑system design (interconnects, orchestration, fault tolerance), not just the chip.

Scheduling and Software Stack

Readers complain the article is vague on “efficient scheduling.”
People speculate that Meta uses a standard HPC scheduler (e.g., Slurm) and note that some large AI orgs run training on Kubernetes, sometimes layering Slurm on top.

Meta’s Business Model and Product Use

Several commenters find it unclear how Meta will directly monetize LLMs.
Proposed uses: better ad targeting, large‑scale content moderation, WhatsApp customer support bots, and VR/AR worlds populated with AI avatars.
Some suspect internal models are stronger than open Llama releases; others note Meta plans to release very large models too.

Data, PII, and Training Sources

Questions are raised about whether Meta trains on user content and how PII is handled.
One perspective: internal infra hides PII by default; access to fields like user IDs is tightly controlled.
Others point to opt‑out‑style data collection for AI and question whether aggregate behavior is effectively PII.
LLaMA papers are cited as claiming use of public datasets, avoidance of Meta product data, and filtering out obvious PII sources.

Operational Challenges and Humor

The phrase “GPU falling off the bus” sparks jokes but refers to real failures in which the host stops detecting a GPU over PCIe; commenters note it’s an actual Nvidia driver message.
People relate that even massive orgs fight the same hardware issues (riser cards, cabling, cooling) as smaller on‑prem setups.
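
For readers who want to spot the condition on their own machines, here is a minimal sketch that scans kernel log lines for it. It assumes the driver reports the event as Xid 79 in an `NVRM: Xid (PCI:...)` line; exact formatting varies by driver version, and the sample lines and the helper name `find_bus_drops` are made up for illustration rather than taken from real logs.

```python
import re

# Assumed line shape, modeled on typical NVRM kernel log entries.
XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+)")

# Xid 79 is the code the driver pairs with "GPU has fallen off the bus".
BUS_DROP_XID = "79"

def find_bus_drops(lines):
    """Return PCI addresses from lines reporting Xid 79, in order."""
    dropped = []
    for line in lines:
        match = XID_RE.search(line)
        if match and match.group(2) == BUS_DROP_XID:
            dropped.append(match.group(1))
    return dropped

# Illustrative sample lines, not captured from a real system.
SAMPLE = [
    "kernel: NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, GPU has fallen off the bus.",
    "kernel: NVRM: Xid (PCI:0000:5e:00): 31, pid=99, MMU fault",
    "kernel: usb 1-1: new high-speed USB device",
]

if __name__ == "__main__":
    print(find_bus_drops(SAMPLE))
```

In practice the input would come from the kernel log (for example, the output of `dmesg`), and a hit usually points at the riser, cabling, power, or cooling problems mentioned above rather than at software.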

Broader Reflections on AI and Meta

Some see dedicating supercomputer‑scale clusters to ad‑driven LLMs as wasteful; others reply that for‑profit work drives the economy and Meta’s rapid deployment is technically impressive.
There’s recurring skepticism about Google’s ability to capitalize on its tech advantages, and parallel skepticism about Meta’s alignment of incentives (e.g., improving search vs maximizing engagement).
