I want a good parallel computer
Workloads that might benefit from more parallelism
- Suggested candidates: video encoding, large-scale compilation/linking (e.g., Chromium), optimization problems (scheduling, routing), theorem proving, and complex 2D/3D rendering.
- Video encoding is split into two regimes:
  - Real-time (video calls, broadcast): fixed-function hardware wins on latency and power, but compresses worse.
  - “At rest” (YouTube, Blu-ray): CPU software encoders give the best compression, but are slow.
- Some argue a GPU-based general encoder could combine software‑grade compression with GPU throughput; others reply that new codecs are rare, and fixed‑function is “good enough.”
GPU strengths, limits, and developer experience
- GPUs excel at simple, massively data‑parallel workloads (graphics, linear algebra, ML).
- Many proposed workloads are “thinking” tasks with heavy branching and irregular control flow, where GPUs and SIMT/SIMD are a poor fit.
- Several comments call GPU programming “weird” and painful; the sketch after this list shows these pain points in miniature:
  - Separate compilation pipelines and runtime shader builds.
  - Distinct memory spaces and explicit data shuffling.
  - Synchronization friction and complex, vendor-specific APIs.
- Some believe these issues are largely abstractable at the language/runtime level; others think the underlying execution model is inherently constraining.
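To make the last two points concrete, here is a minimal CUDA sketch (the kernel name `scale_or_square` and the data are invented for illustration). The kernel’s data-dependent branch forces each 32-lane warp to serialize both paths, which is why branch-heavy “thinking” workloads map poorly onto SIMT hardware; the host side shows the allocation-and-copy ceremony that separate memory spaces impose.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Data-dependent branch: within a 32-lane warp, lanes that take
// different paths are serialized (SIMT divergence), which is why
// irregular, branch-heavy workloads map poorly onto GPUs.
__global__ void scale_or_square(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (data[i] < 0.5f)                   // divergent branch
            data[i] *= 2.0f;                  // some lanes run this...
        else
            data[i] = data[i] * data[i];      // ...while the rest wait
    }
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float* host = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) host[i] = (i % 100) / 100.0f;

    // Separate memory spaces: explicit device allocation and copies
    // in both directions are part of the "ceremony" the thread laments.
    float* dev;
    cudaMalloc(&dev, bytes);
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);

    scale_or_square<<<(n + 255) / 256, 256>>>(dev, n);
    cudaDeviceSynchronize();

    cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);
    printf("host[0] = %f\n", host[0]);

    cudaFree(dev);
    free(host);
    return 0;
}
```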
Alternative manycore and parallel architectures
- Past and niche efforts discussed: Connection Machine, Transputer, Cray MTA, Xeon Phi/Larrabee, GreenArrays, Epiphany, SGI NUMA, AIE arrays, etc.
- Repeated theme: designs built around “hundreds of tiny CPUs on a chip” usually fail because of awkward programming models and poor tooling, not raw hardware.
- Cache coherence and shared-memory scaling are called out as core blockers for “a CPU with thousands of worker cores.”
- Some advocate graph/DAG or dataflow-style IRs and graph reduction as a better fit than von Neumann-style threads; a sketch of the graph style follows this list.
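CUDA’s graph API is one mainstream step in this direction: work is recorded once as a DAG of kernel nodes and then replayed, instead of being issued as imperative calls. A minimal sketch, assuming the CUDA 12 `cudaGraphInstantiate` signature (the kernels `step_a`/`step_b` are placeholders):

```cuda
#include <cuda_runtime.h>

__global__ void step_a(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

__global__ void step_b(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main() {
    const int n = 1 << 16;
    float* x;
    cudaMalloc(&x, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Record the work as a graph (a DAG of kernel nodes)
    // instead of launching it eagerly.
    cudaGraph_t graph;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    step_a<<<(n + 255) / 256, 256, 0, stream>>>(x, n);
    step_b<<<(n + 255) / 256, 256, 0, stream>>>(x, n);  // depends on step_a
    cudaStreamEndCapture(stream, &graph);

    // Instantiate once, replay cheaply many times (CUDA 12 signature).
    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, 0);
    for (int iter = 0; iter < 10; ++iter)
        cudaGraphLaunch(exec, stream);
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    cudaFree(x);
    return 0;
}
```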
Unified memory, APUs, and using GPUs as generic workers
- Interest in APUs and unified memory (Apple Silicon, AMD Strix Halo, some Qualcomm/AMD parts) as a friendlier model that avoids PCIe copies (sketched after this list).
- Debate over AMD marketing claims that Strix Halo can beat an RTX 4090 on large LLMs: critics note the benchmarks are cherry-picked and memory-bound, with models too large for the 4090’s VRAM, so the APU’s big unified memory wins by default.
- Desire to treat iGPUs as transparent “efficiency cores” scheduled by the OS, though commenters note tooling, API, and hardware constraints stand in the way.
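For reference, CUDA’s managed memory already gives a taste of the friendlier model even on discrete GPUs: one allocation is visible to both CPU and GPU, and the runtime migrates pages on demand, so there are no explicit copies. A minimal sketch:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void increment(int* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main() {
    const int n = 1 << 20;
    int* data;

    // One allocation, one address range, visible to both CPU and GPU;
    // no cudaMemcpy in either direction.
    cudaMallocManaged(&data, n * sizeof(int));

    for (int i = 0; i < n; ++i) data[i] = i;      // CPU writes

    increment<<<(n + 255) / 256, 256>>>(data, n); // GPU reads/writes
    cudaDeviceSynchronize();

    printf("data[42] = %d\n", data[42]);          // CPU reads the result

    cudaFree(data);
    return 0;
}
```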
Rendering and dynamic workloads
- Some see massively parallel 2D GPU renderers as overkill; others point to complex vector art, maps, text, and fluid 2D UIs that do need serious GPU help.
- 3D rendering and lighting are highlighted as especially hard: general‑purpose renderers tend to scale poorly with scene complexity, and engines rely on deep integration with scene graphs and precomputation.
- The original post’s complaint: GPUs struggle with dynamic, coarse-grain scheduling and temporary buffer management, and current hardware increasingly accretes special-case blocks (RT cores, video blocks) instead of general primitives; the sketch below shows how narrow the existing escape hatch is.
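CUDA’s dynamic parallelism is the closest existing primitive to GPU-side coarse-grain scheduling, and its narrowness illustrates the complaint: a kernel can launch child kernels, but general work scheduling and temporary buffers still largely live on the host. A minimal sketch (`parent`/`child` are illustrative names; compile with `-rdc=true`):

```cuda
#include <cuda_runtime.h>

__global__ void child(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// A kernel deciding, on the device, how much follow-up work to launch.
// This is roughly as far as GPU-side scheduling goes today; buffers for
// the child still have to be provisioned up front by the host.
__global__ void parent(float* data, int n) {
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        int blocks = (n + 255) / 256;     // grid size decided at runtime
        child<<<blocks, 256>>>(data, n);  // device-side launch
    }
}

int main() {
    const int n = 1 << 16;
    float* data;
    cudaMalloc(&data, n * sizeof(float));
    parent<<<1, 32>>>(data, n);
    cudaDeviceSynchronize();
    cudaFree(data);
    return 0;
}
```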
Safety, memory models, and historical lessons
- Strong pushback against ideas like “flattening address spaces”: people recall unstable, pre‑protection systems and architectures like Cell as cautionary tales.
- Counterpoint: many of those designs were limited by their era; modern language and tooling advances (safe languages, IRs like SPIR‑V, JVM/WASM‑style runtimes) could revisit similar ideas more safely.
- Some suggest moving more of the protection/isolation burden into software runtimes to simplify hardware and potentially make parallel cores cheaper; a bounds-checking sketch follows.
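A sketch of the software-isolation idea in plain C++, loosely modeled on how WASM runtimes confine a guest to one bounds-checked linear memory (the `Sandbox` type is hypothetical): every access pays a software check, and in exchange the hardware needs no MMU-style protection for these workers.

```cpp
#include <cstdint>
#include <cstdio>
#include <stdexcept>
#include <vector>

// Hypothetical sandbox, loosely modeled on WASM linear memory:
// "guest" addresses are offsets into one bounds-checked buffer,
// so isolation is enforced by the runtime, not by an MMU.
class Sandbox {
    std::vector<uint8_t> mem;
public:
    explicit Sandbox(size_t size) : mem(size) {}

    uint8_t load(size_t addr) const {
        if (addr >= mem.size())            // software bounds check
            throw std::out_of_range("sandbox load out of bounds");
        return mem[addr];
    }

    void store(size_t addr, uint8_t value) {
        if (addr >= mem.size())
            throw std::out_of_range("sandbox store out of bounds");
        mem[addr] = value;
    }
};

int main() {
    Sandbox box(4096);
    box.store(100, 42);
    printf("box[100] = %d\n", box.load(100));

    try {
        box.load(1 << 20);   // out-of-bounds guest access is trapped
    } catch (const std::out_of_range& e) {
        printf("trapped: %s\n", e.what());
    }
    return 0;
}
```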
Why a “good parallel computer” is elusive
- Ecosystem and economics matter: new architectures struggle without a critical mass of software and experts, even if technically elegant.
- Several argue that much day‑to‑day software is bottlenecked by design, I/O, or concurrency, not raw parallel compute; optimizing code or UX often beats moving to GPUs.
- Distinction emphasized between parallelism (throughput on homogeneous data) and concurrency (independent, interacting tasks): most everyday apps are said to need the latter more than the former, as the toy contrast below tries to make concrete.
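A toy C++ contrast (illustrative only): the first half is parallelism, one homogeneous computation split across threads purely for throughput; the second is concurrency, dissimilar independent tasks that merely overlap in time.

```cpp
#include <cstdio>
#include <numeric>
#include <thread>
#include <vector>

int main() {
    // Parallelism: one homogeneous job (summing a big array),
    // split across threads purely for throughput.
    std::vector<int> data(1'000'000, 1);
    long long lo = 0, hi = 0;
    std::thread t1([&] {
        lo = std::accumulate(data.begin(), data.begin() + 500'000, 0LL);
    });
    std::thread t2([&] {
        hi = std::accumulate(data.begin() + 500'000, data.end(), 0LL);
    });
    t1.join(); t2.join();
    printf("sum = %lld\n", lo + hi);

    // Concurrency: independent, dissimilar tasks that merely overlap
    // in time -- what most everyday apps actually need.
    std::thread ui([] { printf("repainting the UI...\n"); });
    std::thread io([] { printf("writing the autosave file...\n"); });
    ui.join(); io.join();
    return 0;
}
```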