1.5 TB of VRAM on Mac Studio – RDMA over Thunderbolt 5
Wishlist for Future Macs and Hardware Limits
- Some expect an M5 Max/Ultra to offer DGX-style high-speed links (QSFP, 200–400 Gb/s), 1 TB of unified memory, >1 TB/s memory bandwidth, serious neural accelerators, and power envelopes well above today's ~250 W ceiling.
- Others see QSFP and 600 W desktops as unrealistic given Apple’s consumer focus and prior neglect of pro/server markets.
Apple’s Enterprise / Datacenter Strategy
- Several comments argue Apple has never treated the datacenter/enterprise space as a serious, high-margin market; past products like Xserve and Xserve RAID lagged behind true enterprise gear.
- Others counter that Apple now runs its own Apple‑silicon servers (including for Private Cloud Compute), with custom multi‑chip boards and Mellanox ConnectX (mlx5) NICs, and that features like Thunderbolt RDMA are likely downstream of those internal needs.
- There’s skepticism Apple will ever sell those server-class machines publicly, though some hope leadership changes could revive pro/server hardware.
RDMA over Thunderbolt vs Ethernet / InfiniBand
- RDMA over TB5 yields ~30–50 µs latency, versus ~300 µs for TCP over Thunderbolt; commenters expect similar latency for TCP over 200 GbE.
- A QSFP + 200–400 GbE switch setup could add nodes and bandwidth, but at higher cost and power and with some extra latency; debate centers on how significant that latency hit is (see the back-of-envelope sketch after this list).
- RoCE (RDMA over Converged Ethernet) is raised as a competitor; macOS apparently has an mlx5 driver but no RoCE support today. InfiniBand is cited as the traditional low‑latency RDMA fabric, but new AI/HPC clusters are trending towards Ethernet + RoCE.
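A crude way to frame that latency debate: model a transfer as fixed latency plus payload divided by bandwidth. The sketch below plugs in the rough figures from the thread (claims, not measurements); the payload sizes and the 200 GbE latency are illustrative assumptions.

```python
# Crude transfer-time model for the figures cited in the thread:
# time = fixed latency + payload / bandwidth. All numbers are rough
# claims from the discussion, not measurements.

def transfer_time_us(payload_bytes, latency_us, gbit_per_s):
    """One-way time to move a payload over a link, in microseconds."""
    bytes_per_us = gbit_per_s * 1e9 / 8 / 1e6  # bytes per microsecond
    return latency_us + payload_bytes / bytes_per_us

links = {
    "TB5 RDMA":   dict(latency_us=40,  gbit_per_s=80),   # ~30-50 us cited
    "TB5 TCP":    dict(latency_us=300, gbit_per_s=80),
    "200GbE TCP": dict(latency_us=300, gbit_per_s=200),  # assumed similar latency to TCP/TB
}

for payload in (4 * 1024, 16 * 1024**2):  # small sync vs bulk tensor
    for name, link in links.items():
        t = transfer_time_us(payload, **link)
        print(f"{name:>10} @ {payload:>8} B: {t:8.1f} us")
```

Under these assumptions, small control messages are dominated by the latency term and RDMA over TB5 wins by roughly 7×; only at multi-megabyte payloads does 200 GbE's extra bandwidth overtake it.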
Cluster Topology and Thunderbolt Limits
- TB5 requires a full mesh for low-latency memory access; daisy-chaining would saturate intermediate links and add latency.
- Confusion over port limits (3 vs 6) is clarified: current hardware can use all six TB ports; earlier statements were about an initial software/rollout limit.
- Lack of Thunderbolt switches caps scale (see the mesh arithmetic after this list); some speculate about using TB-to-PCIe adapters to attach traditional NICs, or future CXL-like solutions.
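The port count translates directly into a ceiling on cluster size: with no switches, a full mesh needs one port per peer on every node. A minimal sketch of that arithmetic, assuming every TB port can be dedicated to a peer link:

```python
# No Thunderbolt switches exist, so clusters are switchless point-to-point
# meshes. In a full mesh every node needs one port per peer, which makes
# port count the hard ceiling. Assumes all ports are usable for peer links.

def max_mesh_nodes(ports_per_node: int) -> int:
    return ports_per_node + 1  # N - 1 peers per node must fit in the ports

def mesh_links(nodes: int) -> int:
    return nodes * (nodes - 1) // 2  # total point-to-point cables

for ports in (3, 6):
    n = max_mesh_nodes(ports)
    print(f"{ports} usable ports -> up to {n} nodes, {mesh_links(n)} cables")
# 3 usable ports -> up to 4 nodes, 6 cables
# 6 usable ports -> up to 7 nodes, 21 cables
```

So the three-versus-six clarification matters: six usable ports lift the ceiling from a 4-node to a 7-node switchless cluster.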
Neural Accelerators and Software Ecosystem
- The Apple Neural Engine exists, with INT8/FP16 MACs, but tooling is seen as weak: conversion via CoreML/ONNX only, with no good native programming model (a minimal sketch of that conversion path follows this list).
- Some argue Apple should fund deep framework support (beyond prior TensorFlow work), especially for attention/FlashAttention‑style kernels and “neural accelerators” on the GPU.
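For concreteness, the conversion path being criticized looks roughly like the following sketch, using coremltools with a placeholder PyTorch model; note that even here, which ops actually land on the ANE is decided by the converter, not the programmer.

```python
# Minimal sketch of the CoreML conversion path: trace a (placeholder)
# PyTorch module and convert it, requesting the Neural Engine. The
# converter decides op-by-op what actually runs on the ANE.

import torch
import coremltools as ct

model = torch.nn.Sequential(
    torch.nn.Linear(256, 256),
    torch.nn.ReLU(),
).eval()

traced = torch.jit.trace(model, torch.randn(1, 256))

mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(shape=(1, 256))],
    convert_to="mlprogram",
    compute_units=ct.ComputeUnit.CPU_AND_NE,  # request ANE, CPU fallback
    compute_precision=ct.precision.FLOAT16,   # ANE is INT8/FP16-oriented
)
mlmodel.save("tiny.mlpackage")
```

That opacity is the core of the complaint: there is no CUDA-style way to write a custom attention kernel that targets the ANE directly.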
Power, Overclocking, and Efficiency
- One camp wants overclockable Macs and is “okay” with 600+ W draw to squeeze every ounce of performance from limited hardware budgets.
- Another camp strongly pushes back: modern chips already ship near the top of their voltage/frequency efficiency curve, so doubling or tripling power for a +10–20% gain is called wasteful and contrary to good engineering, except in rare non‑scalable workloads (see the perf-per-watt arithmetic after this list).
- Experiences from crypto mining and undervolting are cited to show how dramatically efficiency improves when power is reduced.
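The pushback is easy to quantify. A toy calculation, where the +15% and 600 W figures are hypothetical stand-ins for the thread's "doubling/tripling power for +10–20%" claim:

```python
# Illustrative perf-per-watt math; the +15% / 600 W figures are
# hypothetical stand-ins for the thread's claim.

base_perf, base_power = 1.00, 250.0  # stock config (relative perf, watts)
oc_perf, oc_power = 1.15, 600.0      # the "600+ W" overclock scenario

ratio = (oc_perf / oc_power) / (base_perf / base_power)
print(f"overclocked perf/W vs stock: {ratio:.2f}x")  # ~0.48x
```

Under those assumptions the overclocked box delivers about 0.48× the performance per watt, which is the crypto-mining lesson in reverse: miners undervolt because the same curve, walked downward, buys efficiency.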
Use Cases, Value, and Alternatives
- Some see the demo (a very expensive local chatbot rig) as underwhelming compared to what’s possible: large‑scale image/video generation, MoE and 70B fine‑tuning, etc.
- Others highlight the appeal of local, privacy‑preserving assistants that can act on personal data (messages, history) and frustrations with web search pushing people to LLMs for “facts.”
- Comparisons are made to GB10/GB300 and other NVIDIA workstations: they may match or beat 3090‑class performance and interconnects, but with shorter product lifecycles and worse general‑desktop experience vs long‑lived Macs.
Scaling Limits and Model Size
- Discussion of DeepSeek‑class (700 GB) models notes only modest speedups going from one 512 GB node to several, because TB5 bandwidth (80 Gb/s, i.e. ~10 GB/s) is roughly two orders of magnitude below local unified-memory bandwidth.
- Debate over whether weights can "just be memory‑mapped" from SSD: many argue a dense model touches effectively all of its weights every token, so SSD paging quickly becomes a severe bottleneck (see the roofline sketch below), even if MoE sparsity can ease distribution.
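A back-of-envelope roofline makes the mmap argument concrete: if a dense model's weights must be re-read through some path on every token, tokens/s cannot exceed that path's bandwidth divided by the weight footprint. The bandwidth figures below are typical/assumed values, not numbers from the thread.

```python
# Roofline for the mmap debate: a dense decoder touches essentially all
# weights per token, so tokens/s <= path_bandwidth / weights_size for
# whatever path the weights must cross each token. Bandwidths are
# typical/assumed values, not numbers from the thread.

WEIGHTS_GB = 700  # DeepSeek-class footprint cited in the discussion

paths_gb_per_s = {
    "local unified memory": 800,  # ~819 GB/s on an M3 Ultra, rounded
    "fast NVMe SSD": 7,           # assumed sequential-read figure
}

for name, bw in paths_gb_per_s.items():
    tps = bw / WEIGHTS_GB
    print(f"{name:>22}: <= {tps:.3f} tokens/s ({1 / tps:.1f} s/token)")
# local memory: ~1.1 tokens/s; SSD paging: ~100 s per token.
```

Sharded multi-node setups escape this bound on the interconnect because they ship small activations rather than weights between nodes; a dense model paged from SSD does not, which is why paging lands around 100 s per token in this crude model.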