What happens if we remove 50 percent of Llama?
Impact on inference and hardware constraints
- Many see 50% sparsity as a big win for running larger models on consumer GPUs, since VRAM is usually the bottleneck and weights dominate VRAM use.
- Example: a ~32B model at 4‑bit needs ~16–18 GB of VRAM for weights, but a full 32k context can add ~10 GB more for the KV cache and activations; sparsity could free VRAM either for larger models or for longer context.
- Sparse models are seen as beneficial for “low‑end” GPUs and midrange cards (e.g., 16 GB consumer GPUs), though some argue high‑end Macs and expensive GPUs aren’t really “consumer” hardware.
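The VRAM arithmetic in the bullets above can be made concrete. A minimal sketch with illustrative numbers (real runtimes add overhead, and sparse storage formats carry index metadata not modeled here):

```python
# Rough VRAM estimate for quantized dense vs. 50%-sparse weights.
# Numbers are illustrative; actual usage depends on the quantization
# format, KV-cache precision, and runtime overhead.

def weight_vram_gb(params_b, bits_per_weight, density=1.0):
    """VRAM for weights in GB: params (billions) * bytes/weight * density."""
    return params_b * 1e9 * (bits_per_weight / 8) * density / 1e9

dense_4bit = weight_vram_gb(32, 4)         # 32B model, 4-bit, dense
sparse_4bit = weight_vram_gb(32, 4, 0.5)   # same model with half the weights dropped
print(f"dense 4-bit: {dense_4bit:.1f} GB")
print(f"50% sparse:  {sparse_4bit:.1f} GB (freed: {dense_4bit - sparse_4bit:.1f} GB)")
```

At these numbers, 50% sparsity frees roughly 8 GB, which is the gap between "doesn't fit" and "fits with room for context" on a 16 GB consumer card.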
Sparsity vs quantization and benchmarks
- Commenters question whether the same quality/speed/size tradeoffs could be achieved with quantization plus fine‑tuning alone, without sparsity.
- A few readers want charts combining inference speed, VRAM, and quality to directly compare “sparse + maybe higher bits” vs “denser + lower bits,” but this isn’t provided.
- Some wonder about out‑of‑sample robustness of sparse models and how far you can prune before accuracy and generalization collapse.
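One way to frame the "sparse + higher bits" vs "denser + lower bits" comparison the thread asks for is by equal storage budget. A hypothetical sketch (all numbers illustrative; it deliberately omits quality and speed, which is exactly the missing axis commenters point out):

```python
# Compare per-parameter storage of two hypothetical configurations:
#   (a) dense weights at 4 bits
#   (b) 50%-sparse weights at 8 bits, plus ~1 bit/param of index
#       metadata (a rough, assumed figure for a structured format)
# Quality is NOT modeled -- that is the chart readers wanted and didn't get.

def storage_bits_per_param(bits, density=1.0, index_bits=0.0):
    """Effective storage in bits per original (dense) parameter."""
    return bits * density + index_bits

dense_4  = storage_bits_per_param(4)                        # 4.0 bits/param
sparse_8 = storage_bits_per_param(8, 0.5, index_bits=1.0)   # 5.0 bits/param
print(f"dense 4-bit:        {dense_4} bits/param")
print(f"sparse 8-bit (2:4): {sparse_8} bits/param")
```

The point of the sketch: on storage alone the two options land close together, so without a quality axis the comparison is underdetermined.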
Mixture-of-Experts and modular models
- One line of discussion asks whether smaller, domain‑specific models could be combined at runtime.
- Mixture‑of‑Experts is presented as the closest current approach, but commenters stress experts aren’t clean domain modules and routing behavior is poorly understood and often per‑token.
- Others mention related ideas: speculative decoding (clarifying it’s about speed, not domains), task arithmetic (combining task‑specific finetunes), and ensemble/portfolio methods from classical ML.
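The "routing is per‑token" point about MoE can be illustrated with a minimal top‑k router sketch (numpy, toy dimensions; real MoE layers use learned, trained gates inside every MoE block, and nothing forces expert choices to align with human‑legible domains):

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, n_experts, k = 4, 8, 4, 2

x = rng.standard_normal((n_tokens, d_model))        # token representations
w_gate = rng.standard_normal((d_model, n_experts))  # gating weights (random stand-in)

logits = x @ w_gate                                 # one score per (token, expert)
top_k = np.argsort(logits, axis=-1)[:, -k:]         # each token picks its own k experts
for t, experts in enumerate(top_k):
    print(f"token {t} -> experts {sorted(experts.tolist())}")
```

Each token independently selects experts, which is why "swap in the biology expert for this chat" is not how current MoE models actually work.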
LLM understanding and reasoning
- Some argue LLMs are “well understood” mathematically; others say we still lack deeper insight into how parameters encode concepts, analogous to gaps in understanding human cognition.
- A side debate references a paper claiming transformers lack true reasoning; critics note that larger models (including frontier ones) perform much better on those benchmarks, so conclusions based on small models are disputed.
Biological analogies and pruning
- Several liken 50% pruning to synaptic pruning and neural redundancy in the brain, citing silent neurons and developmental pruning.
- Others warn against overinterpreting the analogy: pruning clearly helps ANNs, but biological mechanisms and memory formation remain poorly understood and very different from backprop.
- There’s speculation about two‑phase “train large, then compress” strategies, tying in lottery‑ticket ideas and overparameterization as a path to better optimization.
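The "train large, then compress" idea in the bullet above is simplest to see as one‑shot magnitude pruning; a hedged sketch (the pruning methods actually under discussion use more sophisticated, calibration‑aware criteria, and lottery‑ticket work additionally rewinds and retrains):

```python
import numpy as np

def magnitude_prune(w, sparsity=0.5):
    """Zero out the smallest-magnitude fraction of weights (one-shot, global)."""
    threshold = np.quantile(np.abs(w).ravel(), sparsity)  # |w| cut point
    return np.where(np.abs(w) >= threshold, w, 0.0)

rng = np.random.default_rng(1)
w = rng.standard_normal((4, 8))       # stand-in for a trained weight matrix
pruned = magnitude_prune(w)
print(f"fraction zeroed: {np.mean(pruned == 0):.0%}")
```

In a two‑phase pipeline this step would be followed by fine‑tuning the surviving weights to recover accuracy.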
Scaling limits, redundancy, and “the wall”
- One view: heavy sparsity shows large networks are highly redundant, and future scaling laws should factor in efficiency/entropy, not just size and compute.
- Counterpoint: the pruned weights weren’t pure “gibberish,” since performance did drop; and you can’t naively train directly into the final sparse configuration.
- Another thread suggests the real scaling “wall” is data, not parameters: organic, high‑quality data grows roughly linearly, while model/compute scaling has been exponential. Synthetic data and user–LLM logs may help but don’t fix this fundamental mismatch.
- Multimodal data (e.g., video) is noted as an underused source, but also expensive and possibly less abstract than text.
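The linear‑data vs exponential‑compute mismatch in the bullets above can be sketched with toy numbers (growth rates and token counts are made up for illustration, not measured values):

```python
# Toy illustration: training-token demand grows multiplicatively per
# model "generation", organic text grows roughly additively.
# All numbers are invented for illustration.

tokens_available = 10e12   # assume ~10T tokens of usable organic text today
tokens_growth = 1e12       # assume ~1T new tokens per year (roughly linear)
tokens_wanted = 10e12      # assume this generation's training budget
budget_multiplier = 4      # assume each generation wants ~4x more tokens

for year in range(5):
    print(f"year {year}: want/have = {tokens_wanted / tokens_available:.1f}x")
    tokens_wanted *= budget_multiplier
    tokens_available += tokens_growth
```

Whatever the exact constants, a multiplicative demand curve crosses an additive supply curve quickly, which is the shape of the argument being made.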
Autism metaphor dispute
- A commenter jokingly equates a 2% accuracy loss or heavy pruning with “functioning autism.”
- Others strongly push back, clarifying autism is not equivalent to low intellect or generic impairment, and object to using “autism” as a casual synonym for degradation.
- This broadens into discussion of autism subtypes, co‑occurring intellectual disability, and lived experience, with disagreement over whether neurodivergence is “something wrong” vs simply different.
Open technical questions and skepticism
- Readers ask what exactly “2:4 sparsity” means in practice and whether the pruned pattern is random or structured; this remains unclear in the thread.
- There’s curiosity about whether a sparse matrix can be reorganized into a smaller dense model, and whether repeated pruning (beyond 50%), at the cost of more inaccuracy, could still yield useful mini‑models. Back‑of‑the‑envelope Pareto arguments along these lines are treated as clearly over‑optimistic.
- Some note hardware vendors have supported structured sparsity for years, implying the engineering and algorithmic details are nontrivial despite the appealing headline result.
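For the "2:4 sparsity" question left open above: in the common hardware‑supported scheme (e.g., NVIDIA's structured sparsity since Ampere), every group of 4 consecutive weights keeps at most 2 nonzeros, so the pattern is structured rather than random. A minimal sketch that keeps the 2 largest‑magnitude values per group of 4:

```python
import numpy as np

def prune_2_4(w):
    """Enforce 2:4 sparsity along the last axis: in every group of 4
    consecutive weights, keep the 2 largest by magnitude, zero the rest."""
    assert w.shape[-1] % 4 == 0
    groups = w.reshape(-1, 4)
    # indices of the 2 smallest-magnitude entries in each group of 4
    drop = np.argsort(np.abs(groups), axis=1)[:, :2]
    out = groups.copy()
    np.put_along_axis(out, drop, 0.0, axis=1)
    return out.reshape(w.shape)

w = np.array([[0.9, -0.1, 0.05, -1.2,  0.3, 0.2, -0.7, 0.01]])
print(prune_2_4(w))  # each group of 4 keeps only its two largest-magnitude entries
```

The regular group structure is what lets sparse tensor cores skip the zeros at a fixed cost, which is also why hardware support for exactly this pattern has existed for years.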