Bend: a high-level language that runs on GPUs (via HVM2)
Overview & Goals
- Bend is a high-level language with mostly Python-like syntax. It targets HVM2, a runtime that evaluates interaction nets on CPUs and GPUs.
- Main promise: “everything that can run in parallel, will,” with automatic parallelization of pure functional code, including closures and unrestricted recursion on GPUs.
- A future type layer “Kind2” is planned, analogous to TypeScript over JS, but more integrated and proof-capable.
Language Design: bend/fork and Purity
- Core construct: `bend` is a structured recursion/loop that expands a computation tree; it is conceptually dual to `fold` (an anamorphism).
- `fork` is a special built‑in tied to `bend`, representing recursive re‑invocation with new state; the initial `fork(seed)` is implicit in `bend x = seed`.
- Variables are immutable; side effects are constrained, which aids automatic parallelization.
- Some find the Pythonic surface syntax helpful; others think it obscures the functional core or is confusing around `bend`’s implicit return semantics.
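The unfold/fold duality described above can be sketched in plain Python. This is not Bend code and does not reflect how HVM2 evaluates anything; the names `bend_tree` and `fold_sum` are hypothetical, chosen only to show that each recursive call plays the role of a `fork` with new state, and that the two branches do not depend on each other.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Leaf:
    value: int


@dataclass(frozen=True)
class Node:
    left: "Tree"
    right: "Tree"


Tree = Union[Leaf, Node]


def bend_tree(depth: int, limit: int) -> Tree:
    """Unfold (anamorphism): grow a tree from a seed state, here `depth`."""
    if depth < limit:
        # Each recursive call stands in for fork(new_state).
        # The two calls share no mutable state, so they are independent.
        return Node(bend_tree(depth + 1, limit), bend_tree(depth + 1, limit))
    return Leaf(1)


def fold_sum(tree: Tree) -> int:
    """Fold (catamorphism): consume the tree bottom-up into a single value."""
    if isinstance(tree, Leaf):
        return tree.value
    return fold_sum(tree.left) + fold_sum(tree.right)


tree = bend_tree(0, 4)   # seed state 0, stop at depth 4
total = fold_sum(tree)   # counts the leaves, since every leaf holds 1
```

Because every value is immutable, the left and right subtrees could be built or summed in any order, which is the property the discussion credits for automatic parallelization.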
Performance & Benchmarks
- Claimed results: near‑linear scaling with cores on HVM2, especially on GPUs (e.g., large speedups vs single-thread Bend).
- Several commenters re-ran examples and found:
  - Single‑thread VM performance can be extremely slow or buggy on some platforms.
  - For a recursive sum example, optimized C/C++/Haskell/Julia can match or beat GPU Bend on commodity CPUs.
  - More complex, allocation-heavy tasks (e.g., bitonic sort) show more favorable scaling and can outperform GHC with multiple cores.
- The implementation focuses first on correctness of the parallel evaluator; codegen, tail‑call optimization, and loop lowering are acknowledged as “abysmal/early.”
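To make the recursive sum comparison concrete, here is a Python sketch with the same general shape: a binary recursion whose two halves are independent. It is an assumption that this matches the benchmark commenters ran; the function names are hypothetical.

```python
def tree_sum(depth: int, x: int) -> int:
    """Sum the leaves of an implicit binary tree of the given depth."""
    if depth == 0:
        return x
    # The two recursive calls do not depend on each other,
    # so a parallel runtime is free to evaluate them concurrently.
    left = tree_sum(depth - 1, 2 * x)
    right = tree_sum(depth - 1, 2 * x + 1)
    return left + right


def closed_form(depth: int) -> int:
    """Starting from x = 0, the leaves are 0 .. 2**depth - 1."""
    n = 2 ** depth
    return n * (n - 1) // 2


def call_count(depth: int) -> int:
    """Number of tree_sum calls made for a given depth."""
    return 2 ** (depth + 1) - 1


result = tree_sum(10, 0)
```

Each call does one addition and nothing else, so per-call overhead dominates. That is one plausible reason a tight sequential loop in an optimizing compiler stays competitive on this workload, while heavier tasks scale more favorably.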
Limitations & Missing Features
- Current numeric types are 24‑bit (u24/i24/f24) due to packing into 64‑bit interaction‑net nodes; 64‑bit boxed numbers and more numeric/vector types are promised “soon.”
- No native arrays yet; focus is on tree/graph structures. Max heap ~4GB of nodes.
- No tail-call optimization; some examples build huge call stacks.
- Some evaluation semantics (e.g., reducing both sides of conditionals) have correctness caveats in the paper.
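The practical effect of 24‑bit numbers can be sketched in Python. The wraparound arithmetic below follows directly from a 24‑bit width; the tag-plus-payload packing is a purely hypothetical layout (8‑bit tag, 24‑bit payload in a 32‑bit slot) used to illustrate why a payload might be limited to 24 bits, and is not taken from HVM2's actual node format.

```python
U24_BITS = 24
U24_MASK = (1 << U24_BITS) - 1  # 0xFFFFFF


def u24_add(a: int, b: int) -> int:
    """Unsigned 24-bit addition with wraparound."""
    return (a + b) & U24_MASK


def u24_mul(a: int, b: int) -> int:
    """Unsigned 24-bit multiplication with wraparound."""
    return (a * b) & U24_MASK


def i24_from_bits(bits: int) -> int:
    """Interpret the low 24 bits as a two's-complement signed value."""
    bits &= U24_MASK
    return bits - (1 << U24_BITS) if bits & (1 << (U24_BITS - 1)) else bits


# Hypothetical layout, for illustration only: 8-bit tag + 24-bit payload.
def pack_slot(tag: int, payload: int) -> int:
    return ((tag & 0xFF) << U24_BITS) | (payload & U24_MASK)


def unpack_slot(slot: int) -> tuple[int, int]:
    return (slot >> U24_BITS) & 0xFF, slot & U24_MASK


overflowed = u24_add(0xFFFFFF, 1)   # wraps to 0
squared = u24_mul(4096, 4096)       # 2**24 wraps to 0
```

Values above 16,777,215 silently wrap under this scheme, which is why commenters flagged the numeric width as a limitation for general-purpose work.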
Use Cases & Comparisons
- Enthusiasm for: compilers, type checkers, interpreters, evolutionary computation, signal processing, and “general GPU programming without CUDA.”
- Skeptics note: real HPC and ML workloads are tuned to arrays, caches, and BLAS-style kernels; Bend may struggle against specialized CUDA/JAX/Mojo/Futhark.
- Some see it more as a research/existence proof than production‑ready today.
Tooling, Metrics & Backends
- Discussion of how to measure parallelism (time vs “interactions/sec”). Some want FLOPS-like metrics; others argue interactions/sec is the natural unit for interaction nets.
- Suggestions for profiling tools and runtime stats to understand parallelization.
- Current backends: C, CUDA; SPIR‑V/WebGPU/OpenCL and broader GPU support are desired but not yet present.
- FFI exists internally but is not fully exposed; plans include integrating with external GPU kernels and adding textures/strings/arrays.
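The two metrics under discussion, wall time and interactions/sec, can be derived from the same measurements. The sketch below uses made-up numbers and hypothetical helper names; it assumes the runtime reports a total interaction count and an elapsed time for each run.

```python
def interactions_per_second(interactions: int, seconds: float) -> float:
    """Throughput of the evaluator, in interactions per second."""
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return interactions / seconds


def speedup(baseline_seconds: float, parallel_seconds: float) -> float:
    """Wall-time speedup of the parallel run over the baseline run."""
    return baseline_seconds / parallel_seconds


def parallel_efficiency(speedup_value: float, workers: int) -> float:
    """Fraction of ideal linear scaling achieved (1.0 is ideal)."""
    return speedup_value / workers


# Illustrative, invented measurements for one program on two configurations.
interactions = 1_000_000_000
single_thread_seconds = 10.0
parallel_seconds = 0.5
workers = 32

single_ips = interactions_per_second(interactions, single_thread_seconds)
parallel_ips = interactions_per_second(interactions, parallel_seconds)
s = speedup(single_thread_seconds, parallel_seconds)
eff = parallel_efficiency(s, workers)
```

Interactions/sec describes how fast the evaluator itself runs, while wall time against an optimized baseline in another language describes how useful that throughput is for a given task. Reporting both addresses the concerns on each side of the debate.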
Community Reaction & Communication
- Many commenters are excited by unrestricted recursion and closures on GPUs and the clarity of the homepage/readme.
- Others criticize: over‑strong marketing phrases (“future of parallel computation,” “near‑ideal speedup”), lack of comparisons against strong baselines (e.g., JAX/Mojo), and confusing examples.
- There’s meta‑discussion about the harshness of early criticism vs the value of honest benchmarks and clear disclaimers; suggestions include moving performance caveats higher in docs and showing both “weak” and “strong” examples.