Bend: a high-level language that runs on GPUs (via HVM2)

Overview & Goals

  • Bend is a high-level language with mostly Python-like syntax. It targets HVM2, which evaluates interaction nets on CPUs and GPUs.
  • Main promise: “everything that can run in parallel, will,” with automatic parallelization of pure functional code, including closures and unrestricted recursion on GPUs.
  • A future type layer “Kind2” is planned, analogous to TypeScript over JS, but more integrated and proof-capable.

Language Design: bend/fork and Purity

  • Core construct bend is a structured recursion/loop that expands a computation tree. It is the dual of fold: fold consumes a recursive structure (a catamorphism), while bend generates one from a seed (an anamorphism).
  • fork is a special built‑in tied to bend, representing recursive re‑invocation with new state; initial fork(seed) is implicit in bend x = seed.
  • Variables are immutable; side effects are constrained, which aids automatic parallelization. Some find the Pythonic surface syntax helpful; others think it obscures the functional core or is confusing around bend’s implicit return semantics.
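
The bend/fold duality can be sketched in plain Python. This is an analogy only, not Bend syntax: the names bend_tree and fold_sum, the Leaf/Node types, and the depth/limit parameters are all hypothetical. Each recursive call in bend_tree stands in for a fork with new state, and the two calls share no mutable state, which is the property that would let an evaluator run them in parallel.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Leaf:
    value: int


@dataclass(frozen=True)
class Node:
    left: "Tree"
    right: "Tree"


Tree = Union[Leaf, Node]


def bend_tree(depth: int, limit: int) -> Tree:
    """Unfold (anamorphism): grow a perfect binary tree from a seed state."""
    if depth < limit:
        # Each recursive call plays the role of fork(new_state) in this analogy.
        # The two subtrees are independent of each other.
        return Node(bend_tree(depth + 1, limit), bend_tree(depth + 1, limit))
    return Leaf(1)


def fold_sum(tree: Tree) -> int:
    """Fold (catamorphism): consume the tree by summing its leaves."""
    if isinstance(tree, Leaf):
        return tree.value
    return fold_sum(tree.left) + fold_sum(tree.right)


if __name__ == "__main__":
    tree = bend_tree(0, 4)   # seed state 0, stop at depth 4
    print(fold_sum(tree))    # one leaf of value 1 per path: 2**4 leaves
```

The frozen dataclasses mirror the immutability point above: once built, a subtree cannot be changed, so there is nothing for parallel branches to race on.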

Performance & Benchmarks

  • Claimed results: near‑linear scaling with cores on HVM2, especially on GPUs (e.g., large speedups vs single-thread Bend).
  • Several commenters re-ran examples and found:
    • Single‑thread VM performance can be extremely slow or buggy on some platforms.
    • For a recursive sum example, optimized C/C++/Haskell/Julia running on commodity CPUs can match or beat Bend running on a GPU.
    • More complex, allocation-heavy tasks (e.g., bitonic sort) show more favorable scaling and can outperform GHC with multiple cores.
  • The implementation focuses first on correctness of the parallel evaluator; codegen, tail‑call optimization, and loop lowering are acknowledged as “abysmal/early.”
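
For readers who have not seen it, a recursive sum of the kind discussed above can be sketched in Python as follows. This is a sequential illustration of the shape of the computation, not the benchmark code itself; the name tree_sum and its parameters are assumptions. The two recursive calls are independent, which is what a parallel evaluator can exploit, but the work per call is a single addition, which is also why a tight native loop is a strong baseline.

```python
def tree_sum(depth: int, x: int) -> int:
    """Sum the leaf labels of an implicit perfect binary tree.

    The node labelled x has children labelled 2*x and 2*x + 1, so starting
    from x = 0 the leaves at the given depth are 0 .. 2**depth - 1.
    """
    if depth == 0:
        return x
    # The two halves share no state and could be evaluated in parallel.
    left = tree_sum(depth - 1, 2 * x)
    right = tree_sum(depth - 1, 2 * x + 1)
    return left + right


if __name__ == "__main__":
    depth = 10
    total = tree_sum(depth, 0)
    # Closed form for the sum 0 + 1 + ... + (2**depth - 1).
    expected = (2 ** depth) * (2 ** depth - 1) // 2
    print(total, expected)
```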

Limitations & Missing Features

  • Current numeric types are 24‑bit (u24/i24/f24) due to packing into 64‑bit interaction‑net nodes; 64‑bit boxed numbers and more numeric/vector types are promised “soon.”
  • No native arrays yet; focus is on tree/graph structures. Max heap ~4GB of nodes.
  • No tail-call optimization; some examples build huge call stacks.
  • Some evaluation semantics (e.g., reducing both sides of conditionals) have correctness caveats in the paper.
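
To make the 24-bit limitation concrete, the sketch below emulates 24-bit integers in Python. It assumes wraparound (modulo 2^24) arithmetic and a two's-complement reading for the signed case; the thread does not confirm how Bend actually handles overflow, so treat both as assumptions. The helper names are hypothetical.

```python
U24_BITS = 24
U24_MASK = (1 << U24_BITS) - 1  # 0xFFFFFF, the largest unsigned 24-bit value


def u24_add(a: int, b: int) -> int:
    """Add two values and keep only the low 24 bits (assumed wraparound)."""
    return (a + b) & U24_MASK


def u24_mul(a: int, b: int) -> int:
    """Multiply two values and keep only the low 24 bits (assumed wraparound)."""
    return (a * b) & U24_MASK


def to_i24(raw: int) -> int:
    """Reinterpret a 24-bit pattern as a signed two's-complement integer."""
    raw &= U24_MASK
    if raw & (1 << (U24_BITS - 1)):  # sign bit set
        return raw - (1 << U24_BITS)
    return raw


if __name__ == "__main__":
    print(U24_MASK)               # largest representable unsigned value
    print(u24_add(U24_MASK, 1))   # wraps past the top of the range
    print(u24_mul(4096, 4096))    # 2**12 * 2**12 = 2**24, which does not fit
    print(to_i24(0x800000))       # most negative signed 24-bit value
```

Under these assumptions, any counter or accumulator that can exceed roughly 16.7 million needs care, which is small by the standards of typical numeric workloads.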

Use Cases & Comparisons

  • Enthusiasm for: compilers, type checkers, interpreters, evolutionary computation, signal processing, and “general GPU programming without CUDA.”
  • Skeptics note: real HPC and ML workloads are tuned to arrays, caches, and BLAS-style kernels; Bend may struggle against specialized CUDA/JAX/Mojo/Futhark.
  • Some see it more as a research/existence proof than production‑ready today.

Tooling, Metrics & Backends

  • Discussion of how to measure parallelism (time vs “interactions/sec”). Some want FLOPS-like metrics; others argue interactions/sec is the natural unit for interaction nets.
  • Suggestions for profiling tools and runtime stats to understand parallelization.
  • Current backends: C, CUDA; SPIR‑V/WebGPU/OpenCL and broader GPU support are desired but not yet present.
  • FFI exists internally but is not fully exposed; plans include integrating with external GPU kernels and adding textures/strings/arrays.
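
The metrics debate above can be made concrete with a few small helpers. This is a sketch with hypothetical function names and made-up input numbers, not output from any real run: it computes millions of interactions per second (MIPS) from a raw interaction count and wall-clock time, and the speedup and parallel efficiency that wall-clock comparisons would report.

```python
def mips(interactions: int, seconds: float) -> float:
    """Millions of interactions per second for one run."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return interactions / seconds / 1e6


def speedup(baseline_seconds: float, parallel_seconds: float) -> float:
    """How many times faster the parallel run is than the baseline."""
    if baseline_seconds <= 0 or parallel_seconds <= 0:
        raise ValueError("times must be positive")
    return baseline_seconds / parallel_seconds


def efficiency(observed_speedup: float, workers: int) -> float:
    """Fraction of ideal linear scaling achieved (1.0 means perfect)."""
    if workers <= 0:
        raise ValueError("workers must be positive")
    return observed_speedup / workers


if __name__ == "__main__":
    # Made-up numbers, purely for illustration.
    s = speedup(baseline_seconds=10.0, parallel_seconds=2.5)
    print(mips(2_000_000, 1.0), s, efficiency(s, workers=8))
```

Interactions/sec describes evaluator throughput, while speedup against a strong external baseline describes what a user gains. Reporting both addresses each side of the disagreement.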

Community Reaction & Communication

  • Many commenters are excited by unrestricted recursion and closures on GPUs and the clarity of the homepage/readme.
  • Others criticize: over‑strong marketing phrases (“future of parallel computation,” “near‑ideal speedup”), lack of comparisons against strong baselines (e.g., JAX/Mojo), and confusing examples.
  • There’s meta‑discussion about the harshness of early criticism vs the value of honest benchmarks and clear disclaimers; suggestions include moving performance caveats higher in docs and showing both “weak” and “strong” examples.