Liquid Foundation Models: Our First Series of Generative AI Models

Positioning, Openness & Release Strategy

  • Company markets an “open-science” approach (papers, methods, some data) but is not open-sourcing weights; this is widely criticized as a missed opportunity, especially for small models that are most useful when runnable locally.
  • Offering API-only access while benchmarking against open models strikes many as inconsistent; some ask what the point of a reduced memory footprint is if users cannot self-host.
  • Lack of a detailed technical paper at launch frustrates many; current info is mainly a citation list and high-level claims.
  • Several comments accuse the benchmark presentation of cherry-picking (e.g., omitting strong baselines like Qwen2.5 14B, emphasizing only favorable metrics, and using visual tricks in charts).

Architecture & Novelty Claims

  • The models are presented as non‑transformer “Liquid Foundation Models,” drawing on liquid neural network and neural ODE–style research.
  • Some users are excited by any credible non‑transformer alternative (alongside Mamba, Hyena, RWKV, etc.).
  • Others find the public explanation vague (token-mixing, channel-mixing, “featurization” buzzwords) and want concrete architectural details and ablations.
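To make the buzzwords concrete: "token mixing" and "channel mixing" are generic terms (popularized by MLP-Mixer-style architectures, not specific to LFMs) for the two axes along which a sequence model can combine information. A minimal NumPy sketch of the distinction, with toy sizes chosen for illustration:

```python
import numpy as np

# Generic illustration only -- NOT LFM's actual (undisclosed) architecture.
# Toy dimensions chosen arbitrarily for the example.
rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
x = rng.standard_normal((seq_len, d_model))  # (positions, features)

# Token mixing: a matrix acts along the SEQUENCE axis, so each output
# position is a combination of all input positions.
W_tok = rng.standard_normal((seq_len, seq_len))
token_mixed = W_tok @ x          # shape stays (seq_len, d_model)

# Channel mixing: a matrix acts along the FEATURE axis, so each position's
# features are recombined independently of other positions.
W_ch = rng.standard_normal((d_model, d_model))
channel_mixed = x @ W_ch         # shape stays (seq_len, d_model)

print(token_mixed.shape, channel_mixed.shape)
```

Commenters' complaint is precisely that the launch material names these operations without saying what LFMs actually compute in each role.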

Observed Behavior & Capability

  • Speed is praised: responses feel near‑instant compared to many current APIs.
  • Quality is mixed:
    • Good at trivia, light essays, and simple medical/engineering questions; style can be engaging.
    • Frequently fails at basic logic, numeric reasoning, and coding tasks; people report GPT‑2‑like mistakes.
  • Numerous prompts cause obvious failures: infinite loops, repeated lines, crashes, or “try again later” errors (e.g., multilingual poetry, tricky formatting constraints, asking for current time/date).
  • Classic failure modes appear: bad answers to simple word problems, misquoting book openings, strong hallucinations (e.g., fabricated death of a public figure).

Counting, Math & Tokenization Debate

  • The model itself lists “precise numerical calculations” and counting the letters in “strawberry” as weaknesses, yet the marketing copy claims strength in “mathematics and logical reasoning”; several commenters call out the inconsistency.
  • Long subthread debates whether letter-count failures are meaningful:
    • One side: trivial tasks computers have solved for decades; failures expose serious limitations.
    • Other side: character-level tasks clash with token-based training and are more about architecture/tokenization than “intelligence.”
  • Several users note practical workarounds: have the model write and run code for counting or date math instead of relying on its internal arithmetic.
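The workaround commenters describe is code-as-tool: rather than trusting the model's internal arithmetic, have it emit and execute a snippet like the following (the specific examples here are ours, chosen to match the failure modes discussed above):

```python
from datetime import date

# Letter counting -- the "strawberry" case that the model itself
# flags as a weakness:
r_count = "strawberry".count("r")
print(r_count)  # → 3

# Date math -- another task token-based models often bungle:
days = (date(2024, 12, 25) - date(2024, 10, 1)).days
print(days)  # → 85
```

Run deterministically, the code sidesteps tokenization entirely, which is why this pattern generalizes to most counting and calendar questions.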

Use Cases, Context Length & Market Fatigue

  • Some see small, efficient models with long effective context as the next frontier, especially for whole-codebase tasks and multimodal parsing.
  • Others argue small models must be open to matter; otherwise APIs for larger frontier models are already “cheap enough.”
  • Commenters voice general fatigue with yet‑another‑model launches, calling for more focus on actual products and less on marginal new chatbots.