Farewell Pandas, and thanks for all the fish

What Ibis Is and How It Positions Itself

  • Ibis is described as a dataframe‑style Python API that can target many backends (DuckDB, Polars, DataFusion, Spark, cloud warehouses, and others).
  • Not an ORM: it doesn’t model entities or relationships; it is an expression/query layer over tabular data.
  • Goal is to reduce vendor lock‑in and let users swap execution engines without rewriting logic.

Deprecating Pandas and Dask Backends

  • Ibis is deprecating its pandas and Dask execution backends, but:
    • Pandas input/output remains supported (e.g., to_pandas() and accepting pandas DataFrames).
  • Rationale given:
    • DuckDB (and other engines) can do everything the pandas backend did, faster and more scalably.
    • Pandas backend adds technical complexity with little benefit when stronger OLAP engines exist.
  • Some commenters find the “Farewell Pandas” messaging misleading, given that pandas input/output stays; they worry about losing a “bridge” for pandas users and about the marketing downside.

Pandas vs Newer Engines

  • Pro‑pandas:
    • Ubiquitous ecosystem and integration; “just works” for small/medium tasks and last‑mile processing.
    • Features like custom extension dtypes and MultiIndex are valued by some.
    • With good style (method chaining, pyarrow dtypes), many users find it effective.
  • Anti‑pandas:
    • Criticized for API ergonomics, null/NaN handling, performance, and complex internals (especially indexing/MultiIndex).
    • Some say pandas made them think they disliked Python, or that they prefer R/tidyverse.
  • Ibis maintainers explicitly reject MultiIndex and implicit row ordering as too complex and hard to scale; users must specify order_by wherever order matters.

Engines: DuckDB, Polars, Dask, and Clusters

  • DuckDB is praised as a fast single‑node OLAP engine; better suited to analytics than SQLite, though not a transactional replacement.
  • Polars is seen as very fast and memory‑efficient for in‑memory workloads, but:
    • Some report that its current streaming/out‑of‑core support is weaker than that of DuckDB or DataFusion, citing memory problems and segfaults.
    • Others highlight an upcoming new streaming engine.
  • Dask is viewed as good for multi‑node or larger‑than‑RAM workloads, but often slower than single‑node tools because of its overhead; some recommend jumping directly to Spark or another cluster system at true “big data” scale.

Ecosystem, Churn, and Risk

  • Many emphasize that pandas’ massive ecosystem is a key reason to stick with it.
  • Some plan to wait years before adopting “neo‑pandas” tools due to fragmentation and rapid churn.
  • Others are enthusiastic about Ibis as a single entry point whose syntax can survive backend changes over time.