2024-08-29

Farewell Pandas, and thanks for all the fish

What Ibis Is and How It Positions Itself

Ibis is described as a dataframe-style API that can target many backends (DuckDB, Polars, DataFusion, Spark, cloud warehouses, and others).
It is not an ORM: it does not model entities or relationships; it is an expression/query layer over tabular data.
The goal is to reduce vendor lock‑in and let users swap execution engines without rewriting logic.

Deprecating Pandas and Dask Backends

Ibis is deprecating its pandas and Dask execution backends, but:
- Pandas input/output remains supported (e.g., to_pandas() and accepting pandas DataFrames).
Rationale given:
- DuckDB (and other engines) can do everything the pandas backend did, faster and more scalably.
- Pandas backend adds technical complexity with little benefit when stronger OLAP engines exist.
Some commenters see the “Farewell Pandas” messaging as misleading, since only the execution backend is going away, and worry that Ibis loses a “bridge” for pandas users and takes a marketing hit.

Pandas vs Newer Engines

Pro‑pandas:
- Ubiquitous ecosystem and integration; “just works” for small/medium tasks and last‑mile processing.
- Features like custom extension dtypes and MultiIndex are valued by some.
- With good style (method chaining, pyarrow dtypes), many users find it effective.
Anti‑pandas:
- Criticized for API ergonomics, null/NaN handling, performance, and complex internals (especially indexing/MultiIndex).
- Some say pandas made them think they disliked Python, or that they prefer R/tidyverse.
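
One commonly cited instance of the null/NaN complaint is shown below: with default dtypes, a single missing value silently turns an integer column into floats, while the opt-in nullable dtype keeps integers and a distinct missing marker. The values are arbitrary.

```python
import pandas as pd

# Default inference: the missing value becomes NaN and the column becomes float64.
default = pd.Series([1, 2, None])

# Opt-in nullable integer dtype: values stay integers, missing is pd.NA.
nullable = pd.Series([1, 2, None], dtype="Int64")

print(default.dtype, nullable.dtype)
```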
Ibis maintainers explicitly reject MultiIndex and implicit ordering as too complex and hard to scale; users must specify order_by.

Engines: DuckDB, Polars, Dask, and Clusters

DuckDB is praised as a fast single‑node OLAP engine; better suited to analytics than SQLite, though not a transactional replacement.
Polars is seen as very fast and memory‑efficient for in‑memory workloads, but:
- Current streaming/out‑of‑core support is reported as weaker than DuckDB/DataFusion by some, with memory/segfault issues noted.
- Others highlight an upcoming new streaming engine.
Dask is viewed as good for multi‑node or larger‑than‑RAM workloads but often slower than single‑node tools due to overhead; some recommend jumping directly to Spark or other cluster systems at true “big data” scale.

Ecosystem, Churn, and Risk

Many emphasize that pandas’ massive ecosystem is a key reason to stick with it.
Some plan to wait years before adopting “neo‑pandas” tools due to fragmentation and rapid churn.
Others are enthusiastic about Ibis as a single entry point whose syntax can survive backend changes over time.
