Farewell Pandas, and thanks for all the fish
What Ibis Is and How It Positions Itself
- Described as a dataframe-style API that can target many backends (DuckDB, Polars, DataFusion, Spark, cloud warehouses, etc.).
- Not an ORM: it doesn’t model entities/relationships, it’s an expression/query layer over tabular data.
- Goal is to reduce vendor lock‑in and let users swap execution engines without rewriting logic.
Deprecating Pandas and Dask Backends
- Ibis is deprecating its pandas and Dask execution backends, but:
- Pandas input/output remains supported (e.g., to_pandas() and accepting pandas DataFrames).
- Rationale given:
- DuckDB (and other engines) can do everything the pandas backend did, faster and more scalably.
- Pandas backend adds technical complexity with little benefit when stronger OLAP engines exist.
- Some see the “Farewell Pandas” messaging as misleading, fearing loss of a “bridge” and marketing downside.
Pandas vs Newer Engines
- Pro‑pandas:
- Ubiquitous ecosystem and integration; “just works” for small/medium tasks and last‑mile processing.
- Features like custom extension dtypes and MultiIndex are valued by some.
- With good style (method chaining, pyarrow dtypes), many users find it effective.
- Anti‑pandas:
- Criticized for API ergonomics, null/NaN handling, performance, and complex internals (especially indexing/MultiIndex).
- Some say pandas made them think they disliked Python, or say they prefer R/tidyverse.
- Ibis maintainers explicitly reject MultiIndex and implicit ordering as too complex and hard to scale; users must specify order_by.
Engines: DuckDB, Polars, Dask, and Clusters
- DuckDB is praised as a fast single‑node OLAP engine; better suited to analytics than SQLite, though not a transactional replacement.
- Polars is seen as very fast and memory‑efficient for in‑memory workloads, but:
- Current streaming/out‑of‑core support is reported as weaker than DuckDB/DataFusion by some, with memory/segfault issues noted.
- Others highlight an upcoming new streaming engine.
- Dask is viewed as good for multi‑node or larger‑than‑RAM workloads but often slower than single‑node tools due to overhead; some recommend jumping directly to Spark or other cluster systems at true “big data” scale.
Ecosystem, Churn, and Risk
- Many emphasize that pandas’ massive ecosystem is a key reason to stick with it.
- Some plan to wait years before adopting “neo‑pandas” tools due to fragmentation and rapid churn.
- Others are enthusiastic about Ibis as a single entry point whose syntax can survive backend changes over time.