2025-11-25

Python is not a great language for data science

Scope and thesis of the article

Many commenters find the piece well written but under-argued: the main post mostly contrasts Python and R code snippets, and the thesis is stated in full only in a sequel. There, Python’s issues for data science are listed as reference semantics, no built-in missing values, no built-in vectorization, and no non‑standard evaluation.
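For readers unfamiliar with the first three complaints, here is a minimal sketch in plain Python of what they look like in practice. The variable names and values are made up for illustration and are not taken from the article or its sequel; non‑standard evaluation is left out because core Python has no direct equivalent to demonstrate.

```python
import math

# Reference semantics: assignment binds a second name to the same list.
scores = [1.0, 2.0, 3.0]
alias = scores
alias.append(4.0)            # the change is also visible through `scores`
shared = scores is alias     # one object, two names

snapshot = list(scores)      # an explicit copy is needed for independence
snapshot.append(5.0)         # `scores` is not affected by this

# No built-in missing value: NaN and None are common stand-ins,
# and they behave differently from each other.
with_nan = [1.0, float("nan"), 3.0]
nan_total = sum(with_nan)                        # NaN propagates silently
nan_equals_itself = with_nan[1] == with_nan[1]   # NaN never equals itself

with_none = [1.0, None, 3.0]
try:
    sum(with_none)
    none_error = None
except TypeError as exc:     # None cannot be added to a float
    none_error = type(exc).__name__

# No built-in vectorization: elementwise arithmetic on a list needs an
# explicit loop or comprehension.
doubled = [x * 2 for x in scores]
repeated = scores * 2        # list repetition, not elementwise multiplication

print(shared, math.isnan(nan_total), nan_equals_itself, none_error)
print(doubled, len(repeated))
```

Libraries such as NumPy and pandas address the last two points, which is why the thread distinguishes what is built into the language from what the ecosystem provides.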
Some find the article’s examples weak or contrived (e.g., manually computing means and standard deviations instead of using the statistics module or NumPy), arguing that this exaggerates Python’s shortcomings.
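The objection about hand-computed statistics can be illustrated with a short sketch. The data values below are invented for the example and the hand-rolled version is only an assumption about what "manual" computation looks like, not a copy of the article's code. It uses the standard library's statistics module so that nothing needs to be installed.

```python
import math
import statistics

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Hand-rolled version: explicit arithmetic for the mean and sample SD.
n = len(values)
manual_mean = sum(values) / n
manual_sd = math.sqrt(sum((v - manual_mean) ** 2 for v in values) / (n - 1))

# Standard-library version: one call each.
library_mean = statistics.mean(values)
library_sd = statistics.stdev(values)   # sample SD (n - 1 denominator)

print(manual_mean, library_mean)
print(math.isclose(manual_sd, library_sd))
```

Both versions give the same numbers; the difference is only in how much code the reader has to check.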

Why Python dominates data science

Strong consensus that Python’s success is driven by ecosystem and network effects, not inherent suitability:
- Huge library support (NumPy, pandas/Polars, scikit‑learn, PyTorch, Jupyter, etc.).
- General‑purpose “glue” language: OK at scraping, file and format handling, orchestration, and integration with databases, C/C++/Fortran, GPUs.
- Easy for non‑programmers and cross‑discipline teams; code is widely readable and reviewable.
Several note that hiring, teaching, and production engineering all strongly favor Python; R, SAS, Matlab, etc. are seen as niche or expensive.

R vs Python in practice

Many practitioners use both:
- R (especially tidyverse/data.table + ggplot) favored for exploratory analysis, tabular wrangling, and plotting; code often shorter and closer to statistical thinking.
- Python preferred for “logistics”: file juggling, large‑scale pipelines, reproducible deployments, and integration into larger software systems.
Productionizing R is widely described as painful; a common pattern is to prototype in R, then rewrite in another language.
Others push back that R has serious quirks (non‑standard evaluation, indexing oddities, silent NA behaviors) and can be fragile for larger software.

Tables, dataframes, and language design

A big subthread argues the real problem is that mainstream languages don’t treat tables/dataframes as first‑class citizens; instead users learn mini‑languages (pandas, dplyr, Polars).
Suggestions and examples span SQL, q/kdb, Clojure, Rye, Lil, Nushell, APL, Matlab, Julia, Fortran, and Excel‑style tools.
Some think SQL + tools like DuckDB are a cleaner core for tabular work, with Python or R around the edges; others prefer staying in a dataframe‑centric DSL.
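The "SQL at the core, Python around the edges" pattern can be sketched as follows. This is only an illustration: it uses the standard library's sqlite3 module as a stand-in for DuckDB so that it runs without extra installs, and the table name, column names, and rows are hypothetical.

```python
import sqlite3

# Python handles setup and data loading ("the edges").
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (site TEXT, reading REAL)")
rows = [("a", 1.0), ("a", 3.0), ("b", 10.0), ("b", None), ("b", 20.0)]
conn.executemany("INSERT INTO measurements VALUES (?, ?)", rows)

# SQL expresses the tabular logic ("the core"). NULL acts as a built-in
# missing value: COUNT(reading) and AVG(reading) both skip it.
query = """
    SELECT site, COUNT(reading) AS n, AVG(reading) AS mean_reading
    FROM measurements
    GROUP BY site
    ORDER BY site
"""
summary = conn.execute(query).fetchall()
conn.close()

for site, n, mean_reading in summary:
    print(site, n, mean_reading)
```

The grouping, aggregation, and missing-value handling all live in the query, so the surrounding Python stays short. The dataframe-centric camp would write the same logic as a chain of method calls instead.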

Broader language comparisons and “good enough”

Multiple commenters claim no current language is truly “great” for data science; Python and R are both compromises.
Julia, Clojure, Kotlin, Nim, SAS, Matlab, and even shell pipelines are mentioned as promising or domain‑strong but lacking Python’s momentum.
Common conclusion: Python isn’t the best for data science, but it’s “good enough” at nearly everything and wins on ubiquity, tooling, and ecosystem.
