Python is not a great language for data science
Scope and thesis of the article
- Many commenters think the piece is well written but under-argued: the main post mostly contrasts Python and R code snippets, and the thesis is only fully stated in a sequel. That thesis is that Python’s issues for data science are reference semantics, no built-in missing values, no built-in vectorization, and no non‑standard evaluation.
- Some find the examples weak or contrived (e.g., manually computing means/SDs instead of using the standard-library statistics module or NumPy), arguing this exaggerates Python’s shortcomings.
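As a rough illustration of the points above (not the article's actual snippets), the sketch below compares a hand-rolled mean/SD with the standard-library `statistics` module, then shows two of the issues named in the thesis: no built-in missing value, and reference semantics. All sample values and variable names are invented for the example.

```python
import math
import statistics

# Hypothetical sample values, invented for illustration only.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Hand-rolled version, in the style commenters called contrived.
n = len(values)
manual_mean = sum(values) / n
manual_sd = math.sqrt(sum((x - manual_mean) ** 2 for x in values) / (n - 1))

# Standard-library version: same sample mean and sample standard deviation.
stdlib_mean = statistics.mean(values)
stdlib_sd = statistics.stdev(values)

# No built-in missing value: None is a common stand-in and must be
# filtered out explicitly before computing a summary.
with_gaps = [2.0, None, 4.0]
clean_mean = statistics.mean(x for x in with_gaps if x is not None)

# Reference semantics: assignment binds a second name to the same list,
# so a change made through one name is visible through the other.
original = [1, 2, 3]
alias = original
alias.append(4)

print(manual_mean, stdlib_mean)
print(math.isclose(manual_sd, stdlib_sd))
print(clean_mean)
print(original)
```

The two mean/SD versions agree, which is the commenters' point: the longer form is a choice, not something the language forces.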
Why Python dominates data science
- Strong consensus that Python’s success is driven by ecosystem and network effects, not inherent suitability:
  - Huge library support (NumPy, pandas/Polars, scikit‑learn, PyTorch, Jupyter, etc.).
  - General‑purpose “glue” language: OK at scraping, file and format handling, orchestration, and integration with databases, C/C++/Fortran, GPUs.
  - Easy for non‑programmers and cross‑discipline teams; code is widely readable and reviewable.
- Several note that hiring, teaching, and production engineering all strongly favor Python; R, SAS, Matlab, etc. are seen as niche or expensive.
R vs Python in practice
- Many practitioners use both:
  - R (especially tidyverse/data.table + ggplot) favored for exploratory analysis, tabular wrangling, and plotting; code often shorter and closer to statistical thinking.
  - Python preferred for “logistics”: file juggling, large‑scale pipelines, reproducible deployments, and integration into larger software systems.
- Productionizing R is widely described as painful; a common pattern is to prototype in R, then rewrite in another language.
- Others push back that R has serious quirks (non‑standard evaluation, indexing oddities, silent NA behaviors) and can be fragile for larger software.
Tables, dataframes, and language design
- A big subthread argues the real problem is that mainstream languages don’t treat tables/dataframes as first‑class citizens; instead users learn mini‑languages (pandas, dplyr, Polars).
- Suggestions and examples span SQL, q/kdb, Clojure, Rye, Lil, Nushell, APL, Matlab, Julia, Fortran, and Excel‑style tools.
- Some think SQL + tools like DuckDB are a cleaner core for tabular work, with Python or R around the edges; others prefer staying in a dataframe‑centric DSL.
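A minimal sketch of the "SQL core, Python around the edges" pattern described above. It uses the standard-library `sqlite3` module as a stand-in for DuckDB (an assumption made to keep the example dependency-free); the table name, column names, and rows are hypothetical.

```python
import sqlite3

# In-memory database; sqlite3 stands in for DuckDB here (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (grp TEXT, value REAL)")

# Hypothetical rows; None is stored as SQL NULL.
rows = [("a", 1.0), ("a", 3.0), ("b", 10.0), ("b", None)]
conn.executemany("INSERT INTO measurements VALUES (?, ?)", rows)

# The tabular logic lives in SQL: COUNT(value) and AVG(value) skip NULLs.
query = """
    SELECT grp, COUNT(value) AS n, AVG(value) AS mean_value
    FROM measurements
    GROUP BY grp
    ORDER BY grp
"""
summary = conn.execute(query).fetchall()
conn.close()

# Python handles the surrounding "edges": loading, formatting, reporting.
for grp, n, mean_value in summary:
    print(f"{grp}: n={n}, mean={mean_value:.2f}")
```

The grouping and missing-value handling are expressed once in SQL, while Python only moves data in and out, which is roughly the division of labor these commenters favor.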
Broader language comparisons and “good enough”
- Multiple commenters claim no current language is truly “great” for data science; Python and R are both compromises.
- Julia, Clojure, Kotlin, Nim, SAS, Matlab, and even shell pipelines are mentioned as promising or domain‑strong but lacking Python’s momentum.
- Common conclusion: Python isn’t the best for data science, but it’s “good enough” at nearly everything and wins on ubiquity, tooling, and ecosystem.