2024-06-19

Open Source Python ETL

Product & Feature Overview

Amphi is presented as a low-code Python ETL tool focused on both structured and unstructured data.
Main use cases: file integration, data prep, migrations, and AI/RAG pipelines.
Distinguishing pitch: drag‑and‑drop GUI that generates plain Python (pandas-based) code and JSON pipeline definitions, which users can own and deploy anywhere.
Available as a standalone web app and a JupyterLab extension; leverages Jupyter’s existing ecosystem (e.g., Git, S3 file systems).
Currently supports pre-built input components; custom inputs are planned.
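
As a rough illustration of the "plain pandas code" pitch above, a three-node visual pipeline (file input, filter, aggregate) might correspond to a script like the following. This is a hand-written sketch, not actual Amphi output; the column names, threshold, and inline CSV are assumptions made up for the example.

```python
import io

import pandas as pd

# Inline sample data standing in for a file on disk (hypothetical columns).
RAW_CSV = """region,amount
north,50
north,150
south,200
south,300
"""

# Node 1 (file input): read the CSV into a DataFrame.
orders = pd.read_csv(io.StringIO(RAW_CSV))

# Node 2 (filter): keep rows at or above an assumed threshold of 100.
large_orders = orders[orders["amount"] >= 100]

# Node 3 (aggregate): total amount per region.
totals = large_orders.groupby("region", as_index=False)["amount"].sum()

print(totals)
```

Because the result is ordinary Python, it can be committed to Git and run wherever pandas is installed, which is the portability argument made in the pitch.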

Licensing and “Open Source” Debate

Code is on GitHub under Elastic License v2 (ELv2).
Several commenters stress this is “source available,” not OSI-compliant open source.
Some view the “open source” labeling (including the HN title) as misleading or promotional; others argue everyday usage of “open source” is looser and see complaints as pedantic.
There is agreement that, under formal definitions, it should not be called open source.

Comparisons to Existing Tools

Commenters compare Amphi to Alteryx, Informatica, Talend, Pentaho, SSIS, Azure Data Factory, NiFi, Elyra, dbt, Airflow, Dagster, Prefect, Windmill, Fivetran, dlt, Meltano, and Databricks Lakeflow.
Amphi is framed as:
- More low‑code/graphical than Dagster/Prefect/dbt.
- More Python-focused than traditional Java/enterprise GUI ETLs.
- More transformation/file/AI-oriented than Fivetran‑style ingestion tools.

Low‑Code vs Code‑First ETL

Some see low‑code as a regression after the “ETL as code” shift (Airflow/Luigi, etc.), citing:
- Poor modularity, observability, and scalability in GUI tools, along with vendor lock‑in.
- Difficulty versioning, testing, and applying CI/CD.
Others argue:
- Visual DAGs make complex flows easier to understand at a glance.
- Low‑code boosts productivity for small teams and democratizes ETL for less technical users.
- Both approaches will coexist, depending on team skills and use cases.
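
The testing and CI/CD point from the code-first side can be sketched as follows: when a transformation is a plain function, a small assertion-based test can run on every commit. The function name, field names, and cleaning rules here are hypothetical and chosen only for illustration.

```python
def normalize_orders(rows):
    """Drop rows without an order id, trim the id, and cast amount to float."""
    cleaned = []
    for row in rows:
        if not row.get("order_id"):
            continue  # skip rows with a missing or empty id
        cleaned.append(
            {"order_id": row["order_id"].strip(), "amount": float(row["amount"])}
        )
    return cleaned


def test_normalize_orders_drops_missing_ids():
    rows = [
        {"order_id": " A1 ", "amount": "10.5"},
        {"order_id": "", "amount": "3"},
    ]
    assert normalize_orders(rows) == [{"order_id": "A1", "amount": 10.5}]


# A test runner such as pytest would discover the test; calling it directly also works.
test_normalize_orders_drops_missing_ids()
```

A tool that exports its pipelines as readable Python could in principle be tested the same way, which is part of why commenters expect the two approaches to coexist.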

Architecture, Scaling, and Performance

Amphi generates pandas code, with optional scaling via Modin (including Dask backends); future plans include Spark and Snowflake support.
Infrastructure orchestration (multi-machine, clusters) is largely manual at this stage.
Concerns raised that pandas-centric ETL is more memory‑heavy and less efficient than SQL; the author references a write‑up justifying pandas in some contexts.
The “Python ETL” label is questioned, given that most of the repo is TypeScript frontend code.
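
The Modin scaling path mentioned in this section relies on Modin exposing a pandas-compatible API, so pandas-style code can often be switched over by changing the import. The sketch below assumes Modin may or may not be installed and falls back to pandas; it is not taken from Amphi's generated code, and the data is invented.

```python
# Assumption: if Modin is installed, use it as a drop-in for pandas;
# otherwise fall back to plain pandas so the script still runs.
try:
    import modin.pandas as pd
    BACKEND = "modin"
except ImportError:
    import pandas as pd
    BACKEND = "pandas"

# Modin picks its execution engine from configuration, for example the
# MODIN_ENGINE environment variable (e.g. "dask"); nothing is set here.

df = pd.DataFrame(
    {"region": ["north", "north", "south"], "amount": [150, 50, 200]}
)

# The same pandas-style call works with either backend.
totals = df.groupby("region")["amount"].sum()

print(BACKEND, totals.to_dict())
```

The import swap only changes how work is executed on the available engine; provisioning multiple machines or a cluster is still a separate, manual step, as noted above.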

Self‑Serve Data Work and Skills

Debate over whether enabling non‑CS staff to build ETLs is beneficial:
- Critics worry about data quality issues, fragile pipelines, and lack of engineering rigor.
- Others emphasize training, mentoring, and policy rather than gatekeeping; self‑serve is seen as useful for simple needs, with experts stepping in for complex cases.
Several anecdotes describe failed “self‑serve BI/ETL” initiatives where business users ultimately relied on engineers anyway.
