Open Source Python ETL
Product & Feature Overview
- Amphi is presented as a low-code Python ETL tool focused on both structured and unstructured data.
- Main use cases: file integration, data prep, migrations, and AI/RAG pipelines.
- Distinguishing pitch: drag‑and‑drop GUI that generates plain Python (pandas-based) code and JSON pipeline definitions, which users can own and deploy anywhere.
- Available as a standalone web app and a JupyterLab extension; leverages Jupyter’s existing ecosystem (e.g., Git, S3 file systems).
- Currently supports pre-built input components; custom inputs are planned.
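To make the "plain Python you can own" pitch concrete, here is a minimal sketch of the kind of pandas script a three-step visual pipeline (read, filter, aggregate) could plausibly export. It is not Amphi's actual output: the step layout, the variable names, and the inline CSV data are all assumptions made for illustration.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for a file input component.
RAW_CSV = "region,amount\nnorth,120\nsouth,80\nnorth,300\nsouth,150\n"

# Step 1 (assumed "CSV input" component): load the file into a DataFrame.
orders = pd.read_csv(io.StringIO(RAW_CSV))

# Step 2 (assumed "filter" component): keep rows with amount >= 100.
large_orders = orders[orders["amount"] >= 100]

# Step 3 (assumed "aggregate" component): total amount per region.
result = large_orders.groupby("region", as_index=False)["amount"].sum()

print(result)
```

Because a script like this depends only on pandas, it can be committed to Git and run on any machine with Python installed, which is the portability argument made in the thread.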
Licensing and “Open Source” Debate
- Code is on GitHub under Elastic License v2 (ELv2).
- Several commenters stress this is “source available,” not OSI-compliant open source.
- Some view the “open source” labeling (including the HN title) as misleading or promotional; others argue everyday usage of “open source” is looser and see complaints as pedantic.
- Commenters generally agree that, under formal definitions such as the OSI's, it should not be called open source.
Comparisons to Existing Tools
- Commenters compare Amphi to Alteryx, Informatica, Talend, Pentaho, SSIS, Azure Data Factory, NiFi, Elyra, dbt, Airflow, Dagster, Prefect, Windmill, Fivetran, dlt, Meltano, and Databricks Lakeflow.
- Amphi is framed as:
  - More low‑code/graphical than Dagster/Prefect/dbt.
  - More Python-focused than traditional Java/enterprise GUI ETLs.
  - More transformation/file/AI-oriented than Fivetran‑style ingestion tools.
Low‑Code vs Code‑First ETL
- Some see low‑code as a regression after the “ETL as code” shift (Airflow/Luigi, etc.), citing:
  - Poor modularity, observability, scalability, and vendor lock‑in in GUI tools.
  - Difficulty versioning, testing, and applying CI/CD.
- Others argue:
  - Visual DAGs make complex flows easier to understand at a glance.
  - Low‑code boosts productivity for small teams and democratizes ETL for less technical users.
  - Both approaches will coexist, depending on team skills and use cases.
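The testing and CI/CD point from the code-first side can be illustrated with a small sketch: when a transformation is an ordinary function, a plain assertion-based test can run on every commit. The function `clean_orders`, its field names, and the sample rows are hypothetical and do not come from the thread or from any of the tools mentioned.

```python
def clean_orders(rows):
    """Drop rows without an order id and normalize the remaining fields."""
    cleaned = []
    for row in rows:
        order_id = (row.get("order_id") or "").strip()
        if not order_id:
            # Rows with a missing or blank id are discarded.
            continue
        cleaned.append({"order_id": order_id, "amount": float(row["amount"])})
    return cleaned


def test_clean_orders():
    rows = [
        {"order_id": " A1 ", "amount": "10.5"},
        {"order_id": "", "amount": "3"},
        {"order_id": None, "amount": "7"},
        {"order_id": "B2", "amount": 4},
    ]
    assert clean_orders(rows) == [
        {"order_id": "A1", "amount": 10.5},
        {"order_id": "B2", "amount": 4.0},
    ]


test_clean_orders()
```

A tool that exports plain Python, as Amphi is described as doing, could in principle have its generated code tested the same way, though the thread does not say whether anyone does this in practice.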
Architecture, Scaling, and Performance
- Amphi generates pandas code, with optional scaling via Modin (including Dask backends); future plans include Spark and Snowflake support.
- Infrastructure orchestration (multi-machine, clusters) is largely manual at this stage.
- Commenters raise concerns that pandas-centric ETL is more memory‑heavy and less efficient than SQL; the author points to a write‑up justifying pandas in some contexts.
- The “Python ETL” label is questioned, since most of the repository is TypeScript (the frontend).
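As a rough sketch of what "optional scaling via Modin" usually means in practice: Modin exposes a pandas-compatible API, so a script can often be switched by changing one import. The example below runs on plain pandas and shows the swap only as a comment. How Amphi actually wires this up is not described in the thread, and the data and function name are made up for illustration.

```python
import io

import pandas as pd
# Assumed drop-in swap (requires Modin plus an engine such as Dask or Ray):
# import modin.pandas as pd

# Hypothetical sample data.
CSV = "user,bytes\na,10\nb,20\na,5\n"


def total_bytes_per_user(csv_text):
    """Sum the 'bytes' column per user and return a plain dict."""
    df = pd.read_csv(io.StringIO(csv_text))
    return df.groupby("user")["bytes"].sum().to_dict()


totals = total_bytes_per_user(CSV)
print(totals)
```

The swap changes how work is distributed, not the underlying DataFrame model, so it does not by itself address the pandas-versus-SQL efficiency concern raised above.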
Self‑Serve Data Work and Skills
- Debate over whether enabling non‑CS staff to build ETLs is beneficial:
  - Critics worry about data quality issues, fragile pipelines, and lack of engineering rigor.
  - Others emphasize training, mentoring, and policy rather than gatekeeping; self‑serve is seen as useful for simple needs, with experts stepping in for complex cases.
- Several anecdotes describe failed “self‑serve BI/ETL” initiatives where business users ultimately relied on engineers anyway.