Open Source Python ETL

Product & Feature Overview

  • Amphi is presented as a low-code Python ETL tool focused on both structured and unstructured data.
  • Main use cases: file integration, data prep, migrations, and AI/RAG pipelines.
  • Distinguishing pitch: drag‑and‑drop GUI that generates plain Python (pandas-based) code and JSON pipeline definitions, which users can own and deploy anywhere.
  • Available as a standalone web app and a JupyterLab extension; leverages Jupyter’s existing ecosystem (e.g., Git, S3 file systems).
  • Currently supports pre-built input components; custom inputs are planned.
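
The "GUI produces a JSON pipeline definition plus plain pandas code" idea can be sketched roughly as follows. This is only an illustration: the JSON layout, the node types (`csv_input`, `filter`, `csv_output`), and the `generate_code` helper are hypothetical and are not Amphi's actual schema or generator.

```python
import json

# Hypothetical pipeline definition; NOT Amphi's actual JSON schema.
PIPELINE_JSON = """
{
  "nodes": [
    {"id": "orders", "type": "csv_input", "path": "orders.csv"},
    {"id": "paid", "type": "filter", "source": "orders", "expr": "status == 'paid'"},
    {"id": "out", "type": "csv_output", "source": "paid", "path": "paid_orders.csv"}
  ]
}
"""


def generate_code(pipeline):
    """Turn the (assumed) node list into plain pandas source code."""
    lines = ["import pandas as pd", ""]
    for node in pipeline["nodes"]:
        kind = node["type"]
        if kind == "csv_input":
            lines.append(f"{node['id']} = pd.read_csv({node['path']!r})")
        elif kind == "filter":
            lines.append(f"{node['id']} = {node['source']}.query({node['expr']!r})")
        elif kind == "csv_output":
            lines.append(f"{node['source']}.to_csv({node['path']!r}, index=False)")
        else:
            raise ValueError(f"unknown node type: {kind}")
    return "\n".join(lines)


# The generated text is ordinary Python that could be committed and deployed.
generated = generate_code(json.loads(PIPELINE_JSON))
print(generated)
```

The point of the sketch is that the output is an ordinary script with no dependency on the tool that produced it, which is the "own and deploy anywhere" claim in the pitch.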

Licensing and “Open Source” Debate

  • Code is on GitHub under Elastic License v2 (ELv2).
  • Several commenters stress this is “source available,” not OSI-compliant open source.
  • Some view the “open source” labeling (including the HN title) as misleading or promotional; others argue everyday usage of “open source” is looser and see complaints as pedantic.
  • There is agreement that, under formal definitions, it should not be called open source.

Comparisons to Existing Tools

  • Commenters compare Amphi to Alteryx, Informatica, Talend, Pentaho, SSIS, Azure Data Factory, NiFi, Elyra, dbt, Airflow, Dagster, Prefect, Windmill, Fivetran, dlt, Meltano, and Databricks Lakeflow.
  • Amphi is framed as:
    • More low‑code/graphical than Dagster/Prefect/dbt.
    • More Python-focused than traditional Java/enterprise GUI ETLs.
    • More transformation/file/AI-oriented than Fivetran‑style ingestion tools.

Low‑Code vs Code‑First ETL

  • Some see low‑code as a regression after the “ETL as code” shift (Airflow/Luigi, etc.), citing:
    • Poor modularity, observability, and scalability in GUI tools, along with vendor lock‑in.
    • Difficulty versioning, testing, and applying CI/CD.
  • Others argue:
    • Visual DAGs make complex flows easier to understand at a glance.
    • Low‑code boosts productivity for small teams and democratizes ETL for less technical users.
    • Both approaches will coexist, depending on team skills and use cases.
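
The testing and CI/CD argument from the code-first side can be made concrete with a small sketch. The `normalize_emails` transformation and its test below are hypothetical examples, not taken from the discussion or from any specific tool; they only show that a transformation written as a plain function can be checked automatically on every commit.

```python
def normalize_emails(rows):
    """Lowercase and strip emails, dropping rows that have no email."""
    cleaned = []
    for row in rows:
        email = (row.get("email") or "").strip().lower()
        if email:
            cleaned.append({**row, "email": email})
    return cleaned


def test_normalize_emails():
    rows = [
        {"id": 1, "email": "  Ann@Example.COM "},
        {"id": 2, "email": None},  # missing value is dropped
        {"id": 3},                 # missing key is dropped
    ]
    assert normalize_emails(rows) == [{"id": 1, "email": "ann@example.com"}]


# A test runner such as pytest would discover this; it is called directly here.
test_normalize_emails()
```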

Architecture, Scaling, and Performance

  • Amphi generates pandas code, with optional scaling via Modin (including Dask backends); future plans include Spark and Snowflake support.
  • Infrastructure orchestration (multi-machine, clusters) is largely manual at this stage.
  • Concerns raised that pandas-centric ETL is more memory‑heavy and less efficient than SQL; the author references a write‑up justifying pandas in some contexts.
  • A “Python ETL” label is questioned given that the repo is mostly TypeScript on the frontend.
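
The Modin scaling route mentioned above relies on Modin exposing a pandas-compatible API, so a pandas script can in principle be switched by changing one import. The sketch below assumes pandas is installed and uses made-up sample data; the `total_by_region` function is illustrative and is not code generated by Amphi.

```python
import io

# Plain pandas by default. Assumption: with Modin installed, replacing this
# line with `import modin.pandas as pd` leaves the rest of the script unchanged.
import pandas as pd

# Made-up sample data standing in for a real input file.
CSV_TEXT = """region,amount
north,10
south,5
north,7
"""


def total_by_region(csv_source):
    """Load a CSV and sum `amount` per `region`, entirely in memory."""
    df = pd.read_csv(csv_source)
    totals = df.groupby("region", as_index=False)["amount"].sum()
    return totals.sort_values("region").reset_index(drop=True)


result = total_by_region(io.StringIO(CSV_TEXT))
print(result)
```

Note that the whole file is loaded into a DataFrame before aggregating, which is the memory cost critics contrast with pushing the same aggregation down to a SQL engine.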

Self‑Serve Data Work and Skills

  • Debate over whether enabling non‑CS staff to build ETLs is beneficial:
    • Critics worry about data quality issues, fragile pipelines, and lack of engineering rigor.
    • Others emphasize training, mentoring, and policy rather than gatekeeping; self‑serve is seen as useful for simple needs, with experts stepping in for complex cases.
  • Several anecdotes describe failed “self‑serve BI/ETL” initiatives where business users ultimately relied on engineers anyway.