Ask HN: What is nowadays (opensource) way of converting HTML to PDF?

Browser Engines & Headless Automation

  • Many recommend using a real browser engine for fidelity, especially with modern CSS/JS:
    • Headless Chrome/Chromium via CLI (--headless --print-to-pdf) is a common baseline.
    • Puppeteer and Playwright are repeatedly cited as the “main” open‑source options. Both are easy to integrate into backends or microservices, and fit best when JS must run and the print CSS is under your control.
    • Gotenberg (a Dockerized Go API around headless Chromium, with LibreOffice handling office formats) is praised as “rock solid” in production, including for high‑volume workloads and Word→PDF conversion.
  • Downsides mentioned: heavier resource usage, headless browser quirks (especially Firefox), and long‑term maintenance overhead.
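
A minimal sketch of the CLI baseline above, driven from Python. The binary names in CANDIDATE_BINARIES and the helper names are assumptions for illustration; only the --headless and --print-to-pdf flags come from the thread.

```python
import shutil
import subprocess
from pathlib import Path

# Assumed binary names; the actual name varies by platform and package.
CANDIDATE_BINARIES = ("chromium", "chromium-browser", "google-chrome")


def build_print_command(binary: str, source: str, output_pdf: str) -> list[str]:
    """Return the argv for a one-shot headless print of `source` to `output_pdf`."""
    return [binary, "--headless", f"--print-to-pdf={output_pdf}", source]


def html_file_to_pdf(html_path: str, output_pdf: str, timeout_s: float = 60.0) -> None:
    """Print a local HTML file to PDF with whichever candidate binary is on PATH."""
    binary = next((b for b in CANDIDATE_BINARIES if shutil.which(b)), None)
    if binary is None:
        raise FileNotFoundError("no Chromium/Chrome binary found on PATH")
    source = Path(html_path).resolve().as_uri()  # file:// URL for the local file
    # check=True raises if the browser exits non-zero; the timeout guards against hangs.
    subprocess.run(
        build_print_command(binary, source, output_pdf),
        check=True,
        timeout=timeout_s,
    )
```

Starting a fresh browser process per document is simple but comparatively heavy, which is one reason the thread points to Puppeteer, Playwright, or Gotenberg for anything long‑running.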

WeasyPrint & Non‑Browser Libraries

  • WeasyPrint gets strong endorsements for server‑side HTML→PDF, especially with Django and AWS Lambda:
    • Handles modern CSS reasonably well (but no JS).
    • Seen as a better‑maintained successor to wkhtmltopdf.
    • Used for invoices, ebooks, and general document export; commenters now consider it close to commercial tools.
  • Some prefer non‑browser engines (like WeasyPrint) for lower resource use and more predictable environments.
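
As a rough illustration of the invoice use case, the sketch below builds static HTML with print CSS and hands it to WeasyPrint. The invoice layout and the function names are hypothetical; HTML(string=...).write_pdf(...) is WeasyPrint's documented entry point, and the package has to be installed separately.

```python
from html import escape


def render_invoice_html(customer: str, items: list[tuple[str, float]]) -> str:
    """Build a small static HTML invoice with print CSS (no JavaScript)."""
    rows = "".join(
        f"<tr><td>{escape(name)}</td><td>{price:.2f}</td></tr>"
        for name, price in items
    )
    total = sum(price for _, price in items)
    return (
        "<!doctype html><html><head><meta charset='utf-8'>"
        "<style>@page { size: A4; margin: 2cm } td { padding: 4px 12px }</style>"
        f"</head><body><h1>Invoice for {escape(customer)}</h1>"
        f"<table>{rows}<tr><td>Total</td><td>{total:.2f}</td></tr></table>"
        "</body></html>"
    )


def write_pdf(html: str, output_pdf: str) -> None:
    """Render the HTML string to a PDF file with WeasyPrint."""
    from weasyprint import HTML  # third-party: pip install weasyprint

    HTML(string=html).write_pdf(output_pdf)
```

Because no script runs, everything the PDF needs must already be in the markup and stylesheet, which is what makes the output predictable.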

Pandoc, LaTeX & Typst

  • Pandoc is frequently named:
    • Converts HTML to PDF through an intermediate format (LaTeX, ConTeXt, or Typst) that is then compiled by an engine such as XeTeX.
    • Valued for robustness across many formats and extensibility.
    • Debate: some dislike that it’s a “wrapper” around LaTeX and would rather target the underlying engine directly; others argue the higher‑level interface is worth it.
    • Also used to create self‑contained HTML instead of PDF; mixed reports on how well complex CSS survives.
  • Typst is suggested as a modern typesetting backend, sometimes via pandoc → typst → PDF.
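
The pandoc route can be sketched as a thin wrapper around the CLI. The function names are hypothetical; --from, --pdf-engine, and -o are standard pandoc options, and the typst engine is assumed to need a recent pandoc release with Typst installed.

```python
import subprocess


def build_pandoc_command(
    input_html: str, output_pdf: str, pdf_engine: str = "xelatex"
) -> list[str]:
    """Return the argv that converts an HTML file to PDF via the given engine."""
    return [
        "pandoc",
        input_html,
        "--from=html",
        f"--pdf-engine={pdf_engine}",
        "-o",
        output_pdf,
    ]


def html_to_pdf(input_html: str, output_pdf: str, pdf_engine: str = "xelatex") -> None:
    """Run pandoc; requires pandoc and the chosen engine to be installed."""
    subprocess.run(
        build_pandoc_command(input_html, output_pdf, pdf_engine), check=True
    )
```

Passing pdf_engine="typst" or pdf_engine="context" swaps the backend without changing the calling code, which is the higher‑level interface the thread's defenders of pandoc have in mind.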

Other Tools & Ecosystem

  • Java stack: openhtmltopdf / Flying Saucer, PDFBox, OpenPDF.
  • PHP: mPDF; JS frontend: jsPDF.
  • Ghostscript and Apache FOP are mentioned for more low‑level or XML‑driven workflows.
  • For PDF merging, poppler’s pdfunite is cited as a simple open‑source solution.
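
For the merging step, a small sketch around pdfunite, which takes the source files followed by the destination file. The function names and the two‑input minimum are choices made for this example.

```python
import subprocess


def build_pdfunite_command(inputs: list[str], output_pdf: str) -> list[str]:
    """Return the argv that merges `inputs`, in order, into `output_pdf`."""
    if len(inputs) < 2:
        raise ValueError("merging needs at least two input PDFs")
    # pdfunite expects all source files first and the destination last.
    return ["pdfunite", *inputs, output_pdf]


def merge_pdfs(inputs: list[str], output_pdf: str) -> None:
    """Run pdfunite; requires poppler's command-line tools to be installed."""
    subprocess.run(build_pdfunite_command(inputs, output_pdf), check=True)
```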

Philosophical & Workflow Points

  • Some argue to avoid HTML→PDF entirely: keep a canonical source (Markdown/LaTeX/etc.) and generate both HTML and PDF from that.
  • Others stress that PDFs are still preferred for archival, offline reading, annotations, and consistent layout; HTML is better for accessibility and reflow.
  • For the OP’s use case (≈5k PDFs/month for archival), the de facto “modern OSS” answers in the thread are:
    • Headless Chromium (directly, via Puppeteer/Playwright, or via Gotenberg), and
    • WeasyPrint, if JS execution isn’t required.
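
One way to picture that split for a batch archival job is the sketch below: pages that need JS go to a browser‑based converter, everything else to WeasyPrint. The Job type, the needs_js flag, and the converter callables are all hypothetical; the actual conversion functions would be supplied by the caller.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Job:
    source: str       # URL or path of the HTML to archive
    output_pdf: str   # where the PDF should be written
    needs_js: bool    # True if the page only renders correctly after scripts run


Converter = Callable[[Job], None]


def pick_engine(job: Job) -> str:
    """Route JS-dependent pages to a browser engine, the rest to WeasyPrint."""
    return "chromium" if job.needs_js else "weasyprint"


def run_batch(
    jobs: list[Job], converters: dict[str, Converter], workers: int = 4
) -> dict[str, str]:
    """Convert all jobs; return a map of output path to 'ok' or an error message."""

    def convert(job: Job) -> tuple[str, str]:
        try:
            converters[pick_engine(job)](job)
            return job.output_pdf, "ok"
        except Exception as exc:  # keep going so one bad page does not stop the batch
            return job.output_pdf, f"failed: {exc}"

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(convert, jobs))
```

At roughly 5k documents a month, which is about 170 a day, a small worker pool like this is likely sufficient; the per‑job status map makes failed pages easy to retry.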