Ask HN: What is nowadays (opensource) way of converting HTML to PDF?
Browser Engines & Headless Automation
- Many recommend using a real browser engine for fidelity, especially with modern CSS/JS:
- Headless Chrome/Chromium via CLI (--headless --print-to-pdf) is a common baseline.
- Puppeteer and Playwright are repeatedly cited as “main” open‑source options. Easy to integrate in backends or microservices; good when JS must run and print CSS is under your control.
- Gotenberg (Dockerized headless Chrome with a Go wrapper) is praised as “rock solid” in production, including high‑volume workloads and Word→PDF.
- Downsides mentioned: heavier resource usage, headless browser quirks (especially Firefox), and long‑term maintenance overhead.
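A minimal sketch of the headless CLI baseline, driven from Python. The function names, the list of candidate binary names, and the timeout are assumptions for illustration; only the --headless and --print-to-pdf flags come from the thread.

```python
import shutil
import subprocess
from pathlib import Path

# Candidate binary names differ by OS and distro; this list is an assumption.
CANDIDATE_BINARIES = ("chromium", "chromium-browser", "google-chrome")


def build_chromium_command(binary: str, source: str, output_pdf: str) -> list[str]:
    """Assemble the argv list for a one-shot headless print-to-PDF run."""
    return [
        binary,
        "--headless",
        f"--print-to-pdf={output_pdf}",
        source,  # a URL or a file:// path
    ]


def find_chromium() -> str:
    """Return the first Chromium/Chrome binary found on PATH."""
    for name in CANDIDATE_BINARIES:
        found = shutil.which(name)
        if found:
            return found
    raise FileNotFoundError("no Chromium/Chrome binary found on PATH")


def html_to_pdf(source: str, output_pdf: str, timeout_s: int = 60) -> Path:
    """Render `source` to a PDF file; raises if the browser exits non-zero."""
    command = build_chromium_command(find_chromium(), source, output_pdf)
    subprocess.run(command, check=True, timeout=timeout_s)
    return Path(output_pdf)
```

Calling html_to_pdf("https://example.com", "example.pdf") would start a fresh browser process per document. That is simple but is also where the resource cost mentioned above comes from; Puppeteer, Playwright, or Gotenberg keep a browser running and reuse it.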
WeasyPrint & Non‑Browser Libraries
- WeasyPrint gets strong endorsements for server‑side HTML→PDF, especially with Django and AWS Lambda:
- Handles modern CSS reasonably well (but no JS).
- Seen as a better‑maintained successor to wkhtmltopdf.
- Used for invoices, ebooks, and general document export; considered close to commercial tools now.
- Some prefer non‑browser engines (like WeasyPrint) for lower resource use and more predictable environments.
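As a rough sketch of the server‑side WeasyPrint path, assuming the weasyprint package is installed: the template, the function names, and the invoice wording are hypothetical. Page size and margins are set with a CSS @page rule, since no JavaScript runs.

```python
from html import escape
from string import Template

# Hypothetical document template; layout is controlled purely by print CSS.
PAGE_TEMPLATE = Template("""<!doctype html>
<html>
<head>
<meta charset="utf-8">
<style>
  @page { size: A4; margin: 20mm; }
  body { font-family: sans-serif; }
</style>
</head>
<body>
<h1>$title</h1>
<p>$body</p>
</body>
</html>""")


def render_html(title: str, body: str) -> str:
    """Fill the template, escaping user-supplied text."""
    return PAGE_TEMPLATE.substitute(title=escape(title), body=escape(body))


def write_pdf(html: str, output_path: str) -> None:
    """Convert an HTML string to a PDF file with WeasyPrint."""
    # Imported lazily so the HTML helpers work without the dependency.
    from weasyprint import HTML

    HTML(string=html).write_pdf(output_path)
```

A call such as write_pdf(render_html("Invoice 42", "Total: 100 EUR"), "invoice.pdf") stays in one Python process with no browser, which is the lower‑resource trade‑off described above.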
Pandoc, LaTeX & Typst
- Pandoc is frequently named:
- Converts HTML→LaTeX (or Typst)→PDF via engines like XeTeX or ConTeXt.
- Valued for robustness across many formats and extensibility.
- Debate: some dislike that it’s a “wrapper” around LaTeX and would rather target the underlying engine directly; others argue the higher‑level interface is worth it.
- Also used to create self‑contained HTML instead of PDF; mixed reports on how well complex CSS survives.
- Typst is suggested as a modern typesetting backend, sometimes via pandoc → typst → PDF.
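The Pandoc route can be sketched as a thin wrapper around the pandoc CLI. The function names and file names are hypothetical; the engine values shown (xelatex, context, typst) are pandoc's own names for the engines discussed above, and the typst engine assumes a reasonably recent pandoc release with Typst installed.

```python
import subprocess


def build_pandoc_command(
    input_html: str, output_pdf: str, pdf_engine: str = "xelatex"
) -> list[str]:
    """Assemble a pandoc invocation that reads HTML and writes a PDF."""
    return [
        "pandoc",
        input_html,
        "--from=html",
        f"--pdf-engine={pdf_engine}",
        "-o",
        output_pdf,
    ]


def pandoc_html_to_pdf(
    input_html: str, output_pdf: str, pdf_engine: str = "xelatex"
) -> None:
    """Run pandoc; raises CalledProcessError if the conversion fails."""
    subprocess.run(
        build_pandoc_command(input_html, output_pdf, pdf_engine), check=True
    )
```

Switching backends is then a one‑argument change, for example pandoc_html_to_pdf("page.html", "page.pdf", pdf_engine="typst"). Because the HTML is parsed into Pandoc's document model before typesetting, complex CSS layout should not be expected to carry over.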
Other Tools & Ecosystem
- Java stack: openhtmltopdf / Flying Saucer, PDFBox, OpenPDF.
- PHP: mPDF; JS frontend: jsPDF.
- Ghostscript and Apache FOP are mentioned for more low‑level or XML‑driven workflows.
- For PDF merging, poppler’s pdfunite is cited as a simple open‑source solution.
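For the merging step, a small sketch that shells out to pdfunite, assuming poppler's utilities are on PATH. The function names are hypothetical, and the guard against listing the destination as an input is an added precaution, not something pdfunite requires.

```python
import subprocess


def build_pdfunite_command(inputs: list[str], output: str) -> list[str]:
    """Assemble a pdfunite call: input PDFs in order, destination last."""
    if not inputs:
        raise ValueError("at least one input PDF is required")
    if output in inputs:
        raise ValueError("output path must not also be an input")
    return ["pdfunite", *inputs, output]


def merge_pdfs(inputs: list[str], output: str) -> None:
    """Merge PDFs in the given order; raises if pdfunite exits non-zero."""
    subprocess.run(build_pdfunite_command(inputs, output), check=True)
```

For example, merge_pdfs(["cover.pdf", "body.pdf"], "report.pdf") would write the pages of cover.pdf followed by those of body.pdf.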
Philosophical & Workflow Points
- Some argue to avoid HTML→PDF entirely: keep a canonical source (Markdown/LaTeX/etc.) and generate both HTML and PDF from that.
- Others stress that PDFs are still preferred for archival, offline reading, annotations, and consistent layout; HTML is better for accessibility and reflow.
- For the OP’s use case (≈5k PDFs/month for archival), the de facto “modern OSS” answers in the thread are:
- Headless Chromium (directly, via Puppeteer/Playwright, or via Gotenberg), and
- WeasyPrint, if JS execution isn’t required.