2025-09-28

Ask HN: What is nowadays (opensource) way of converting HTML to PDF?

Browser Engines & Headless Automation

Many recommend using a real browser engine for fidelity, especially with modern CSS/JS:
- Headless Chrome/Chromium via CLI (--headless --print-to-pdf) is a common baseline.
- Puppeteer and Playwright are repeatedly cited as the “main” open‑source options. They are easy to integrate into backends or microservices and fit best when JS must run and the print CSS is under your control.
- Gotenberg (a Dockerized Go API in front of headless Chromium, with LibreOffice handling office formats) is praised as “rock solid” in production, including high‑volume workloads and Word→PDF.
Downsides mentioned: heavier resource usage, headless browser quirks (especially Firefox), and long‑term maintenance overhead.
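
As a rough illustration of the CLI baseline above, here is a minimal Python sketch that shells out to headless Chromium. The helper names (`find_chrome`, `build_print_command`, `html_to_pdf`) and the list of candidate executable names are assumptions made for this example; only the `--headless` and `--print-to-pdf` flags come from the thread.

```python
import shutil
import subprocess
from pathlib import Path

# Executable names differ by platform and distribution; this list is an assumption.
CHROME_CANDIDATES = ("chromium", "chromium-browser", "google-chrome", "chrome")


def find_chrome():
    """Return the first Chrome/Chromium executable found on PATH, or None."""
    for name in CHROME_CANDIDATES:
        path = shutil.which(name)
        if path:
            return path
    return None


def build_print_command(chrome, source, output_pdf):
    """Build the argv for one headless print-to-PDF run.

    `source` may be a URL or a local file path. Local paths are converted to
    file:// URLs so that relative assets resolve next to the HTML file.
    """
    if "://" not in source:
        source = Path(source).resolve().as_uri()
    return [chrome, "--headless", f"--print-to-pdf={output_pdf}", source]


def html_to_pdf(source, output_pdf, timeout=60):
    """Render `source` to `output_pdf`, raising if Chrome is missing or fails."""
    chrome = find_chrome()
    if chrome is None:
        raise RuntimeError("no Chrome/Chromium executable found on PATH")
    subprocess.run(
        build_print_command(chrome, source, output_pdf),
        check=True,
        timeout=timeout,
    )
```

Calling `html_to_pdf("report.html", "report.pdf")` would start one browser process per document. That is simple but is also where the resource-usage complaints come from; Puppeteer, Playwright, and Gotenberg mostly exist to keep a browser alive across many renders.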

WeasyPrint & Non‑Browser Libraries

WeasyPrint gets strong endorsements for server‑side HTML→PDF, especially with Django and AWS Lambda:
- Handles modern CSS reasonably well (but no JS).
- Seen as a better‑maintained replacement for wkhtmltopdf.
- Used for invoices, ebooks, and general document export; considered close to commercial tools now.
Some prefer non‑browser engines (like WeasyPrint) for lower resource use and more predictable environments.
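
For the invoice-style use case mentioned above, a minimal WeasyPrint sketch might look like the following. The `build_invoice_html` and `render_pdf` helpers, the sample styling, and the file names are made up for illustration; the only WeasyPrint call used is `HTML(string=...).write_pdf(...)`. Since no JS runs, all content has to be present in the HTML string before rendering.

```python
from html import escape


def build_invoice_html(title, rows):
    """Return a small print-oriented HTML document as a string.

    `rows` is an iterable of (description, amount) string pairs.
    """
    body = "\n".join(
        f"<tr><td>{escape(name)}</td><td>{escape(amount)}</td></tr>"
        for name, amount in rows
    )
    return f"""<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>{escape(title)}</title>
<style>
  @page {{ size: A4; margin: 2cm; }}
  body {{ font-family: sans-serif; }}
  td {{ padding: 4px 12px; }}
</style>
</head>
<body>
<h1>{escape(title)}</h1>
<table>
{body}
</table>
</body>
</html>"""


def render_pdf(html_text, output_pdf):
    """Write `html_text` to `output_pdf` using WeasyPrint."""
    # Imported lazily so the HTML helper is usable without WeasyPrint installed.
    from weasyprint import HTML

    HTML(string=html_text).write_pdf(output_pdf)
```

A call such as `render_pdf(build_invoice_html("Invoice 001", [("Widget", "10.00")]), "invoice.pdf")` would produce a one-page A4 document. Page size and margins are controlled through the `@page` rule, which is a large part of why print CSS matters more here than in a screen-first browser workflow.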

Pandoc, LaTeX & Typst

Pandoc is frequently named:
- Converts HTML to PDF by way of an intermediate format (LaTeX, ConTeXt, or Typst) that a PDF engine such as XeLaTeX then compiles.
- Valued for robustness across many formats and extensibility.
- Debate: some dislike that it’s a “wrapper” around LaTeX and would rather target the underlying engine directly; others argue the higher‑level interface is worth it.
- Also used to create self‑contained HTML instead of PDF; mixed reports on how well complex CSS survives.
Typst is suggested as a modern typesetting backend, sometimes via pandoc → typst → PDF.
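
A hedged sketch of the pandoc route, driven from Python: the helper names and file names are illustrative assumptions, and the code presumes `pandoc` plus the chosen engine are installed and on PATH. Passing `pdf_engine="typst"` covers the pandoc → typst → PDF variant, which needs a reasonably recent pandoc release.

```python
import subprocess


def build_pandoc_command(input_html, output_pdf, pdf_engine="xelatex"):
    """Build the argv for an HTML-to-PDF conversion through pandoc."""
    return [
        "pandoc",
        input_html,
        "--from=html",
        f"--pdf-engine={pdf_engine}",
        "-o",
        output_pdf,
    ]


def convert_with_pandoc(input_html, output_pdf, pdf_engine="xelatex"):
    """Run pandoc and raise CalledProcessError if the conversion fails."""
    subprocess.run(
        build_pandoc_command(input_html, output_pdf, pdf_engine),
        check=True,
    )
```

Because pandoc re-typesets the document instead of rendering it, the output follows the LaTeX or Typst template rather than the page's CSS, which fits the mixed reports about complex styling.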

Other Tools & Ecosystem

Java stack: openhtmltopdf / Flying Saucer, PDFBox, OpenPDF.
PHP: mPDF; JS frontend: jsPDF.
Ghostscript and Apache FOP are mentioned for more low‑level or XML‑driven workflows.
For PDF merging, poppler’s pdfunite is cited as a simple open‑source solution.
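
The merge step can be sketched the same way. `pdfunite` takes the input files followed by the output file; the wrapper names below are hypothetical, and the two-input minimum is a choice made for this example, not a documented limit of the tool.

```python
import subprocess


def build_pdfunite_command(inputs, output_pdf):
    """Build the argv for pdfunite: all inputs in order, then the output file."""
    inputs = list(inputs)
    if len(inputs) < 2:
        raise ValueError("need at least two input PDFs to merge")
    return ["pdfunite", *inputs, output_pdf]


def merge_pdfs(inputs, output_pdf):
    """Merge `inputs` into `output_pdf`, raising if pdfunite fails."""
    subprocess.run(build_pdfunite_command(inputs, output_pdf), check=True)
```

For example, `merge_pdfs(["cover.pdf", "body.pdf"], "combined.pdf")` concatenates the pages in the order given.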

Philosophical & Workflow Points

Some argue to avoid HTML→PDF entirely: keep a canonical source (Markdown/LaTeX/etc.) and generate both HTML and PDF from that.
Others stress that PDFs are still preferred for archival, offline reading, annotations, and consistent layout; HTML is better for accessibility and reflow.
For the OP’s use case (≈5k PDFs/month for archival), the de facto “modern OSS” answers in the thread are:
- Headless Chromium (directly, via Puppeteer/Playwright, or via Gotenberg), and
- WeasyPrint, if JS execution isn’t required.
