2025-12-06

HTML as an Accessible Format for Papers (2023)

Status of arXiv HTML and “experimental” label

Commenters note HTML versions have existed for years; the linked page is from 2023 and some think the “experimental” label is overstaying its welcome.
Coverage is incomplete, especially for older papers; some users wish for a “try HTML anyway” button.
ar5iv is highlighted as an unofficial mirror that uses similar technology with a one‑month lag; the now-defunct arxiv‑vanity is mentioned as a predecessor.
An arXiv HTML developer explains the main bottleneck is developer time and asks users to report rendering issues via GitHub; LaTeXML is the core converter.

Technical and authoring challenges (TeX → HTML)

Commenters cite a figure of 90% of arXiv submissions being TeX/LaTeX; converting a Turing-complete macro language to robust HTML at scale is described as uniquely hard.
Users report frequent layout issues in HTML (figure sizing, column widths) and more consistent layout in PDFs.
Some authors say HTML conversion forces them to add fallback macros and increases their workload; local simulation of arXiv’s pipeline is difficult, though a Docker image exists.
LaTeXML’s approach (TeX → semantic XML → HTML via XSLT) is mentioned; documentation is seen as a barrier to contributors.
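
The two-stage idea mentioned above (a semantic intermediate tree, then a separate mapping to HTML) can be sketched roughly as follows. This is only an illustration under stated assumptions: the element names (`document`, `section`, `title`, `para`) are invented here and are not LaTeXML's actual schema, and a plain Python tree walk stands in for XSLT, which the Python standard library does not provide.

```python
import xml.etree.ElementTree as ET

# Hypothetical semantic XML standing in for the intermediate stage.
# These element names are made up for illustration; they are NOT LaTeXML's schema.
SEMANTIC_XML = """<document>
  <title>A Sample Paper</title>
  <section>
    <title>Introduction</title>
    <para>HTML can be an accessible format.</para>
  </section>
</document>"""

# Assumed mapping from semantic elements to HTML elements.
TAG_MAP = {"document": "article", "section": "section", "para": "p"}


def to_html(node, depth=1):
    """Recursively map one semantic element and its children to HTML."""
    if node.tag == "title":
        # Titles become headings whose level follows the nesting depth.
        html_tag = f"h{min(depth, 6)}"
    else:
        # Unknown elements fall back to a generic div.
        html_tag = TAG_MAP.get(node.tag, "div")
    out = ET.Element(html_tag)
    # Drop the indentation whitespace that the pretty-printed source carries.
    out.text = (node.text or "").strip() or None
    child_depth = depth + 1 if node.tag == "section" else depth
    for child in node:
        out.append(to_html(child, child_depth))
    return out


semantic_tree = ET.fromstring(SEMANTIC_XML)
html_text = ET.tostring(to_html(semantic_tree), encoding="unicode")
print(html_text)
```

Keeping the mapping in a separate step is what lets the same intermediate tree feed more than one output format, which is presumably part of the appeal of the approach.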

Accessibility and reading experience: HTML vs PDF vs EPUB

Strong support for HTML on accessibility grounds: better with screen readers, easier text extraction, and integration with browser extensions (translation, notes, LLM tools).
Others defend PDF for high‑fidelity print and “author’s intended layout,” especially when seriously studying papers. Some note a generational split: print vs multi‑monitor/tablet reading.
HTML is praised as inherently more accessible than PDF, but only if semantic tags (figure/figcaption, headings, citations) are used instead of “a sea of divs.”
EPUB is suggested as ideal for e‑readers; it’s essentially packaged HTML but lacks strong, portable annotation tooling.
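
The "sea of divs" concern above can be made concrete with a crude check that counts semantic versus generic tags in a page. This is a minimal sketch, not an accessibility audit: the `SEMANTIC` and `GENERIC` tag sets and the `semantic_ratio` helper are assumptions chosen for illustration, and a high ratio does not by itself mean a page works well with a screen reader.

```python
from collections import Counter
from html.parser import HTMLParser

# Assumed tag sets for illustration; real audits look at far more than tag names.
SEMANTIC = {"article", "section", "figure", "figcaption", "cite",
            "h1", "h2", "h3", "h4", "h5", "h6", "nav", "main"}
GENERIC = {"div", "span"}


class TagCounter(HTMLParser):
    """Counts how often each start tag appears in the document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def semantic_ratio(html):
    """Share of semantic tags among semantic + generic tags (0.0 if neither occurs)."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    semantic = sum(n for tag, n in parser.counts.items() if tag in SEMANTIC)
    generic = sum(n for tag, n in parser.counts.items() if tag in GENERIC)
    total = semantic + generic
    return semantic / total if total else 0.0


div_soup = '<div><div class="fig"><div>Figure 1</div></div></div>'
structured = ('<article><h1>Title</h1><figure><img src="a.png" alt="plot">'
              '<figcaption>Figure 1</figcaption></figure></article>')

print(semantic_ratio(div_soup), semantic_ratio(structured))
```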

What should be the canonical format?

One camp argues HTML is sufficient as a structural format (semantic HTML + CSS), and “perfect is the enemy of good.”
Another wants a pure structural format (abstract, authors, sections, equations) separate from any rendering, with HTML/PDF as views. XML+XSLT or custom HTML elements are proposed.
Markdown is proposed but dismissed as less machine‑readable than HTML and weak for complex figures and tables.
Several say nobody wants to author directly in HTML; the real need is a single high-level markup that targets both PDF and HTML (Typst is cited as a promising but immature example).
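
The "structure first, renderings as views" position above can be illustrated with a small sketch in which one structural record feeds two independent renderers. The `Paper` and `Section` types and the `render_html` / `render_text` functions are hypothetical names invented for this example; no commenter proposed this particular schema.

```python
from dataclasses import dataclass, field
from html import escape


@dataclass
class Section:
    heading: str
    body: str


@dataclass
class Paper:
    """Hypothetical structural record: content only, no layout decisions."""
    title: str
    authors: list[str]
    abstract: str
    sections: list[Section] = field(default_factory=list)


def render_html(paper):
    """One view: semantic HTML built from the structural record."""
    parts = ["<article>", f"<h1>{escape(paper.title)}</h1>"]
    parts.append(f'<p class="authors">{escape(", ".join(paper.authors))}</p>')
    parts.append(f'<section class="abstract"><p>{escape(paper.abstract)}</p></section>')
    for s in paper.sections:
        parts.append(
            f"<section><h2>{escape(s.heading)}</h2><p>{escape(s.body)}</p></section>"
        )
    parts.append("</article>")
    return "\n".join(parts)


def render_text(paper):
    """Another view of the same record: plain text."""
    lines = [paper.title, ", ".join(paper.authors), "", paper.abstract]
    for s in paper.sections:
        lines += ["", s.heading, s.body]
    return "\n".join(lines)


paper = Paper(
    title="Views & Structure",
    authors=["A. Author", "B. Author"],
    abstract="A toy abstract.",
    sections=[Section("Introduction", "Body text.")],
)
print(render_html(paper))
print(render_text(paper))
```

Neither renderer knows about the other, so adding a further view (for example, a PDF backend) would not require touching the stored structure.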

Machine-readability, LLMs, and future directions

Some speculate HTML support is partly motivated by feeding papers into LLMs; others reply modern models already handle PDFs well.
One view: as multimodal LLMs improve, file format will matter less because models can “visually understand” PDFs/PNGs and re-express them as summaries, databases, or audiobooks.
Others consider it dystopian for LLMs to become the primary “front end” to research, given hallucinations and the loss of nuance. A medical-data anecdote stresses the need for guaranteed correctness, arguing that better native formats (not PDFs) still matter.

Math, Unicode, and TeX in browsers (tangents)

Some wish Unicode and modern font tech had been extended for richer math so plain text could replace TeX+PDF; others argue math layout (fractions, scalable parentheses) is fundamentally beyond Unicode’s scope and better handled by MathML/TeX.
Suggestions to render TeX directly in browsers or via SVG are critiqued as losing semantic structure, undermining accessibility goals.

Incentives and conservatism in publishing

Multiple comments note that authors mainly follow journal templates; entrenched LaTeX/PDF workflows and expectations (two-column layouts, traditional appearance) slow adoption of more accessible, responsive formats.
