HTML as an Accessible Format for Papers (2023)
Status of arXiv HTML and “experimental” label
- Commenters note HTML versions have existed for years; the linked page is from 2023 and some think the “experimental” label is overstaying its welcome.
- Coverage is incomplete, especially for older papers; some users wish for a “try HTML anyway” button.
- ar5iv is highlighted as an unofficial mirror that uses similar tech with a one‑month lag; the defunct arxiv‑vanity is mentioned as a predecessor.
- An arXiv HTML developer explains the main bottleneck is developer time and asks users to report rendering issues via GitHub; LaTeXML is the core converter.
Technical and authoring challenges (TeX → HTML)
- Reportedly, 90% of submissions are TeX/LaTeX; converting a Turing-complete macro language to robust HTML at scale is described as uniquely hard.
- Users report frequent layout issues in HTML (figure sizing, column widths) and more consistent layout in PDFs.
- Some authors say HTML conversion forces them to add fallback macros and increases their workload; local simulation of arXiv’s pipeline is difficult, though a Docker image exists.
- LaTeXML’s approach (TeX → semantic XML → HTML via XSLT) is mentioned; documentation is seen as a barrier to contributors.
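To make the two-stage idea concrete, here is a minimal sketch of the second stage only (semantic XML → HTML). It is not LaTeXML's implementation: the element names (`document`, `section`, `title`, `para`) are hypothetical rather than LaTeXML's real schema, and a hand-written Python mapping stands in for XSLT because the standard library has no XSLT engine.

```python
import xml.etree.ElementTree as ET

# Hypothetical intermediate XML; these element names are illustrative,
# not LaTeXML's actual schema.
SEMANTIC_XML = """
<document>
  <title>On Widgets</title>
  <section>
    <title>Introduction</title>
    <para>Widgets are useful.</para>
  </section>
</document>
"""

def to_html(node, depth=1):
    """Map one semantic node and its children to an HTML element."""
    if node.tag == "document":
        out = ET.Element("article")
    elif node.tag == "section":
        out = ET.Element("section")
        depth += 1  # titles inside a section get a deeper heading level
    elif node.tag == "title":
        out = ET.Element(f"h{min(depth, 6)}")
    elif node.tag == "para":
        out = ET.Element("p")
    else:
        # Unknown construct: keep the content but flag it for inspection.
        out = ET.Element("span", {"class": f"unknown-{node.tag}"})
    text = (node.text or "").strip()
    out.text = text or None
    for child in node:
        out.append(to_html(child, depth))
    return out

def convert(xml_source):
    """Parse the semantic XML and serialize the mapped HTML tree."""
    root = ET.fromstring(xml_source)
    return ET.tostring(to_html(root), encoding="unicode")

html = convert(SEMANTIC_XML)
print(html)
```

The hard part the thread describes sits before this step: expanding arbitrary TeX macros into a semantic tree in the first place. Once that tree exists, the mapping to HTML is comparatively mechanical.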
Accessibility and reading experience: HTML vs PDF vs EPUB
- Strong support for HTML on accessibility grounds: better with screen readers, easier text extraction, and integration with browser extensions (translation, notes, LLM tools).
- Others defend PDF for high‑fidelity print and “author’s intended layout,” especially when seriously studying papers. Some note a generational split: print vs multi‑monitor/tablet reading.
- HTML is praised as inherently more accessible than PDF, but only if semantic tags (figure/figcaption, headings, citations) are used instead of “a sea of divs.”
- EPUB is suggested as ideal for e‑readers; it’s essentially packaged HTML but lacks strong, portable annotation tooling.
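The "sea of divs" point can be illustrated with a small sketch, assuming a tool that wants a paper's headings and figure captions. The two HTML snippets and the `OutlineParser` class are made up for illustration; only Python's standard `html.parser` module is used.

```python
from html.parser import HTMLParser

class OutlineParser(HTMLParser):
    """Collect headings and figure captions from semantic HTML."""
    WANTED = {"h1", "h2", "h3", "figcaption"}

    def __init__(self):
        super().__init__()
        self.current = None   # tag currently being captured, if any
        self.buffer = []      # text pieces seen inside that tag
        self.found = []       # (tag, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.WANTED:
            self.current = tag
            self.buffer = []

    def handle_data(self, data):
        if self.current:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self.current:
            self.found.append((tag, "".join(self.buffer).strip()))
            self.current = None

def outline(html):
    parser = OutlineParser()
    parser.feed(html)
    parser.close()
    return parser.found

# Same visible content, marked up two ways (both snippets are illustrative).
SEMANTIC = """
<article>
  <h1>On Widgets</h1>
  <figure><img src="w.png" alt="A widget"><figcaption>Figure 1: A widget.</figcaption></figure>
</article>
"""

DIV_SOUP = """
<div class="c1">
  <div class="t">On Widgets</div>
  <div class="f"><img src="w.png"><div class="cap">Figure 1: A widget.</div></div>
</div>
"""

print(outline(SEMANTIC))
print(outline(DIV_SOUP))
```

The semantic version yields the title and the caption without any knowledge of the page's class names; the div-only version yields nothing, which is roughly the position a screen reader or extraction tool is in.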
What should be the canonical format?
- One camp argues HTML is sufficient as a structural format (semantic HTML + CSS), and “perfect is the enemy of good.”
- Another wants a pure structural format (abstract, authors, sections, equations) separate from any rendering, with HTML/PDF as views. XML+XSLT or custom HTML elements are proposed.
- Markdown is proposed but dismissed as less machine‑readable than HTML and weak for complex figures/tables.
- Several say nobody wants to author directly in HTML; the real need is a single high-level markup that targets both PDF and HTML (Typst is cited as a promising but immature example).
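The "structural format with HTML/PDF as views" proposal can be sketched as follows. This is an illustration of the idea only, not any proposed standard: the `Paper` and `Section` types and both renderers are hypothetical, and the LaTeX renderer (standing in for the PDF route) does no escaping of special characters.

```python
from dataclasses import dataclass, field
from html import escape

@dataclass
class Section:
    heading: str
    body: str

@dataclass
class Paper:
    """Rendering-independent structure; every output is a view of this."""
    title: str
    authors: list[str]
    abstract: str
    sections: list[Section] = field(default_factory=list)

def render_html(p: Paper) -> str:
    """HTML view using semantic elements."""
    parts = [
        "<article>",
        f"<h1>{escape(p.title)}</h1>",
        f'<p class="authors">{escape(", ".join(p.authors))}</p>',
        f'<section class="abstract"><p>{escape(p.abstract)}</p></section>',
    ]
    for s in p.sections:
        parts.append(
            f"<section><h2>{escape(s.heading)}</h2><p>{escape(s.body)}</p></section>"
        )
    parts.append("</article>")
    return "\n".join(parts)

def render_latex(p: Paper) -> str:
    """LaTeX view (the PDF route). Note: no LaTeX escaping is done here."""
    authors = r" \and ".join(p.authors)
    lines = [
        r"\documentclass{article}",
        rf"\title{{{p.title}}}",
        rf"\author{{{authors}}}",
        r"\begin{document}",
        r"\maketitle",
        r"\begin{abstract}",
        p.abstract,
        r"\end{abstract}",
    ]
    for s in p.sections:
        lines.append(rf"\section{{{s.heading}}}")
        lines.append(s.body)
    lines.append(r"\end{document}")
    return "\n".join(lines)

paper = Paper(
    title="On Widgets",
    authors=["A. Author", "B. Author"],
    abstract="We study widgets.",
    sections=[Section("Introduction", "Widgets are useful.")],
)
print(render_html(paper))
print(render_latex(paper))
```

The counter-argument in the thread is visible here too: the structure above covers only what the schema anticipated, whereas real papers rely on custom macros, complex figures, and tables that a fixed schema would have to grow to accommodate.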
Machine-readability, LLMs, and future directions
- Some speculate HTML support is partly motivated by feeding papers into LLMs; others reply modern models already handle PDFs well.
- One view: as multimodal LLMs improve, file format will matter less because models can “visually understand” PDFs/PNGs and re-express them as summaries, databases, or audiobooks.
- Others consider this dystopian if LLMs become the primary “front end” to research, given hallucinations and the loss of subtlety. A medical-data anecdote stresses the need for guaranteed correctness, arguing that better native formats (not PDFs) still matter.
Math, Unicode, and TeX in browsers (tangents)
- Some wish Unicode and modern font tech had been extended for richer math so plain text could replace TeX+PDF; others argue math layout (fractions, scalable parentheses) is fundamentally beyond Unicode’s scope and better handled by MathML/TeX.
- Suggestions to render TeX directly in browsers or via SVG are criticized for losing semantic structure, which undermines accessibility goals.
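As a small illustration of the layout-versus-characters argument, the sketch below builds MathML for a parenthesized fraction and compares it with a plain-text rendering. The helper names `frac` and `parenthesized` are hypothetical; the element names (`math`, `mrow`, `mfrac`, `mi`, `mo`) are standard MathML.

```python
import xml.etree.ElementTree as ET

MATHML_NS = "http://www.w3.org/1998/Math/MathML"

def frac(num, den):
    """Build <mfrac> with an identifier for numerator and denominator."""
    f = ET.Element("mfrac")
    for text in (num, den):
        ET.SubElement(f, "mi").text = text
    return f

def parenthesized(inner):
    """Wrap an expression in parentheses expressed as <mo> operators."""
    row = ET.Element("mrow")
    ET.SubElement(row, "mo").text = "("
    row.append(inner)
    ET.SubElement(row, "mo").text = ")"
    return row

math = ET.Element("math", {"xmlns": MATHML_NS})
math.append(parenthesized(frac("a", "b")))
markup = ET.tostring(math, encoding="unicode")

plain_text = "(a/b)"  # what a character-only encoding can express

print(markup)
print(plain_text)
```

In the markup, the numerator and denominator are distinct children of `mfrac`, and the parentheses are operators that a MathML renderer can stretch to the height of the fraction. The plain-text form carries neither piece of information, which is the gap commenters say Unicode alone cannot close; a flat SVG rendering of the same formula would lose that structure as well.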
Incentives and conservatism in publishing
- Multiple comments note that authors mainly follow journal templates; entrenched LaTeX/PDF workflows and expectations (two-column layouts, traditional appearance) slow adoption of more accessible, responsive formats.