2024-08-19

Classifying all of the PDFs on the internet

Scale of the PDF Corpus

Many commenters argue that 8 TB is small relative to “all PDFs on the internet.”
Comparisons: Libgen and Anna’s Archive are reported in the tens to hundreds of TB; one estimate for Google Scholar–indexed PDFs alone implies >50 TB.
Several commenters with private collections report holding 10–40 TB or more of PDFs (scientific papers, manuals, magazines). This suggests the global total is likely far larger, possibly petabytes once private documents (invoices, contracts, scans) are included.

Common Crawl and Dataset Limitations

Common Crawl typically truncates PDFs at ~1 MiB; the SafeDocs dataset refetched untruncated versions for one snapshot.
Some note that this still only covers the “open web,” excludes paywalled and private corpora, and likely misses many large image-heavy PDFs.
One commenter points out that the article effectively works on 500k PDFs, not the full corpus, and likely on metadata/URLs rather than full content.
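
To illustrate the truncation point, here is a minimal Python sketch of how one might flag PDF bodies that were probably cut off at the ~1 MiB limit. The function name looks_truncated and the 1024-byte tail window are assumptions made for illustration, not something from the article or from Common Crawl's tooling. It is only a heuristic: a complete PDF normally ends with an %%EOF marker, so a body that reaches the limit without one is suspect.

```python
ONE_MIB = 1024 * 1024  # truncation limit reported in the discussion (~1 MiB)


def looks_truncated(payload: bytes, limit: int = ONE_MIB) -> bool:
    """Heuristically flag a PDF body that was probably cut off by a crawler.

    Hypothetical helper: the name and the 1024-byte tail window are
    illustrative assumptions, not part of any Common Crawl API.
    """
    if not payload.startswith(b"%PDF-"):
        raise ValueError("payload does not start with a PDF header")
    # A complete PDF ends with an %%EOF marker near the end of the file.
    has_eof = b"%%EOF" in payload[-1024:]
    # A body that reaches the size limit without that marker was most
    # likely truncated rather than naturally short.
    return len(payload) >= limit and not has_eof


complete = b"%PDF-1.7\n" + b"x" * 100 + b"\n%%EOF\n"
cut_off = (b"%PDF-1.7\n" + b"x" * ONE_MIB)[:ONE_MIB]
print(looks_truncated(complete), looks_truncated(cut_off))  # False True
```

If the crawl record's own metadata is available, a truncation flag there would be a more reliable signal than this byte-level check.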

Right to Be Forgotten (RTBF) Debate

RTBF is discussed tangentially: some see it as futile once data is online; others clarify that laws target specific service providers and search engines, not “the entire internet.”
There is disagreement over whether RTBF aims only to prevent use/storage, or also to remove public searchability of past information.
Concerns are raised about RTBF being used by scammers or convicted individuals to bury past misconduct.

Personal Archives and Copyright

Multiple users describe large private PDF archives (scientific papers, manuals, historical magazines).
Efforts to build public magazine repositories face copyright and DMCA risks; many historical issues are in legal limbo with unclear rights holders.

PDF Extraction, Partitioning, and Embeddings

Several commenters are more interested in techniques for robust PDF parsing and data extraction (especially of tables) than in debates about corpus size.
Aryn and other tools are mentioned for partitioning PDFs and converting tables into structured data.
Commenters highlight that embeddings enable using standard statistical and ML techniques without complex NLP preprocessing.
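
The embeddings point can be sketched in a few lines of Python. Once each document is a vector, a plain nearest-centroid classifier with cosine similarity needs no tokenization or parsing. The 3-dimensional vectors and the labels below are made-up toy data; real vectors would come from an embedding model, and this is not necessarily the method the article used.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def centroids(labeled):
    """Average the embedding vectors of each label."""
    groups = defaultdict(list)
    for vec, label in labeled:
        groups[label].append(vec)
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in groups.items()
    }


def classify(vec, cents):
    """Return the label whose centroid is most similar to vec."""
    return max(cents, key=lambda label: cosine(vec, cents[label]))


# Toy "embeddings" (assumed data for illustration only).
train = [
    ([0.9, 0.1, 0.0], "science"),
    ([0.8, 0.2, 0.1], "science"),
    ([0.1, 0.9, 0.2], "manual"),
    ([0.0, 0.8, 0.3], "manual"),
]
cents = centroids(train)
print(classify([0.7, 0.3, 0.0], cents))  # science
```

The same vectors could be fed to any off-the-shelf clustering or regression routine, which is the convenience the commenters describe.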

Critiques of Article Framing

Some see the title (“all PDFs on the internet”) as marketing overreach, given the limited corpus and reliance on URLs.
Others still find the approach and visualizations valuable as a demonstration of embeddings-based classification at scale.
