Classifying all of the PDFs on the internet
Scale of the PDF Corpus
- Many commenters argue that 8 TB is small relative to “all PDFs on the internet.”
- Comparisons: Libgen and Anna’s Archive are reported in the tens to hundreds of TB; one estimate for Google Scholar–indexed PDFs alone implies >50 TB.
- Several commenters with private collections report 10–40 TB+ of PDFs (scientific, manuals, magazines), suggesting total global volume is likely far larger, possibly petabytes when including private documents (invoices, contracts, scans).
Common Crawl and Dataset Limitations
- Common Crawl typically truncates PDFs at ~1 MiB; the SafeDocs dataset refetched untruncated versions for one snapshot.
- Some note that even the refetched data covers only the “open web,” excludes paywalled and private corpora, and likely misses many large image-heavy PDFs.
- One commenter points out that the article effectively works on 500k PDFs, not the full corpus, and likely on metadata/URLs rather than full content.
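The truncation issue above can be illustrated with a rough check for whether a fetched PDF body looks cut off. This is a minimal sketch, not how Common Crawl or SafeDocs detect truncation: it assumes only that a complete PDF ends with a `%%EOF` marker and that readers commonly look for it in the last 1024 bytes. The function name `looks_truncated` and the `tail` parameter are hypothetical.

```python
def looks_truncated(data: bytes, tail: int = 1024) -> bool:
    """Heuristic: report True when no %%EOF marker appears near the end.

    Assumption: a complete PDF carries its end-of-file marker in the
    final `tail` bytes. A body cut off at a size cap usually loses it.
    """
    return b"%%EOF" not in data[-tail:]


# Toy byte strings standing in for fetched records (not real crawl data).
complete = b"%PDF-1.7\n1 0 obj\n<< >>\nendobj\ntrailer\n<< >>\n%%EOF\n"
cut_off = complete[:20]

print(looks_truncated(complete))
print(looks_truncated(cut_off))
```

A missing marker is only a hint. Malformed but complete files can also lack `%%EOF`, so a check like this would flag candidates for refetching rather than prove truncation.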
Right to Be Forgotten (RTBF) Debate
- RTBF is discussed tangentially: some see it as futile once data is online; others clarify that laws target specific service providers and search engines, not “the entire internet.”
- There is disagreement over whether RTBF aims only to prevent use/storage, or also to remove public searchability of past information.
- Concerns are raised about RTBF being used by scammers or convicted individuals to bury past misconduct.
Personal Archives and Copyright
- Multiple users describe large private PDF archives (scientific papers, manuals, historical magazines).
- Efforts to build public magazine repositories face copyright and DMCA risks; many historical issues are in legal limbo with unclear rights holders.
PDF Extraction, Partitioning, and Embeddings
- Several commenters are more interested in techniques for robust PDF parsing and data extraction (especially tables) than in size debates.
- Aryn and other tools are mentioned for partitioning PDFs and converting tables into structured data.
- Commenters highlight that embeddings make it possible to apply standard statistical and ML techniques without complex NLP preprocessing.
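The point about embeddings can be sketched with a nearest-centroid classifier: once each document is a vector, classification reduces to plain arithmetic. This is an illustrative sketch, not the article's method. The three-dimensional vectors and the labels `science` and `manual` are made up for the example; real embeddings would come from an embedding model and have hundreds of dimensions.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def fit_centroids(vectors, labels):
    """Average the vectors of each label into one centroid per label."""
    sums = {}
    counts = defaultdict(int)
    for vec, label in zip(vectors, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


def predict(centroids, vec):
    """Return the label whose centroid is most similar to vec."""
    return max(centroids, key=lambda label: cosine(centroids[label], vec))


# Hypothetical toy embeddings; real ones would come from a model.
train_vectors = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.1, 0.9, 0.2],
    [0.0, 0.8, 0.3],
]
train_labels = ["science", "science", "manual", "manual"]

centroids = fit_centroids(train_vectors, train_labels)
print(predict(centroids, [0.85, 0.15, 0.05]))
print(predict(centroids, [0.1, 0.7, 0.3]))
```

No tokenization, stemming, or parsing appears anywhere in the classifier itself, which is the commenters' point: the text-specific work is pushed into the embedding step.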
Critiques of Article Framing
- Some see the title (“all PDFs on the internet”) as marketing overreach, given the limited corpus and reliance on URLs.
- Others still find the approach and visualizations valuable as a demonstration of embedding-based classification at scale.