Classifying all of the PDFs on the internet

Scale of the PDF Corpus

  • Many commenters argue that the 8 TB corpus is small relative to “all PDFs on the internet.”
  • Comparisons: Libgen and Anna’s Archive are reported in the tens to hundreds of TB; one estimate for Google Scholar–indexed PDFs alone implies >50 TB.
  • Several commenters with private collections report 10–40 TB+ of PDFs (scientific, manuals, magazines), suggesting total global volume is likely far larger, possibly petabytes when including private documents (invoices, contracts, scans).
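
To put the size comparisons above in proportion, here is a rough back-of-envelope sketch. It uses only figures quoted in the discussion; the decimal conversion (1 PB = 1000 TB) and the choice of which figures to compare against are assumptions made for illustration, not estimates from the article.

```python
# Back-of-envelope: what share of each reference size does an 8 TB corpus cover?
# All sizes are in TB and come from the discussion; 1 PB = 1000 TB is assumed.
CORPUS_TB = 8

reference_sizes_tb = {
    "Google Scholar estimate (lower bound)": 50,
    "large private collection (upper figure)": 40,
    "one petabyte": 1000,
}


def share_percent(corpus_tb: float, total_tb: float) -> float:
    """Return corpus size as a percentage of a reference size."""
    return 100 * corpus_tb / total_tb


shares = {
    name: share_percent(CORPUS_TB, size)
    for name, size in reference_sizes_tb.items()
}

for name, pct in shares.items():
    print(f"{name}: {pct:.1f}%")
```

Under these assumptions the corpus is a sizeable fraction of a single large private collection but under one percent of a petabyte, which is the gap the commenters are pointing at.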

Common Crawl and Dataset Limitations

  • Common Crawl typically truncates PDFs at ~1 MiB; the SafeDocs dataset refetched untruncated versions for one snapshot.
  • Some note that this still only covers the “open web,” excludes paywalled and private corpora, and likely misses many large image-heavy PDFs.
  • One commenter points out that the article effectively works on 500k PDFs, not the full corpus, and likely on metadata/URLs rather than full content.
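
The truncation issue mentioned above can be checked for with a simple heuristic. The following is a minimal sketch, not how Common Crawl or SafeDocs detect truncation: the function name `looks_truncated`, the 1 MiB default, and the 1024-byte tail window are illustrative assumptions. It relies on the fact that a well-formed PDF ends with a `%%EOF` marker, so a payload that reaches the size cap or lacks that marker near its end is a candidate for refetching.

```python
ONE_MIB = 1 << 20  # 1,048,576 bytes; truncation limit as reported in the discussion


def looks_truncated(payload: bytes, limit: int = ONE_MIB, tail: int = 1024) -> bool:
    """Heuristic check for a cut-off PDF payload.

    Flags the payload if it reached the size cap or if no %%EOF marker
    appears in its last `tail` bytes. Both thresholds are assumptions.
    """
    hit_cap = len(payload) >= limit
    has_eof = b"%%EOF" in payload[-tail:]
    return hit_cap or not has_eof


# Hypothetical payloads for illustration only (not real PDFs).
complete = b"%PDF-1.4\n" + b"x" * 100 + b"\n%%EOF\n"
cut_off = b"%PDF-1.4\n" + b"x" * ONE_MIB

print(looks_truncated(complete))  # small payload with an EOF marker
print(looks_truncated(cut_off))   # reached the cap and has no EOF marker
```

A heuristic like this will also flag complete files that happen to be exactly at the cap, so it is better suited to building a refetch list than to making a final judgment.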

Right to Be Forgotten (RTBF) Debate

  • RTBF is discussed tangentially: some see it as futile once data is online; others clarify that laws target specific service providers and search engines, not “the entire internet.”
  • There is disagreement over whether RTBF aims only to prevent use/storage, or also to remove public searchability of past information.
  • Concerns are raised about RTBF being used by scammers or convicted individuals to bury past misconduct.

Personal Archives and Copyright

  • Multiple users describe large private PDF archives (scientific papers, manuals, historical magazines).
  • Efforts to build public magazine repositories face copyright and DMCA risks; many historical issues are in legal limbo with unclear rights holders.

PDF Extraction, Partitioning, and Embeddings

  • Several commenters are more interested in techniques for robust PDF parsing and data extraction (especially tables) than in size debates.
  • Aryn and other tools are mentioned for partitioning PDFs and converting tables into structured data.
  • Commenters highlight that embeddings enable using standard statistical and ML techniques without complex NLP preprocessing.
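
The point about embeddings can be illustrated with a very small example. This is a sketch under stated assumptions, not the article's method: the three-dimensional vectors and the "science" and "finance" labels are made up, and a nearest-centroid classifier with cosine similarity stands in for whatever standard technique one might actually use. Once each document is a vector, the classifier needs no tokenization, parsing, or other NLP preprocessing.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def fit(labelled):
    """Map each label to the centroid of its example embeddings."""
    return {label: centroid(vectors) for label, vectors in labelled.items()}


def predict(centroids, vector):
    """Return the label whose centroid is most similar to the vector."""
    return max(centroids, key=lambda label: cosine(centroids[label], vector))


# Hypothetical toy embeddings; real ones would come from an embedding model.
train = {
    "science": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "finance": [[0.1, 0.9, 0.2], [0.0, 0.8, 0.3]],
}
model = fit(train)

print(predict(model, [0.7, 0.3, 0.0]))
print(predict(model, [0.1, 0.7, 0.4]))
```

The same vectors could be fed to clustering, logistic regression, or dimensionality reduction for visualization without changing the input format.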

Critiques of Article Framing

  • Some see the title (“all PDFs on the internet”) as marketing overreach, given the limited corpus and reliance on URLs.
  • Others still find the approach and visualizations valuable as a demonstration of embeddings-based classification at scale.