arXiv moving from Cornell servers to Google Cloud

Cloud migration and technical modernization

  • arXiv is moving from Cornell-hosted VMs to Google Cloud, with a “Cloud Edition” plan: containerizing services, introducing Kubernetes/Cloud Run, asynchronous processing, better monitoring/logging, and replacing remaining Perl/PHP backend code.
  • Some see this as a normal technical-debt cleanup and capacity upgrade, driven by increased load (especially from AI crawlers), growing submissions, and spam/AI-generated papers.
  • Others argue that arXiv could have stayed on self-managed containers (e.g., k3s, Docker Swarm) or just used a CDN, and that Kubernetes adds unnecessary complexity.

Cost, vendor lock-in, and corporate influence

  • Multiple commenters worry about:
    • Cloud bills ballooning beyond the roughly $88k/year arXiv previously budgeted for servers.
    • Gradual dependence on provider-specific services making “moving back” infeasible once old infrastructure and knowledge decay.
    • “Capitalist capture of the commons” and another public-good service becoming dependent on a mega-corporation.
  • Commenters note that Google is a gold sponsor; some suspect that GCP credits and co-marketing are part of the deal. Opinions differ on whether that’s benign sponsorship or a risky subsidy.

Privacy, control, and censorship

  • Some fear reduced privacy: Google could observe who reads which papers, though others point out Google already sees much via search and tracking.
  • A prominent thread focuses on sanctions and access: commenters share experiences of GCP traffic to Iran being silently dropped. They dispute whether Google or Iran is doing the blocking, though GCP has historically blocked sanctioned countries for some services.
  • Commenters argue this will likely worsen access for users in Iran and similar countries, with debate over whether such blocking is “ideal” or a loss for global science.

arXiv as public infrastructure and alternatives

  • Several see arXiv as de facto public/scientific infrastructure and would prefer:
    • A consortium of international academic libraries or a nonprofit governance model.
    • Federated or distributed architectures where anyone can mirror/clone the corpus; centralized operators become curators rather than single points of failure.
  • Others counter that networking, power, and hosting have long depended on corporations, and that using GCP doesn’t automatically hand control of content to Google.

UI, tooling, and hiring side threads

  • UX opinions split: some want a modernized interface; many like the lean, “ASCII-style” UI for speed and clarity.
  • Commenters discuss the Perl+LaTeX-heavy stack and confirm that LaTeX remains dominant in many math-heavy fields, often via tools like Overleaf.
  • Commenters note US-only hiring and Cornell’s hiring pause, and debate remote/on-call models, but it’s unclear how much these constraints will delay the migration.