2025-04-18

arXiv moving from Cornell servers to Google Cloud

Cloud migration and technical modernization

arXiv is moving from Cornell-hosted VMs to Google Cloud under a “Cloud Edition” plan: containerizing services, introducing Kubernetes/Cloud Run, adding asynchronous processing, improving monitoring and logging, and replacing the remaining Perl/PHP backend code.
Some see this as a normal technical-debt cleanup and capacity upgrade, driven by increased load (especially from AI crawlers), growing submissions, and spam/AI-generated papers.
Others argue arXiv could have stayed on self-managed containers (e.g., k3s, Docker Swarm) or simply added a CDN, and that Kubernetes adds unnecessary complexity.

Cost, vendor lock-in, and corporate influence

Multiple comments worry about:
- Cloud bills ballooning beyond the roughly $88k/year arXiv previously budgeted for servers.
- Gradual dependence on provider-specific services making “moving back” infeasible once old infrastructure and knowledge decay.
- “Capitalist capture of the commons” and another public-good service becoming dependent on a mega-corporation.
Commenters note that Google is a gold sponsor; some suspect GCP credits and co-marketing are part of the deal. Opinions differ on whether that is benign sponsorship or a risky subsidy.

Privacy, control, and censorship

Some fear reduced privacy: Google could observe who reads which papers, though others point out Google already sees much via search and tracking.
A strong thread focuses on sanctions and access. Commenters share experiences of GCP traffic to Iran being silently dropped; they dispute whether Google or Iran is doing the blocking, though GCP has historically blocked sanctioned countries for some services.
Commenters argue this will likely worsen access for users in Iran and similar countries, with debate over whether such blocking is “ideal” or a loss for global science.

arXiv as public infrastructure and alternatives

Several see arXiv as de facto public/scientific infrastructure and would prefer:
- A consortium of international academic libraries or a nonprofit governance model.
- Federated or distributed architectures where anyone can mirror/clone the corpus; centralized operators become curators rather than single points of failure.
Others counter that networking, power, and hosting have long depended on corporations, and that using GCP doesn’t automatically hand control of content to Google.

UI, tooling, and hiring side threads

UX opinions split: some want a modernized interface; many like the lean, “ASCII-style” UI for speed and clarity.
There’s discussion of the Perl+LaTeX-heavy stack and confirmation that LaTeX remains dominant in many math-heavy fields, often via tools like Overleaf.
Commenters note the US-only hiring and Cornell’s hiring pause, and debate remote and on-call models; it’s unclear how much these constraints will delay the migration.
