arXiv moving from Cornell servers to Google Cloud
Cloud migration and technical modernization
- arXiv is moving from Cornell-hosted VMs to Google Cloud, with a “Cloud Edition” plan: containerizing services, introducing Kubernetes/Cloud Run, asynchronous processing, better monitoring/logging, and replacing remaining Perl/PHP backend code.
- Some see this as a normal technical-debt cleanup and capacity upgrade, driven by increased load (especially from AI crawlers), growing submissions, and spam/AI-generated papers.
- Others argue arXiv could have stayed on self-managed containers (e.g., k3s, Docker Swarm) or simply put a CDN in front of the existing servers, and that Kubernetes adds unnecessary complexity.
Cost, vendor lock-in, and corporate influence
- Multiple commenters worry about:
  - Cloud bills ballooning beyond the roughly $88k/year arXiv previously budgeted for servers.
  - Gradual dependence on provider-specific services, which would make “moving back” infeasible once the old infrastructure and the knowledge to run it decay.
  - “Capitalist capture of the commons,” with another public-good service becoming dependent on a mega-corporation.
- Commenters note that Google is a gold sponsor; some suspect GCP credits and co-marketing are part of the deal. Opinions differ on whether that is benign sponsorship or a risky subsidy.
Privacy, control, and censorship
- Some fear reduced privacy: Google could observe who reads which papers. Others point out that Google already sees much of this via search and tracking.
- A prominent thread focuses on sanctions and access. Commenters share experiences of GCP traffic to Iran being silently dropped. They dispute whether Google or Iran is doing the blocking, though GCP has historically blocked sanctioned countries for some services.
- Commenters argue the move will likely worsen access for users in Iran and similar countries, and debate whether such blocking is “ideal” or a loss for global science.
arXiv as public infrastructure and alternatives
- Several commenters see arXiv as de facto public scientific infrastructure and would prefer:
  - A consortium of international academic libraries or a nonprofit governance model.
  - Federated or distributed architectures in which anyone can mirror or clone the corpus, so that centralized operators become curators rather than single points of failure.
- Others counter that networking, power, and hosting have long depended on corporations, and that using GCP doesn’t automatically hand control of content to Google.
UI, tooling, and hiring side threads
- UX opinions split: some want a modernized interface; many like the lean, “ASCII-style” UI for speed and clarity.
- There’s discussion of the Perl+LaTeX-heavy stack and confirmation that LaTeX remains dominant in many math-heavy fields, often via tools like Overleaf.
- Commenters note the US-only hiring and Cornell’s hiring pause, and debate remote and on-call models; it’s unclear how much these constraints will delay the migration.