Show HN: Use Claude Code to Query 600 GB Indexes over Hacker News, ArXiv, etc.

Product concept & appeal

  • The tool lets users query large, multi-source text corpora (HN, arXiv, LessWrong, PubMed in progress, etc.) via LLM-generated SQL plus vector search (a hedged sketch of the pattern follows this list).
  • Many commenters like the “LLM as query generator” model rather than opaque chatbot answers; it’s seen as a translator from natural language into explicit, inspectable queries.
  • People highlight its potential for deep research, exploratory analysis, and discovering hidden patterns in public datasets.
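
One way to picture the “LLM-generated SQL plus vector search” mechanism, as a hedged sketch: the model emits an ordinary SELECT that the user can read before running it. The schema (an items table with an embedding column), the pgvector-style operator, and the helper below are assumptions for illustration, not the tool’s actual schema or API.

```python
# Hypothetical sketch of the "LLM as query generator" pattern: a lexical
# pre-filter plus a vector-distance ranking (pgvector-style "<=>" cosine
# distance). Table/column names are invented for illustration.
import psycopg2

def search(conn, query_embedding, keyword, limit=20):
    sql = """
        SELECT id, source, title, left(body, 300) AS snippet,
               embedding <=> %s::vector AS distance   -- cosine distance
        FROM items
        WHERE body ILIKE %s                           -- cheap lexical pre-filter
        ORDER BY distance
        LIMIT %s;
    """
    with conn.cursor() as cur:
        # pgvector accepts a text literal like "[0.1, 0.2, ...]" cast to vector
        cur.execute(sql, (str(list(query_embedding)), f"%{keyword}%", limit))
        return cur.fetchall()

# Example: conn = psycopg2.connect("dbname=corpora"); search(conn, vec, "optimization")
```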

Open source, keys, and funding

  • Several ask for open-sourcing, both for trust (not wanting to share third-party API keys) and integration into their own research systems.
  • The author repeatedly cites personal financial constraints and server/API costs as the main blocker to open-sourcing and full embedding coverage.
  • Some suggest a standard path: open-sourcing the core with a hosted SaaS on top, raising angel funding, or applying to accelerators.

Technical design: SQL + embeddings

  • Under the hood: Voyage embeddings, paragraph/sentence chunking, SQL plus lexical and vector search, with rate limiting and AST-based query controls (see the SQL-allowlisting sketch after this list).
  • There’s discussion of semantic drift across domains (“optimization” reads differently on arXiv, LessWrong, and HN) and of how higher-quality embeddings and centroid compositions can help (see the embedding-arithmetic sketch after this list).
  • One commenter questions the “vector algebra” framing (@X + @Y − @Z), arguing embeddings don’t form a true algebraic structure; the author replies that this is mainly a practical, intuitive exploration tool, not a formal guarantee.
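
On “AST-based query controls,” one plausible reading is that model-generated SQL is parsed and rejected unless it is a plain read-only SELECT. Below is a sketch of that idea using the sqlglot parser; the thread does not say which parser or rules the author actually uses.

```python
# Sketch of AST-level allowlisting for LLM-generated SQL. Library choice and
# the rule set are assumptions; the post only mentions "AST-based query controls".
import sqlglot
from sqlglot import exp

# Conservative: reject anything that isn't a single SELECT, and anything that
# hides a write/DDL statement in a subquery or CTE.
FORBIDDEN = (exp.Insert, exp.Update, exp.Delete, exp.Drop, exp.Create, exp.Command)

def is_safe_select(sql: str) -> bool:
    try:
        tree = sqlglot.parse_one(sql, read="postgres")
    except sqlglot.errors.ParseError:
        return False
    if not isinstance(tree, exp.Select):
        return False
    return tree.find(*FORBIDDEN) is None
```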
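
The centroid and “@X + @Y − @Z” ideas come down to plain vector arithmetic over normalized embeddings, used heuristically (the author’s framing) rather than as a formal algebra (the critic’s point). A numpy sketch under that assumption; the function names are placeholders.

```python
# Heuristic embedding arithmetic, with no formal guarantees (as the thread notes).
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)

def corpus_centroid(vectors):
    # Mean embedding of a corpus, e.g. to see how "optimization" drifts
    # between arXiv, LessWrong, and HN sub-corpora.
    return normalize(np.mean(vectors, axis=0))

def compose(x, y, z):
    # The "@X + @Y - @Z" style composition as a plain vector sum.
    return normalize(x + y - z)

def rank(query_vec, doc_vecs, k=10):
    # Cosine similarity, assuming the rows of doc_vecs are already normalized.
    sims = doc_vecs @ query_vec
    return np.argsort(-sims)[:k]
```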

Scale, “state-of-the-art,” and marketing

  • Supporters emphasize scale (full-text arXiv and many public corpora in one DB) and freedom to run arbitrary SELECTs plus vector ops as differentiators.
  • Critics challenge the “state-of-the-art” and “intelligence explosion” language as marketing hyperbole and “charlatan-ish,” arguing that “state-of-the-art” is an unprotected, overused term.
  • The author defends the claim by pointing to capabilities (agentic text-to-SQL workflows, multi-source embeddings), not formal benchmarks.

Models, cost, and local vs hosted

  • Some don’t like burning paid Claude credits and ask for local LLaMA/Qwen support; others reply that it’s “just a prompt” and any capable model could drive it, though quality differs (a local-model sketch follows this list).
  • One defender notes that if users won’t pay for their own LLM usage, that’s their choice, but not a problem with the tool itself.
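
On the “just a prompt” point: in principle the same query-generation prompt could drive a local model through any OpenAI-compatible endpoint instead of Claude. A sketch assuming an Ollama-style local server; the endpoint, model name, and prompt text are placeholders, and output quality will vary by model.

```python
# Hypothetical: drive the same text-to-SQL prompt with a local model via an
# OpenAI-compatible endpoint (here an assumed Ollama-style server on :11434).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

SYSTEM = "Translate the user's research question into a single read-only SQL SELECT."

def question_to_sql(question: str) -> str:
    resp = client.chat.completions.create(
        model="qwen2.5-coder",   # placeholder local model name
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": question},
        ],
        temperature=0,
    )
    return resp.choices[0].message.content
```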

Security and sandboxing

  • Multiple comments warn against recommending permissive CLI flags or running untrusted code without sandboxing; devcontainers and dedicated Claude sandboxes are discussed as minimum protections (an illustrative config follows this list).
  • Concerns are also raised about network egress and about trusting a non-established domain with that level of access.
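
On devcontainers as a “minimum protection,” a hedged sketch of what such a config might look like, weighing the egress concern above against the agent needing to reach the hosted index. The image, settings, and network choice are illustrative, not a vetted security setup.

```jsonc
// .devcontainer/devcontainer.json (illustrative only, not a vetted sandbox)
{
  "name": "claude-query-sandbox",
  "image": "mcr.microsoft.com/devcontainers/python:3.12",
  // "--network=none" blocks all egress; swap for a firewalled network if the
  // agent must still reach the hosted index/API.
  "runArgs": ["--network=none"],
  "postCreateCommand": "pip install --no-cache-dir -r requirements.txt"
}
```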

Use cases and user reports

  • People propose applications in autonomous academic agents, biomedical supplementary materials, string theory landscape searches, and watchdog uses (e.g., analyzing leak data).
  • A long report from one user/agent describes successfully building structured research corpora and discovering relevant prior work, with practical notes on latency and result limits.

Broader AI / AGI / Turing-test tangent

  • The thread detours into what counts as AGI, “intelligence explosion,” and the Turing test:
    • Some argue current LLMs would have been seen as AGI by older definitions; others strongly disagree, insisting AGI implies human-level generality or sentience.
    • There’s debate over whether recent advances constitute an “intelligence explosion” or merely efficiency improvements.
    • Several note that public and pop-culture notions of AGI (sentient, goal-directed agents) don’t match today’s prompt-bound models.