Show HN: Use Claude Code to Query 600 GB Indexes over Hacker News, ArXiv, etc.
Product concept & appeal
- Tool lets users query large, multi-source text corpora (HN, arXiv, LessWrong; PubMed in progress) via LLM-generated SQL plus vector search.
- Many commenters like the "LLM as query generator" model instead of opaque chatbot answers; it's seen as a translator from natural language into rigid, inspectable queries.
- People highlight its potential for deep research, exploratory analysis, and discovering hidden patterns in public datasets.
Open source, keys, and funding
- Several ask for open-sourcing, both for trust (not wanting to share third-party API keys) and for integration into their own research systems.
- The author repeatedly cites personal financial constraints and server/API costs as the main blocker to open-sourcing and full embedding coverage.
- Some suggest a standard path: open-source core + hosted SaaS, raising angels, or applying to accelerators.
Technical design: SQL + embeddings
- Under the hood: Voyage embeddings, paragraph/sentence chunking, SQL + lexical search + vector search, with some rate-limiting and AST-based query controls.
- There’s discussion of semantic drift across domains (“optimization” in arXiv vs LessWrong vs HN) and how higher-quality embeddings and centroid compositions can help.
- One commenter questions the “vector algebra” framing (@X + @Y − @Z), arguing embeddings don’t form a true algebraic structure; the author replies that this is mainly a practical, intuitive exploration tool, not a formal guarantee.
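The @X + @Y − @Z composition debated above can be sketched as simple vector arithmetic over embeddings. This is a minimal illustration of the "practical exploration tool" framing, not the product's actual implementation; the document names, vector dimension, and random embeddings are invented for the example.

```python
import numpy as np

# Hypothetical pre-computed document embeddings (in practice these would come
# from an embedding model such as Voyage); names and dimension are illustrative.
rng = np.random.default_rng(0)
docs = {f"doc{i}": rng.normal(size=8) for i in range(5)}

def normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    return v / np.linalg.norm(v)

def compose(plus, minus=()):
    """@X + @Y - @Z style composition: add/subtract unit vectors, renormalize."""
    q = sum(normalize(v) for v in plus) - sum(
        (normalize(v) for v in minus), np.zeros(8)
    )
    return normalize(q)

def top_k(query, k=3):
    """Rank documents by cosine similarity to the composed query vector."""
    scored = sorted(docs.items(), key=lambda kv: -(normalize(kv[1]) @ query))
    return [name for name, _ in scored[:k]]

query = compose(plus=[docs["doc0"], docs["doc1"]], minus=[docs["doc2"]])
print(top_k(query))
```

As the commenter notes, nothing here guarantees an algebraic structure: addition of unit vectors is neither associative with the renormalization step nor semantically well-defined, which is why the author frames it as an intuition aid rather than a formal operation.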
Scale, “state-of-the-art,” and marketing
- Supporters emphasize scale (full-text arXiv and many public corpora in one DB) and freedom to run arbitrary SELECTs plus vector ops as differentiators.
- Critics challenge the “state-of-the-art” and “intelligence explosion” language as marketing hyperbole and “charlatan-ish,” arguing the term is unprotected and overused.
- The author defends the claim by pointing to capabilities (agentic text-to-SQL workflows, multi-source embeddings), not formal benchmarks.
Models, cost, and local vs hosted
- Some don’t like burning paid Claude credits and ask for local LLaMA/Qwen support; others reply it’s “just a prompt” and any capable model could drive it, though quality differs.
- One defender notes that if users won’t pay for their own LLM usage, that’s their choice, but not a problem with the tool itself.
Security and sandboxing
- Multiple comments warn against recommending powerful permission flags or executing untrusted code without sandboxing; devcontainers and dedicated Claude sandboxes are discussed as minimum protections.
- Concerns are also raised about network egress and about trusting a non-established domain with that level of access.
Use cases and user reports
- People propose applications in autonomous academic agents, biomedical supplementary materials, string theory landscape searches, and watchdog uses (e.g., analyzing leaked data).
- A long report from one user/agent describes successfully building structured research corpora, discovering relevant prior work, and practical notes on latency and result limits.
Broader AI / AGI / Turing-test tangent
- Thread detours into what counts as AGI, "intelligence explosion," and the Turing test:
  - Some argue current LLMs would have been seen as AGI by older definitions; others strongly disagree, insisting AGI implies human-level generality or sentience.
  - There's debate over whether recent advances constitute an "intelligence explosion" or just efficiency improvements.
  - Several note that public and pop-culture notions of AGI (sentient, goal-directed agents) don't match today's prompt-bound models.