Show HN: I made a website to semantically search ArXiv papers
Project scope and related tools
- Site provides semantic search over arXiv, with companion versions for bioRxiv and medRxiv (not yet fully synchronized).
- Commenters compare it to Semantic Scholar, arxivxplorer, OpenAlex-based systems, Research Rabbit, and emergentmind, as well as academic search and workflow tools (undermind, scite, elicit, paper-qa, txtai, paperai, paperetl, Semantra).
- Some suggest integrating with external tools (e.g., paper-qa, OpenReview) and drawing on existing arXiv embeddings datasets.
Technical implementation and performance
- Uses MixedBread embeddings, chosen for their small size, strong leaderboard performance, and support for binary and Matryoshka representations.
- Embeddings are binarized and stored in Milvus; searching by Hamming distance over the binary codes cuts query latency substantially (by roughly hundreds of ms).
- Acknowledged tradeoff: top ~10 results are similar to full-precision search, but quality drops quickly beyond that.
- Suggestions include reranking a larger Hamming-retrieved candidate set with full-precision scores, using shorter embeddings instead of binarization, trying brute-force CPU search with SIMD, and exploring hybrid (keyword + vector) retrieval.
- Weekly metadata updates are automated via a Hugging Face Space.
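
The binarize, Hamming-retrieve, then rerank idea discussed above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the site's actual pipeline: it uses random unit vectors in place of MixedBread embeddings, plain Python lists in place of Milvus, and the function names (`binarize`, `hamming`, `search`) and the `oversample` parameter are hypothetical.

```python
import math
import random

def normalize(vec):
    """Scale a vector to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def binarize(vec):
    """Pack the sign of each dimension into one int (1 bit per dimension)."""
    bits = 0
    for x in vec:
        bits = (bits << 1) | (1 if x > 0 else 0)
    return bits

def hamming(a, b):
    """Count the differing bits between two packed binary codes."""
    return bin(a ^ b).count("1")

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def search(query, docs, codes, k=10, oversample=4):
    """Retrieve k * oversample candidates by Hamming distance,
    then rerank only those candidates with full-precision dot products."""
    q_code = binarize(query)
    by_hamming = sorted(range(len(docs)), key=lambda i: hamming(q_code, codes[i]))
    candidates = by_hamming[: k * oversample]
    return sorted(candidates, key=lambda i: dot(query, docs[i]), reverse=True)[:k]

# Toy data: random vectors standing in for real embeddings (an assumption).
rng = random.Random(0)
dim = 64
docs = [normalize([rng.gauss(0, 1) for _ in range(dim)]) for _ in range(200)]
codes = [binarize(d) for d in docs]

query = docs[17]  # querying with a stored vector should return it first
top = search(query, docs, codes, k=5)
```

The cheap Hamming pass only has to get the right documents into the candidate pool; the full-precision pass then fixes their order, which is why commenters suggested it as a remedy for quality falling off beyond the top results.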
Search quality: strengths and weaknesses
- Semantic search surfaces conceptually similar work even without exact keyword overlap; several users report discovering new relevant papers.
- Others find it weak for niche or overloaded terms (“leaky relu,” “wave function collapse algorithm,” specific astronomy sub-terms), where keyword-based search on arXiv/Scholar works better.
- Some recommend domain-specific or fine-tuned models to improve technical term handling.
- Author-name search is noted as a poor fit for pure semantic search.
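
The hybrid (keyword + vector) retrieval suggested in the thread is one way to cover exact terms and author names that pure semantic search handles poorly. The thread does not say how the two result lists would be merged; the sketch below assumes reciprocal rank fusion, a common choice, and uses made-up document IDs and a hypothetical `rrf` helper.

```python
def rrf(rankings, k=60):
    """Fuse several ranked lists of document IDs with reciprocal rank fusion.

    Each document scores 1 / (k + rank) in every list it appears in;
    the scores are summed and documents are returned best first.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Break ties by ID so the output order is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Placeholder IDs, not real papers.
keyword_hits = ["paper_a", "paper_b", "paper_c"]
vector_hits = ["paper_c", "paper_a", "paper_d"]

fused = rrf([keyword_hits, vector_hits])
```

Because the fusion works on ranks rather than raw scores, it needs no calibration between keyword scores and vector distances, which makes it a low-effort first experiment.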
Feature requests and UX issues
- Strong demand for filtering and sorting, especially by date/recency, and for denser results (collapsible abstracts).
- Users ask for “similar papers” links, citation/review integration, and explanatory “how to use” guidance (e.g., best to paste abstracts or arXiv IDs).
- Encoding and LaTeX/Markdown rendering bugs are reported.
- Cloudflare bot challenges are a dealbreaker for some, sparking debate about alternatives and broader web centralization problems.
Use cases, limitations, and research workflows
- Seen as useful for exploratory discovery and recommendations, but multiple commenters insist systematic reviews should not rely on semantic search or preprints alone.
- Proposed workflows include literature reviews, technology watch for industry and tax credit work, internal document and code search, and local/offline search via Docker.
- Some envision generative “literature overview” summaries over sets of retrieved papers as a next step.