2024-12-25

Show HN: I made a website to semantically search ArXiv papers

Project scope and related tools

The site provides semantic search over arXiv, with companion versions for bioRxiv and medRxiv that are not yet fully synchronized.
Commenters compare it to Semantic Scholar, arxivxplorer, OpenAlex-based systems, Research Rabbit, and emergentmind, as well as academic search and workflow tools (undermind, scite, elicit, paper-qa, txtai, paperai, paperetl, Semantra).
Some suggest integrating with external tools (e.g., paper-qa, OpenReview) and drawing on existing arXiv embedding datasets.

Technical implementation and performance

Uses MixedBread embeddings, chosen for their small size, strong leaderboard performance, and support for binary and Matryoshka representations.
Embeddings are binarized and stored in Milvus; binary Hamming search yields large latency improvements (on the order of hundreds of milliseconds).
Acknowledged tradeoff: top ~10 results are similar to full-precision search, but quality drops quickly beyond that.
Suggestions include reranking a larger Hamming-retrieved candidate set with full-precision scores, using shorter embeddings instead of binarization, trying brute-force CPU search with SIMD, and exploring hybrid (keyword + vector) retrieval.
Weekly metadata updates are automated via a Hugging Face Space.
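
The binarize-then-rerank suggestion above can be illustrated with a minimal NumPy sketch. This is not the site's code: it assumes sign-based binarization, a brute-force Hamming scan instead of Milvus, and cosine similarity for the full-precision rerank. The function names and the `oversample` parameter are hypothetical.

```python
import numpy as np


def binarize(embeddings: np.ndarray) -> np.ndarray:
    """Pack the sign of each dimension into bits (1 if > 0, else 0)."""
    return np.packbits(embeddings > 0, axis=-1)


def hamming_distances(query_bits: np.ndarray, corpus_bits: np.ndarray) -> np.ndarray:
    """Hamming distance from one packed query to every packed corpus row."""
    xor = np.bitwise_xor(corpus_bits, query_bits)
    return np.unpackbits(xor, axis=-1).sum(axis=-1)


def search(query, corpus, corpus_bits, k=10, oversample=10):
    """Retrieve k * oversample candidates by Hamming distance,
    then rerank them with full-precision cosine similarity."""
    n_candidates = min(len(corpus), k * oversample)

    # Stage 1: cheap binary search over the whole corpus.
    dists = hamming_distances(binarize(query), corpus_bits)
    candidates = np.argsort(dists, kind="stable")[:n_candidates]

    # Stage 2: exact cosine scores, computed only for the candidates.
    cand_vecs = corpus[candidates]
    norms = np.linalg.norm(cand_vecs, axis=1) * np.linalg.norm(query)
    scores = (cand_vecs @ query) / norms
    order = np.argsort(-scores, kind="stable")[:k]
    return candidates[order], scores[order]


if __name__ == "__main__":
    # Random vectors stand in for real paper embeddings (an assumption).
    rng = np.random.default_rng(0)
    corpus = rng.standard_normal((1000, 128)).astype(np.float32)
    query = corpus[3] + 0.01 * rng.standard_normal(128).astype(np.float32)

    ids, scores = search(query, corpus, binarize(corpus), k=5)
    print(ids, scores)
```

Raising `oversample` trades some latency for recall beyond the first handful of results, which is where the thread reports binary-only quality falling off.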

Search quality: strengths and weaknesses

Semantic search surfaces conceptually similar work even without exact keyword overlap; several users report discovering new relevant papers.
Others find it weak for niche or overloaded terms (“leaky relu,” “wave function collapse algorithm,” specific astronomy sub-terms), where keyword-based arXiv/Scholar works better.
Some recommend domain-specific or fine-tuned models to improve technical term handling.
Author-name search is noted as a poor fit for pure semantic search.
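
The gap between keyword and semantic results described above is what hybrid retrieval tries to close. The thread does not say how the two result lists should be combined; one common option is reciprocal rank fusion (RRF), sketched here with made-up paper IDs and the conventional constant `k = 60` as assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    so items ranked highly by either retriever rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID for a deterministic order.
    return sorted(scores, key=lambda d: (-scores[d], d))


if __name__ == "__main__":
    # Hypothetical result lists for a query such as "leaky relu".
    keyword_hits = ["paper_a", "paper_b", "paper_c"]  # exact-term matches
    vector_hits = ["paper_b", "paper_d", "paper_a"]   # embedding neighbours
    print(reciprocal_rank_fusion([keyword_hits, vector_hits]))
```

Because RRF uses only ranks, it needs no calibration between keyword scores and vector similarities, which keeps it simple to bolt onto an existing semantic index.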

Feature requests and UX issues

Strong demand for filters, especially date filtering and recency sorting, and for denser result lists (e.g., collapsible abstracts).
Users ask for “similar papers” links, citation/review integration, and explanatory “how to use” guidance (e.g., best to paste abstracts or arXiv IDs).
Encoding and LaTeX/Markdown rendering bugs are reported.
Cloudflare bot challenges are a dealbreaker for some, sparking debate about alternatives and broader web centralization problems.

Use cases, limitations, and research workflows

Seen as useful for exploratory discovery and recommendations, but multiple commenters insist systematic reviews should not rely on semantic search or preprints alone.
Proposed workflows include literature reviews, technology watch for industry and tax credit work, internal document and code search, and local/offline search via Docker.
Some envision generative “literature overview” summaries over sets of retrieved papers as a next step.
