2025-11-05

Show HN: I scraped 3B Goodreads reviews to train a better recommendation model

Overall quality and user experience

Many users report surprisingly good recommendations from just a few books; often 70–95% of suggestions are titles they’ve already read and liked.
Others find results “fine but not magical”: too close to bookstore-style “more of the same”, dominated by popular titles and bestsellers.
The tool works better with three or more input books or a Goodreads import; single‑book inputs tend to pull in generic popular titles.
Site speed, simplicity, and lack of popups/logins are widely praised.

Series, authors, and diversity

A common complaint is that recommendations over-focus on:
- Later books in the same series.
- Many titles from the same author.
Users want:
- Option to hide sequels and/or authors already in the input.
- Visual cues or separate sections for “in series” vs “other” books.
- More diverse lists (fewer near-duplicates, less author/series repetition).
The author acknowledges series handling is the biggest weakness and has added a diversity reranker (e.g., maximal marginal relevance).
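
For readers unfamiliar with maximal marginal relevance (MMR), the idea can be sketched as follows. This is a minimal illustration, not the author's implementation: the function names, the `lam` trade-off parameter, and the use of cosine similarity over per-book embedding vectors are all assumptions made for the example.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def mmr_rerank(candidates, relevance, embeddings, k, lam=0.7):
    """Greedy MMR (hypothetical helper): repeatedly pick the candidate that
    maximises lam * relevance - (1 - lam) * (max similarity to picks so far)."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def score(item):
            redundancy = max(
                (cosine(embeddings[item], embeddings[s]) for s in selected),
                default=0.0,
            )
            return lam * relevance[item] - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy data (made up): two near-identical sequels and one unrelated title.
relevance = {"dune_2": 0.90, "dune_3": 0.88, "hyperion": 0.60}
embeddings = {"dune_2": [1.0, 0.0], "dune_3": [0.99, 0.1], "hyperion": [0.0, 1.0]}
diverse = mmr_rerank(list(relevance), relevance, embeddings, k=2, lam=0.5)
plain = mmr_rerank(list(relevance), relevance, embeddings, k=2, lam=1.0)
```

With `lam=1.0` the reranker falls back to pure relevance and returns both sequels; lowering `lam` trades some relevance for a list with fewer near-duplicates.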

Negative feedback, novelty, and long tail

Strong desire for explicit negative signals:
- Mark “read, liked”, “read, didn’t like”, “hide”, or “meh” and rerun.
Many want better discovery:
- Less emphasis on extremely popular books (Harry Potter, Sapiens, 1984, etc.).
- Options to surface rarer / long‑tail titles and “deep cuts”.
- Multiple recommendation modes (comfort zone vs exploration/serendipity).
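
One way the requests above could fit together is a post-processing step that drops books the user has flagged and subtracts a popularity penalty whose strength depends on the chosen mode. The sketch below is only an illustration of that idea: the log-popularity penalty, the `novelty` knob, and all names and numbers are assumptions, not anything described in the thread.

```python
from math import log

def rerank_with_feedback(scores, rating_counts, excluded, novelty=0.5):
    """Hypothetical helper.
    scores:        book -> model score (higher is better)
    rating_counts: book -> number of ratings, used as a popularity proxy
    excluded:      books marked read / disliked / hidden / meh by the user
    novelty:       0.0 = comfort zone; larger values favour long-tail titles
    """
    adjusted = {}
    for book, score in scores.items():
        if book in excluded:
            continue  # explicit negative signal: never show again
        penalty = novelty * log(1 + rating_counts.get(book, 0))
        adjusted[book] = score - penalty
    return sorted(adjusted, key=adjusted.get, reverse=True)

# Made-up scores and rating counts for illustration only.
scores = {"harry_potter": 5.0, "sapiens": 4.8, "meh_book": 4.5, "deep_cut": 4.0}
counts = {"harry_potter": 9_000_000, "sapiens": 1_000_000,
          "meh_book": 50_000, "deep_cut": 2_000}
comfort = rerank_with_feedback(scores, counts, excluded={"meh_book"}, novelty=0.0)
explore = rerank_with_feedback(scores, counts, excluded={"meh_book"}, novelty=0.5)
```

In this toy example the "comfort zone" setting keeps the bestsellers on top, while the "exploration" setting moves the rarely rated title to first place.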

Intersect feature

Concept (finding users who read multiple given books) is praised as powerful for hidden gems.
In practice, several users get:
- No matches for long lists.
- Only huge, likely fake accounts with tens of thousands of books and no ratings.
Suggested improvements:
- Near matches when no exact overlap.
- Filter by shelf size, remove obvious bots.
- Optionally consider ratings (not just “read”).
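
The suggested improvements can be sketched as a single filter over user shelves. This is an assumption-laden illustration rather than how the site works: the data layout (`user -> {book: rating or None}`), the parameter names, and the 5,000-book cutoff for suspicious accounts are all hypothetical.

```python
def intersect_readers(shelves, wanted, min_overlap=None,
                      max_shelf_size=5000, min_rating=None):
    """Hypothetical helper. Returns [(user, overlap_count)], best overlap first.
    min_overlap=None means an exact match on every wanted book;
    a smaller number allows near matches."""
    wanted = set(wanted)
    required = len(wanted) if min_overlap is None else min_overlap
    matches = []
    for user, shelf in shelves.items():
        if len(shelf) > max_shelf_size:
            continue  # crude filter for huge, likely fake accounts
        hits = 0
        for book in wanted:
            if book not in shelf:
                continue
            rating = shelf[book]
            if min_rating is not None and (rating is None or rating < min_rating):
                continue  # only count books the user actually rated well
            hits += 1
        if hits >= required:
            matches.append((user, hits))
    matches.sort(key=lambda m: (-m[1], m[0]))
    return matches

# Toy shelves (made up): one real reader, one partial match, one bot-like account.
shelves = {
    "alice": {"a": 5, "b": 4, "c": 5},
    "bob": {"a": 3, "b": None},
    "bot": {f"book{i}": None for i in range(20000)},
}
shelves["bot"].update({"a": None, "b": None, "c": None})

exact = intersect_readers(shelves, ["a", "b", "c"])
near = intersect_readers(shelves, ["a", "b", "c"], min_overlap=2)
liked = intersect_readers(shelves, ["a", "b", "c"], min_overlap=2, min_rating=4)
```

A real system would probably want a smarter bot heuristic than shelf size alone (for example, also checking for accounts with no ratings at all), but the shape of the query stays the same.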

Technical and architectural discussion

The author uses a SASRec-style sequential transformer to predict the “next book in sequence”.
Other practitioners suggest exploring HSTU/OneRec, BERT4Rec, TIGER, and hybrid stacks (content-based, graph-based, TF‑IDF/BM25) combined for novelty and serendipity.
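
The thread does not say how such a hybrid stack would merge its outputs. One common, simple option is reciprocal rank fusion (RRF), shown here purely as an illustration; the choice of RRF, the function name, the constant `k=60`, and the toy rankings are assumptions for this sketch.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge several ranked lists: each book scores sum(1 / (k + rank)),
    with rank starting at 1 in each list. Ties break alphabetically."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, book in enumerate(ranking, start=1):
            scores[book] += 1.0 / (k + rank)
    return sorted(scores, key=lambda b: (-scores[b], b))

# Made-up outputs from three hypothetical recommenders.
sequential = ["a", "b", "c"]   # e.g. a sequential transformer
content = ["c", "a", "d"]      # e.g. BM25 over book descriptions
graph = ["c", "d", "a"]        # e.g. a co-read graph
fused = reciprocal_rank_fusion([sequential, content, graph])
```

Because RRF only looks at ranks, it avoids having to calibrate scores across models that use very different scales.
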
Infrastructure details (Hetzner server, Meilisearch ~40GB, GPU inference) and “how it works” attract interest; some ask for open-sourcing and an API.

Scraping, legality, and ethics

Significant debate over scraping 3B Goodreads reviews:
- Critics cite robots.txt and ToS; some reviewers feel their work was “stolen” or used without consent.
- Others argue the data is already public and heavily scraped, and see this use as relatively harmless and noncommercial.
- Legal status of redistributing the dataset is seen as risky; the author declines to share raw data and points to an academic dataset instead.
Some users request removal of their data from the system; others express general discomfort with their reviews being used for ML at all.

Safety and privacy concerns

Worry that the “intersect” feature could be abused to profile readers of controversial books.
Suggestions to treat some titles as “always private” for intersections or allow community-maintained sensitive lists.
The author notes that Goodreads itself already exposes similar user–book associations, says private accounts are not included, and offers an opt‑out mechanism.
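
The "always private" and opt-out ideas could be enforced as a filtering step before any intersection query runs. The sketch below is hypothetical: the function name, the data layout, and the example lists are assumptions, and it does not describe the site's actual opt-out mechanism.

```python
def apply_privacy_filters(wanted, shelves, sensitive_titles, opted_out_users):
    """Hypothetical helper.
    Drops sensitive titles from the query and from every shelf,
    and removes users who opted out entirely."""
    queryable = [book for book in wanted if book not in sensitive_titles]
    visible = {
        user: {b: r for b, r in shelf.items() if b not in sensitive_titles}
        for user, shelf in shelves.items()
        if user not in opted_out_users
    }
    return queryable, visible

# Toy data (made up): "s" stands in for a title on a sensitive list.
shelves = {"u1": {"a": 5, "s": 4}, "u2": {"a": 3}}
queryable, visible = apply_privacy_filters(
    wanted=["a", "s"],
    shelves=shelves,
    sensitive_titles={"s"},
    opted_out_users={"u2"},
)
```

Filtering before the query, rather than hiding results afterwards, means sensitive titles never influence which readers are matched.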
