Show HN: I scraped 3B Goodreads reviews to train a better recommendation model
Overall quality and user experience
- Many users report surprisingly good recommendations from just a few books; often 70–95% of suggestions are titles they’ve already read and liked.
- Others find results “fine but not magical”: too close to bookstore-style “more of the same”, dominated by popular titles and bestsellers.
- Works better with three or more books or a Goodreads import; single‑book inputs tend to pull in generic popular titles.
- Site speed, simplicity, and lack of popups/logins are widely praised.
Series, authors, and diversity
- Common complaint: recommendations over-focus on:
  - Later books in the same series.
  - Many titles from the same author.
- Users want:
  - Option to hide sequels and/or authors already in the input.
  - Visual cues or separate sections for “in series” vs “other” books.
  - More diverse lists (fewer near-duplicates, less author/series repetition).
- The author acknowledges series handling is the biggest weakness and has added a diversity reranker (e.g., maximal marginal relevance).
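For readers unfamiliar with maximal marginal relevance (MMR), here is a minimal sketch of the general technique. It is not the author's implementation: the function names, the toy embeddings, and the `lam` trade-off value are all assumptions made for illustration. Each step picks the candidate with the best balance of model relevance and dissimilarity to what has already been picked.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_rerank(relevance, embeddings, k, lam=0.7):
    """Greedy MMR: pick k books, trading relevance against similarity to earlier picks.

    relevance:  dict book -> model score (hypothetical input format)
    embeddings: dict book -> vector used to measure similarity (hypothetical)
    lam:        1.0 means pure relevance; lower values push harder for diversity
    """
    selected = []
    remaining = list(relevance)
    while remaining and len(selected) < k:
        def mmr_score(book):
            # Penalize a candidate by its closest match among books already chosen.
            max_sim = max(
                (cosine(embeddings[book], embeddings[s]) for s in selected),
                default=0.0,
            )
            return lam * relevance[book] - (1 - lam) * max_sim

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data (made up): two near-identical series entries and one unrelated book.
relevance = {"series_book_2": 0.90, "series_book_3": 0.85, "unrelated_book": 0.60}
embeddings = {
    "series_book_2": [1.0, 0.0],
    "series_book_3": [0.99, 0.1],
    "unrelated_book": [0.0, 1.0],
}

diverse = mmr_rerank(relevance, embeddings, k=2, lam=0.5)
by_relevance = mmr_rerank(relevance, embeddings, k=2, lam=1.0)
```

With `lam=1.0` the two series entries fill both slots; with `lam=0.5` the second sequel is displaced by the unrelated book, which is the behavior users in the thread were asking for.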
Negative feedback, novelty, and long tail
- Strong desire for explicit negative signals:
  - Mark “read, liked”, “read, didn’t like”, “hide”, or “meh” and rerun.
- Many want better discovery:
  - Less emphasis on extremely popular books (Harry Potter, Sapiens, 1984, etc.).
  - Options to surface rarer / long‑tail titles and “deep cuts”.
  - Multiple recommendation modes (comfort zone vs exploration/serendipity).
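One way these requests could be combined is a post-hoc rerank that drops hidden or disliked books and subtracts a popularity penalty whose strength acts as the "mode" dial. The sketch below is only an illustration under assumed inputs: `rerank_for_mode`, the `exploration` parameter, the log-based penalty, and the toy rating counts are all hypothetical and do not describe how the site works.

```python
from math import log10


def rerank_for_mode(scores, rating_counts, hidden=frozenset(), exploration=0.0):
    """Rerank model scores with a popularity penalty and an exclusion list.

    scores:        dict book -> model score (hypothetical input format)
    rating_counts: dict book -> number of ratings, used as a popularity proxy
    hidden:        books the user marked "hide" or "read, didn't like"
    exploration:   0.0 keeps the comfort-zone ranking; larger values favor the long tail
    """
    adjusted = {}
    for book, score in scores.items():
        if book in hidden:
            continue  # treat negative feedback as a hard filter (an assumption)
        penalty = exploration * log10(1 + rating_counts.get(book, 0))
        adjusted[book] = score - penalty
    return sorted(adjusted, key=adjusted.get, reverse=True)


# Toy data (made up): two bestsellers and one long-tail title.
scores = {"bestseller_a": 0.90, "bestseller_b": 0.80, "deep_cut": 0.70}
rating_counts = {"bestseller_a": 1_000_000, "bestseller_b": 500_000, "deep_cut": 1_000}

comfort = rerank_for_mode(scores, rating_counts)
explore = rerank_for_mode(scores, rating_counts, exploration=0.1)
explore_without_a = rerank_for_mode(
    scores, rating_counts, hidden={"bestseller_a"}, exploration=0.1
)
```

A hard filter is the simplest reading of "hide"; a softer variant could instead feed disliked books back into the model as negative examples, which this sketch does not attempt.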
Intersect feature
- Concept (finding users who read multiple given books) is praised as powerful for hidden gems.
- In practice, several users get:
  - No matches for long lists.
  - Only huge, likely fake accounts with tens of thousands of books and no ratings.
- Suggested improvements:
  - Near matches when no exact overlap.
  - Filter by shelf size, remove obvious bots.
  - Optionally consider ratings (not just “read”).
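The three suggested improvements can be sketched together in a few lines. This is a hypothetical illustration, not the site's code: the `shelves` data layout, the `intersect_readers` name, and the default shelf-size cutoff of 5000 are assumptions chosen for the example.

```python
def intersect_readers(shelves, query_books, min_overlap=None,
                      max_shelf_size=5000, min_rating=None):
    """Find users whose shelves overlap with the query books.

    shelves:        dict user -> dict book -> rating (1-5, or None if unrated)
    min_overlap:    None requires every query book; a smaller number allows near matches
    max_shelf_size: skip implausibly large shelves (a crude bot filter)
    min_rating:     if set, a book only counts when the user rated it at least this high
    """
    query = set(query_books)
    needed = len(query) if min_overlap is None else min_overlap
    matches = []
    for user, shelf in shelves.items():
        if len(shelf) > max_shelf_size:
            continue
        overlap = 0
        for book in query:
            if book not in shelf:
                continue
            rating = shelf[book]
            if min_rating is None or (rating is not None and rating >= min_rating):
                overlap += 1
        if overlap >= needed:
            matches.append((user, overlap))
    # Best overlap first; user name breaks ties so the order is deterministic.
    return sorted(matches, key=lambda m: (-m[1], m[0]))


# Toy data (made up): two readers and one oversized, unrated shelf.
bot_shelf = {f"book_{i}": None for i in range(20000)}
bot_shelf.update({"A": None, "B": None, "C": None})
shelves = {
    "alice": {"A": 5, "B": 4, "C": 5, "X": 5},
    "bob": {"A": 2, "B": 5, "Y": 4},
    "bot": bot_shelf,
}

exact = intersect_readers(shelves, ["A", "B", "C"])
near = intersect_readers(shelves, ["A", "B", "C"], min_overlap=2)
liked = intersect_readers(shelves, ["A", "B", "C"], min_overlap=2, min_rating=4)
```

Relaxing `min_overlap` is what turns an empty result for a long list into a ranked list of near matches, while the shelf-size cutoff keeps the oversized account out of every result.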
Technical and architectural discussion
- The author uses a SASRec-style sequential transformer that treats a reader’s books as a sequence and predicts the “next book in sequence”.
- Other practitioners suggest exploring HSTU/OneRec, BERT4Rec, TIGER, and hybrid stacks (content-based, graph-based, TF‑IDF/BM25) combined for novelty and serendipity.
- Infrastructure details (Hetzner server, Meilisearch ~40GB, GPU inference) and “how it works” attract interest; some ask for open-sourcing and an API.
Scraping, legality, and ethics
- Significant debate over scraping 3B Goodreads reviews:
  - Critics cite robots.txt and ToS; some reviewers feel their work was “stolen” or used without consent.
  - Others argue the data is already public and heavily scraped; see this use as relatively harmless and noncommercial.
- Legal status of redistributing the dataset is seen as risky; the author declines to share raw data and points to an academic dataset instead.
- Some users request removal of their data from the system; others express general discomfort with their reviews being used for ML at all.
Safety and privacy concerns
- Worry that the “intersect” feature could be abused to profile readers of controversial books.
- Suggestions to treat some titles as “always private” for intersections or allow community-maintained sensitive lists.
- The author notes that Goodreads itself already exposes similar user–book associations, says private accounts are not included, and offers an opt‑out mechanism.