Show HN: I scraped 3B Goodreads reviews to train a better recommendation model

Overall quality and user experience

  • Many users report surprisingly good recommendations from just a few input books; they often find that 70–95% of suggestions are titles they have already read and liked.
  • Others find results “fine but not magical”: too close to bookstore-style “more of the same”, dominated by popular titles and bestsellers.
  • Works better with 3+ books or a Goodreads import; single‑book inputs tend to yield generic popular titles.
  • Site speed, simplicity, and lack of popups/logins are widely praised.

Series, authors, and diversity

  • Common complaint: recommendations over-focus on:
    • Later books in the same series.
    • Many titles from the same author.
  • Users want:
    • Option to hide sequels and/or authors already in the input.
    • Visual cues or separate sections for “in series” vs “other” books.
    • More diverse lists (fewer near-duplicates, less author/series repetition).
  • The author acknowledges series handling is the biggest weakness and has added a diversity reranker (e.g., maximal marginal relevance).
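
For readers unfamiliar with maximal marginal relevance (MMR), here is a minimal sketch of the general technique. It is not the author's implementation: the function names, the `lam` trade-off parameter, and the toy embeddings are all assumptions made for illustration. Each step picks the candidate that best balances relevance against similarity to what has already been selected.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def mmr_rerank(candidates, relevance, embeddings, k, lam=0.7):
    """Greedy MMR: lam weights relevance, (1 - lam) penalizes redundancy.

    candidates: list of book ids (hypothetical)
    relevance:  dict book id -> model score
    embeddings: dict book id -> vector used to measure similarity
    """
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def mmr_score(c):
            # Similarity to the closest already-selected book (0.0 for the first pick).
            max_sim = max(
                (cosine(embeddings[c], embeddings[s]) for s in selected),
                default=0.0,
            )
            return lam * relevance[c] - (1 - lam) * max_sim
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy data: two near-identical sequels and one unrelated book.
relevance = {"series_2": 0.9, "series_3": 0.85, "other": 0.6}
embeddings = {"series_2": [1.0, 0.0], "series_3": [0.99, 0.1], "other": [0.0, 1.0]}
diverse = mmr_rerank(list(relevance), relevance, embeddings, k=2, lam=0.5)
```

With `lam=0.5` the second slot goes to the unrelated book rather than the next sequel; with `lam=1.0` the reranker falls back to pure relevance order.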

Negative feedback, novelty, and long tail

  • Strong desire for explicit negative signals:
    • Mark “read, liked”, “read, didn’t like”, “hide”, or “meh” and rerun.
  • Many want better discovery:
    • Less emphasis on extremely popular books (Harry Potter, Sapiens, 1984, etc.).
    • Options to surface rarer / long‑tail titles and “deep cuts”.
    • Multiple recommendation modes (comfort zone vs exploration/serendipity).
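
One way the requests above could fit together is a post-processing step that drops books the user has marked as disliked or hidden and subtracts a popularity penalty from each score. The sketch below is only an illustration under assumed inputs: `rerank_for_discovery`, `alpha`, the log-based penalty, and the placeholder titles and rating counts are hypothetical and do not come from the site.

```python
import math

def rerank_for_discovery(scores, rating_counts, exclude=(), alpha=0.0):
    """Return book ids sorted by popularity-adjusted score.

    scores:        dict book id -> model score
    rating_counts: dict book id -> number of ratings (popularity proxy)
    exclude:       ids the user marked "didn't like" or "hide"
    alpha:         0.0 keeps the "comfort zone" order; larger values
                   push heavily rated books down ("exploration" mode)
    """
    excluded = set(exclude)
    adjusted = {
        book: score - alpha * math.log10(1 + rating_counts.get(book, 0))
        for book, score in scores.items()
        if book not in excluded
    }
    return sorted(adjusted, key=adjusted.get, reverse=True)

# Placeholder data, not real titles or counts.
scores = {"blockbuster": 0.9, "midlist": 0.8, "deep_cut": 0.7}
rating_counts = {"blockbuster": 5_000_000, "midlist": 50_000, "deep_cut": 500}

comfort = rerank_for_discovery(scores, rating_counts)
explore = rerank_for_discovery(scores, rating_counts, alpha=0.1)
without_disliked = rerank_for_discovery(scores, rating_counts, exclude=["midlist"])
```

Exposing `alpha` as a slider or as two presets would give the "comfort zone vs exploration" modes that commenters asked for, without retraining the model.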

Intersect feature

  • Concept (finding users who read multiple given books) is praised as powerful for hidden gems.
  • In practice, several users get:
    • No matches for long lists.
    • Only huge, likely fake accounts with tens of thousands of books and no ratings.
  • Suggested improvements:
    • Near matches when there is no exact overlap.
    • Filter by shelf size, remove obvious bots.
    • Optionally consider ratings (not just “read”).
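
The first two suggestions can be sketched as a single filter over user shelves. This assumes shelves are available as a mapping from user id to a set of book ids. The names `intersect_readers`, `min_overlap`, and `max_shelf_size`, and the 5,000-book cutoff, are hypothetical choices for illustration, not the site's actual logic.

```python
def intersect_readers(shelves, query_books, min_overlap=None, max_shelf_size=5000):
    """Find users whose shelves overlap with query_books.

    shelves:        dict user id -> set of book ids
    min_overlap:    None requires every query book (exact match);
                    a smaller number allows near matches
    max_shelf_size: skip implausibly large shelves (likely bots)
    Returns (user, overlap) pairs, best overlap first.
    """
    query = set(query_books)
    needed = len(query) if min_overlap is None else min_overlap
    matches = []
    for user, books in shelves.items():
        if len(books) > max_shelf_size:
            continue  # shelf too large to be a real reader
        overlap = len(query & books)
        if overlap >= needed:
            matches.append((user, overlap))
    # Sort by overlap descending, then user id for a stable order.
    matches.sort(key=lambda m: (-m[1], m[0]))
    return matches

# Toy shelves: one exact match, one near match, one oversized account.
shelves = {
    "ann": {"a", "b", "c", "x"},
    "bob": {"a", "b", "y"},
    "bot": {str(i) for i in range(20000)} | {"a", "b", "c"},
}
exact = intersect_readers(shelves, ["a", "b", "c"])
near = intersect_readers(shelves, ["a", "b", "c"], min_overlap=2)
```

A caller could try the exact match first and relax `min_overlap` step by step only when the result is empty, which addresses the "no matches for long lists" complaint.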

Technical and architectural discussion

  • The author uses a SASRec-style sequential transformer to predict the “next book in sequence”.
  • Other practitioners suggest exploring HSTU/OneRec, BERT4Rec, TIGER, and hybrid stacks (content-based, graph-based, TF‑IDF/BM25) combined for novelty and serendipity.
  • Infrastructure details (Hetzner server, Meilisearch ~40GB, GPU inference) and “how it works” attract interest; some ask for open-sourcing and an API.
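
The thread does not say how a hybrid stack should be combined. One common, model-agnostic option is reciprocal rank fusion (RRF), which merges ranked lists from different recommenders without needing their scores to be comparable. The sketch below assumes each recommender returns an ordered list of book ids; the list names and contents are placeholders, and `k=60` is the constant conventionally used with RRF.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of book ids into one.

    Each book earns 1 / (k + rank) from every list it appears in;
    books ranked well by several recommenders rise to the top.
    """
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, book in enumerate(ranking, start=1):
            fused[book] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(fused, key=lambda b: (-fused[b], b))

# Placeholder outputs from three hypothetical recommenders.
sequential_model = ["a", "b", "c"]
content_based = ["c", "a", "d"]
bm25_text_match = ["c", "d", "a"]

merged = reciprocal_rank_fusion([sequential_model, content_based, bm25_text_match])
```

Because RRF uses only ranks, a content-based or BM25 list can pull in titles the sequential model never surfaced, which is the kind of novelty the commenters were after.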

Scraping, legality, and ethics

  • Significant debate over scraping 3B Goodreads reviews:
    • Critics cite robots.txt and ToS; some reviewers feel their work was “stolen” or used without consent.
    • Others argue the data is already public and heavily scraped, and see this use as relatively harmless and noncommercial.
    • Legal status of redistributing the dataset is seen as risky; the author declines to share raw data and points to an academic dataset instead.
  • Some users request removal of their data from the system; others express general discomfort with their reviews being used for ML at all.

Safety and privacy concerns

  • Worry that the “intersect” feature could be abused to profile readers of controversial books.
  • Suggestions to treat some titles as “always private” for intersections or allow community-maintained sensitive lists.
  • The author notes that Goodreads itself already exposes similar user–book associations, says private accounts are not included, and offers an opt‑out mechanism.