Auto-grading decade-old Hacker News discussions with hindsight

Overall reaction to the experiment

  • Many find the project clever and fun: using LLMs to hindsight‑grade decade‑old HN threads nicely showcases how cheap large‑scale text analysis has become (the general approach is sketched after this list).
  • Others see it as emblematic of “AI slop”: vibe‑coded, interesting as a toy, but not rigorous enough for serious conclusions.
  • Several commenters want more: repeat this for more years, make browser extensions to surface “top predictors,” and run it on their own comment histories or email archives.
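
For readers who have not seen the underlying project, the core loop is simple: feed an old comment plus today's knowledge to a model and ask for a graded verdict. The sketch below is purely illustrative, assuming an OpenAI-compatible chat API; the prompt wording, model choice, and score scale are invented here, not taken from the author's actual pipeline.

    # Illustrative only: a minimal hindsight-grading loop. Prompt, model,
    # and score scale are assumptions, not the article's pipeline.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    PROMPT = (
        "You are grading a Hacker News comment written in {year}, with the "
        "benefit of hindsight in {now}. List any explicit or implicit "
        "predictions it makes, then rate how well they held up on a 1-10 "
        "scale. Reply as JSON: {{\"predictions\": [], \"score\": 0, \"rationale\": \"\"}}"
    )

    def grade_comment(text: str, year: int, now: int = 2025) -> str:
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # any capable model works here
            temperature=0,        # keep grading as deterministic as possible
            messages=[
                {"role": "system", "content": PROMPT.format(year=year, now=now)},
                {"role": "user", "content": text},
            ],
        )
        return response.choices[0].message.content

    if __name__ == "__main__":
        print(grade_comment("The iPad is just a big iPhone; it will flop.", year=2010))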

Surveillance, panopticon, and dystopia

  • The “LLMs are watching, best to be good” line draws strong pushback for being dystopian.
  • Critics argue it normalizes a panopticon: everything logged now, reconstructed and judged later by states, corporations, or AIs.
  • Some tie this to existing mass surveillance (e.g. post‑Snowden) and see LLMs as a new analysis layer on already‑captured data.
  • A few propose resistance strategies: refusing to “build the torment nexus,” socially stigmatizing such work, poisoning data, and shifting norms around what to post at all.

Quality, bias, and misuse of LLM grading

  • Multiple spot‑checks turn up hallucinated “predictions,” misread nuance, and comments that were merely history lessons or statements of preference graded as if they were forecasts.
  • The model often rewards consensus or “aligned” viewpoints, which critics say effectively grades conformity rather than prescience.
  • Well‑known users appear to be recognized (by username or writing style), raising concerns about identity bias; some suggest anonymization or style normalization (a naive version is sketched after this list), while others note stylometry makes that hard.
  • Commenters worry results will be over‑trusted and that similar methods could be applied to high‑stakes domains without proper validation.
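
One mitigation raised in the thread, stripping usernames before grading, is easy to sketch, though it does nothing about stylometry, which commenters point out is the harder problem. The snippet below is a naive illustration with made-up usernames, not a claim about what the project does.

    import re

    def anonymize(comment: str, known_users: list[str]) -> str:
        """Replace known usernames and @-mentions with neutral placeholders
        before the comment is sent to the grading model. Purely illustrative:
        regex substitution does not defeat stylometric identification."""
        text = comment
        for i, user in enumerate(known_users):
            text = re.sub(rf"\b{re.escape(user)}\b", f"user_{i}", text)
        # Catch any remaining @-mentions not covered by the known-user list.
        text = re.sub(r"@\w+", "user_x", text)
        return text

    print(anonymize("I agree with alice_hn here; bob1999 is wrong.", ["alice_hn", "bob1999"]))
    # -> "I agree with user_0 here; user_1 is wrong."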

Predictions, forecasting, and ‘boring but right’

  • A recurring observation: many highly rated comments are status‑quo takes or “boring but right” predictions, not bold contrarian calls.
  • Several argue a good evaluation should weight falsifiability, non‑triviality, and how far off‑consensus a prediction was at the time.
  • Prediction markets, calibration training, and explicit probabilistic forecasts are mentioned as more principled alternatives.
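
For context on those “more principled alternatives”: explicit probabilistic forecasts are usually scored with something like the Brier score, which only works when the original claim was stated as a probability. A minimal example, with hypothetical forecasts:

    def brier_score(forecasts: list[tuple[float, bool]]) -> float:
        """Mean squared error between stated probabilities and outcomes:
        0.0 is perfect, 0.25 is always saying 50%, 1.0 is confidently wrong."""
        return sum((p - float(hit)) ** 2 for p, hit in forecasts) / len(forecasts)

    # Hypothetical 2015-era predictions, scored with hindsight.
    history = [
        (0.9, True),   # "90% chance smartphones keep dominating" -- held up
        (0.7, False),  # "70% chance consumer VR is mainstream by 2020" -- did not
        (0.2, False),  # "20% chance Bitcoin goes to zero" -- did not
    ]
    print(f"Brier score: {brier_score(history):.3f}")  # lower is better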

Reputation systems and scoring users

  • Some are excited by the idea of long‑term accuracy scores per user and weighting upvotes by forecaster quality (sketched after this list), potentially improving discussion quality.
  • Others warn this would shrink communities, intensify echo chambers, and incentivize ultra‑safe takes.
  • There are comparisons to older systems (Slashdot meta‑moderation, Reddit tools, “superforecasters”), and suggestions to focus on grading atomic facts or explicit predictions instead of free‑form commentary.
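
To make the “weight upvotes by forecaster quality” idea concrete, the sketch below assumes each voter carries a long-run accuracy in [0, 1]; note how high-accuracy voters dominate the score, which is exactly the dynamic the echo-chamber critics worry about. Names and numbers are invented.

    def weighted_score(votes: dict[str, int], accuracy: dict[str, float]) -> float:
        """Sum of +1/-1 votes, each scaled by the voter's historical prediction
        accuracy; voters with no track record get a neutral weight of 0.5."""
        return sum(v * accuracy.get(user, 0.5) for user, v in votes.items())

    # Invented voters and accuracies, for illustration only.
    votes = {"forecaster_a": +1, "forecaster_b": +1, "casual_lurker": -1}
    accuracy = {"forecaster_a": 0.92, "forecaster_b": 0.35}
    print(weighted_score(votes, accuracy))  # 0.92 + 0.35 - 0.5 = 0.77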

HN, archives, and meta

  • Commenters praise HN as a “good web citizen”: stable URLs, public archives, and tools like thread replayers make this kind of retrospective possible.
  • There’s discussion of timestamp manipulation via the “second chance pool” and whether that misrepresents chronology.
  • Some note HN’s tendency toward meta‑obsession, while moderators acknowledge “meta as catnip” but treat this thread as an exception.