Auto-grading decade-old Hacker News discussions with hindsight
Overall reaction to the experiment
- Many find the project clever and fun: using LLMs to hindsight‑grade decade‑old HN threads nicely showcases how cheap large‑scale text analysis has become (a rough pipeline sketch follows this list).
- Others see it as emblematic of “AI slop”: vibe‑coded, interesting as a toy, but not rigorous enough for serious conclusions.
- Several commenters want more: repeat this for more years, make browser extensions to surface “top predictors,” and run it on their own comment histories or email archives.
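For readers who want a concrete picture of the setup being discussed, here is a minimal sketch of that kind of pipeline, assuming the public HN Firebase API for fetching items and a placeholder `grade_comment` function standing in for whatever LLM call the actual project makes.

```python
# Sketch only: fetch a decade-old thread from the public HN API and grade each
# top-level comment with hindsight. grade_comment() is a hypothetical callable
# (e.g. a prompt to an LLM) supplied by the caller; the item URL is the real
# HN Firebase endpoint.
import json
import urllib.request

HN_ITEM = "https://hacker-news.firebaseio.com/v0/item/{}.json"

def fetch_item(item_id):
    with urllib.request.urlopen(HN_ITEM.format(item_id)) as resp:
        return json.load(resp)

def grade_thread(story_id, grade_comment):
    """Return (author, score) pairs for each top-level comment on a story."""
    story = fetch_item(story_id)
    results = []
    for kid in story.get("kids", []):
        comment = fetch_item(kid)
        if comment and "text" in comment:  # skip deleted/dead comments
            results.append((comment.get("by"), grade_comment(comment["text"])))
    return results
```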
Surveillance, panopticon, and dystopia
- The “LLMs are watching, best to be good” line triggers strong pushback as dystopian.
- Critics argue it normalizes a panopticon: everything logged now, reconstructed and judged later by states, corporations, or AIs.
- Some tie this to existing mass surveillance (e.g. post‑Snowden) and see LLMs as a new analysis layer on already‑captured data.
- A few propose resistance strategies: refusing to “build the torment nexus,” socially stigmatizing such work, poisoning data, and shifting norms around what to post at all.
Quality, bias, and misuse of LLM grading
- Multiple spot‑checks turn up hallucinated “predictions,” misread nuance, and history lessons or statements of preference graded as if they were forecasts.
- The model often rewards consensus or “aligned” viewpoints, which critics say effectively grades conformity rather than prescience.
- Well‑known users appear to be recognized from their usernames or writing style, raising concerns about identity bias; some suggest anonymizing or style‑normalizing comments before grading (see the sketch after this list), while others note stylometry makes that hard.
- Commenters worry results will be over‑trusted and that similar methods could be applied to high‑stakes domains without proper validation.
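Since several commenters suggest anonymizing comments before grading, here is a minimal sketch of what that pre‑processing might look like, assuming comments arrive as dicts with `author` and `text` fields (an assumed schema, not the project's actual one); as others point out, stylometry can still re‑identify authors.

```python
# Hypothetical pre-processing step: replace usernames with stable placeholders
# before sending comments to the grading model, reducing identity cues from
# names (though writing style still leaks identity).
import re

def anonymize_thread(comments):
    """comments: list of dicts with 'author' and 'text' keys (assumed schema)."""
    alias = {}
    for c in comments:
        alias.setdefault(c["author"], f"user_{len(alias) + 1}")

    anonymized = []
    for c in comments:
        text = c["text"]
        # Also replace in-text mentions of any known author with their alias.
        for author, placeholder in alias.items():
            text = re.sub(rf"\b{re.escape(author)}\b", placeholder, text)
        anonymized.append({"author": alias[c["author"]], "text": text})
    return anonymized
```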
Predictions, forecasting, and ‘boring but right’
- A recurring observation: many highly rated comments are status‑quo takes or “boring but right” predictions, not bold contrarian calls.
- Several argue good evaluation should weight falsifiability, non‑triviality, and how off‑consensus a prediction was at the time.
- Prediction markets, calibration training, and explicit probabilistic forecasts are mentioned as more principled alternatives.
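As an illustration of what explicit probabilistic forecasts buy you, a Brier score is one standard way to grade them; the example below is a self‑contained sketch, not anything from the project.

```python
# The Brier score rewards calibrated probabilistic forecasts: each forecast is
# (stated probability, actual outcome as 0/1). Lower is better; always saying
# 50% scores 0.25.
def brier_score(forecasts):
    """forecasts: list of (probability, outcome) pairs, probability in [0, 1]."""
    return sum((p - outcome) ** 2 for p, outcome in forecasts) / len(forecasts)

# A confident, correct forecaster beats a hedger:
confident = [(0.9, 1), (0.1, 0), (0.8, 1)]   # score 0.02
hedger = [(0.5, 1), (0.5, 0), (0.5, 1)]      # score 0.25
assert brier_score(confident) < brier_score(hedger)
```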
Reputation systems and scoring users
- Some are excited by the idea of long‑term accuracy scores per user and of weighting upvotes by forecaster track record (see the sketch after this list), potentially improving discussion quality.
- Others warn this would shrink communities, intensify echo chambers, and incentivize ultra‑safe takes.
- There are comparisons to older systems (Slashdot meta‑moderation, Reddit tools, “superforecasters”), and suggestions to focus on grading atomic facts or explicit predictions instead of free‑form commentary.
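A minimal sketch of the “weight upvotes by forecaster quality” idea, assuming each voter carries a hypothetical accuracy score in [0, 1] derived from previously graded predictions; the names and numbers are purely illustrative.

```python
# Weight each vote by the voter's past forecasting accuracy instead of
# counting all votes equally. Voters with no history count as average (0.5).
def weighted_score(votes, accuracy):
    """votes: list of (voter, +1/-1); accuracy: dict of voter -> track record."""
    default = 0.5
    return sum(direction * accuracy.get(voter, default) for voter, direction in votes)

votes = [("alice", +1), ("bob", +1), ("carol", -1)]
accuracy = {"alice": 0.9, "bob": 0.4}   # carol has no graded history
print(weighted_score(votes, accuracy))  # 0.9 + 0.4 - 0.5 = 0.8
```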
HN, archives, and meta
- Commenters praise HN as a “good web citizen”: stable URLs, public archives, and tools like thread replayers make this kind of retrospective possible.
- There’s discussion of timestamp manipulation via the “second chance pool” and whether that misrepresents chronology.
- Some note HN’s tendency toward meta‑obsession, while moderators acknowledge “meta as catnip” but treat this thread as an exception.