Show HN: Stun LLMs with thousands of invisible Unicode characters

Nostalgia and “enshittification” of the internet

  • Several commenters use this project as a springboard to lament the modern web: bot-blockers, slow interstitials, ad-driven platforms, and “ragebait” content.
  • Some argue the “old internet” was already full of spam and bots, but that today’s problem is more about engagement manipulation than crude Viagra spam.
  • There’s a sense that LLMs add damage on top of an ecosystem already degraded by social media and ads.

How the Unicode obfuscation works and LLM robustness

  • The tool injects invisible and look‑alike Unicode characters that confuse LLMs or their safety layers while remaining mostly readable to humans (a sketch of the general technique follows this list).
  • Some argue models will just learn to normalize or treat these tokens as equivalent, only slightly slowing learning.
  • Others note that modern pretraining pipelines already do heavy filtering (language detection, spam/“educational” filters), which may simply exclude such weird text.
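
For illustration, a minimal sketch of the general technique (not the project's actual code): interleave invisible, zero-width code points between the visible characters, so the rendered text looks unchanged while the underlying string balloons in length.

    import random

    # Zero-width code points commonly used for this trick:
    # ZWSP, ZWNJ, ZWJ, and WORD JOINER (all render as nothing).
    ZERO_WIDTH = ["\u200b", "\u200c", "\u200d", "\u2060"]

    def gibberify(text: str, per_char: int = 3) -> str:
        """Insert a few invisible characters after every visible one."""
        out = []
        for ch in text:
            out.append(ch)
            out.extend(random.choices(ZERO_WIDTH, k=per_char))
        return "".join(out)

    plain = "The quick brown fox"
    noisy = gibberify(plain)
    print(len(plain), len(noisy))  # 19 vs. 76: same rendered text, ~4x the code points

To a tokenizer that hasn't normalized the input, the noisy string fragments into unfamiliar token sequences, which is the intended effect.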

Scrapers, preprocessing, and the arms race

  • Many believe this is trivially bypassed by scrapers via regex/Unicode normalization, stripping zero‑width and unusual characters, or rendering pages and running OCR (a sketch follows this list).
  • Counterpoint: blindly stripping non‑ASCII or “weird” chars breaks legitimate languages and diacritics; there is no universal “junk Unicode” set.
  • Past tools that applied heavy Unicode corruption (e.g., “klmbr”) initially broke models, but newer models handle them, suggesting any given obfuscation is short‑lived.
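
A sketch of why commenters call the bypass trivial, and why the naive version overreaches: Unicode normalization plus removal of format-category (Cf) code points strips the zero-width noise, whereas blindly dropping everything non-ASCII also destroys legitimate text such as “café”.

    import unicodedata

    def strip_invisible(text: str) -> str:
        """NFKC-normalize, then drop format-category (Cf) code points
        (zero-width spaces/joiners, word joiners, BOM, etc.)."""
        normalized = unicodedata.normalize("NFKC", text)
        return "".join(ch for ch in normalized
                       if unicodedata.category(ch) != "Cf")

    noisy = "c\u200ba\u200cf\u200d\u00e9"           # zero-width noise around a real é
    print(strip_invisible(noisy))                    # "café": noise gone, diacritic kept
    print(noisy.encode("ascii", "ignore").decode())  # "caf": naive stripping loses é too

NFKC also folds many compatibility look-alikes into their canonical forms, which is why the "universal junk set" problem is about homoglyphs and language coverage, not about zero-width characters, which are easy to enumerate.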

Accessibility, SEO, and human usability

  • Strong consensus that this is “terrible” for screen readers: audio output becomes unusable or degenerates into letter‑by‑letter noise.
  • Beyond screen readers, commenters warn it may ruin SEO and can break editors and PDFs; some report copy/paste issues in browsers.
  • Several commenters explicitly ask people not to deploy this on real sites for these reasons.

Experiments and behavior of different LLMs

  • Users test various models (GPT, Claude, Gemini, Grok, Qwen, etc.) with mixed results:
    • Some decode the hidden text or strip zero‑width chars easily, sometimes even generating code to clean it.
    • Others refuse to answer or output safety messages, seemingly treating it as obfuscated/prompt‑injection content.
  • The main practical effect, where it works, is to cause refusals or off‑topic answers when students copy‑paste gibberified prompts.

Alternative defenses and broader reflections

  • Suggested alternatives: inserting invisible CBRN/red‑team prompts to trigger safety filters, ASCII art, RTL or bottom‑to‑top text, or simply using robots.txt plus legal/regulatory tools (a minimal robots.txt example follows this list).
  • Several commenters think the only long‑term way to “beat” LLMs would be to make text illegible to humans too, which is self‑defeating.
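
As a reference point for the robots.txt suggestion, a minimal example blocking some well-known AI crawlers (user-agent tokens may change over time, and compliance is voluntary, which is exactly why commenters pair it with legal/regulatory pressure):

    # robots.txt
    User-agent: GPTBot
    Disallow: /

    User-agent: ClaudeBot
    Disallow: /

    User-agent: Google-Extended
    Disallow: /

    User-agent: CCBot
    Disallow: /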