Show HN: Stun LLMs with thousands of invisible Unicode characters
Nostalgia and “enshittification” of the internet
- Several commenters use this project as a springboard to lament the modern web: bot-blockers, slow interstitials, ad-driven platforms, and “ragebait” content.
- Some argue the “old internet” was already full of spam and bots, but that today’s problem is more about engagement manipulation than crude Viagra spam.
- There’s a sense that LLMs add damage on top of an ecosystem already degraded by social media and ads.
How the Unicode obfuscation works and LLM robustness
- The tool injects invisible and look‑alike Unicode characters to confuse LLMs or their safety layers, while remaining mostly readable to humans (a minimal sketch of the general technique follows this list).
- Some argue models will just learn to normalize or treat these tokens as equivalent, only slightly slowing learning.
- Others note that modern pretraining pipelines already do heavy filtering (language detection, spam filters, “educational value” classifiers), which may simply exclude such weird text from the training set.
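The thread doesn’t quote the tool’s actual payload; below is a minimal sketch of the general technique, assuming zero‑width format characters plus a few Latin‑to‑Cyrillic homoglyphs (all character choices and parameters here are illustrative, not the project’s real ones).

```python
import random

# Illustrative character sets -- not necessarily the tool's actual payload.
ZERO_WIDTH = [
    "\u200b",  # ZERO WIDTH SPACE
    "\u200c",  # ZERO WIDTH NON-JOINER
    "\u200d",  # ZERO WIDTH JOINER
    "\u2060",  # WORD JOINER
]
# Latin -> Cyrillic homoglyphs that render almost identically in most fonts.
HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e", "p": "\u0440", "c": "\u0441"}

def gibberify(text: str, zw_per_char: int = 3, homoglyph_rate: float = 0.3) -> str:
    """Interleave invisible characters and swap in look-alike letters.

    The result looks the same on screen but tokenizes very differently.
    """
    out = []
    for ch in text:
        if ch in HOMOGLYPHS and random.random() < homoglyph_rate:
            ch = HOMOGLYPHS[ch]
        out.append(ch)
        out.extend(random.choices(ZERO_WIDTH, k=zw_per_char))  # invisible padding
    return "".join(out)

plain = "The quick brown fox"
obfuscated = gibberify(plain)
print(len(plain), len(obfuscated))  # 19 vs. 76 code points (19 * (1 + 3))
```

The output renders almost identically to the input while its code‑point count balloons, which also hints at why commenters report broken copy/paste and editor behavior.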
Scrapers, preprocessing, and the arms race
- Many believe scrapers can trivially bypass this via Unicode normalization, regex removal of zero‑width and unusual characters, or by rendering pages and running OCR (see the cleanup sketch after this list).
- Counterpoint: blindly stripping non‑ASCII or “weird” chars breaks legitimate languages and diacritics; there is no universal “junk Unicode” set.
- Past tools that applied heavy Unicode corruption (e.g., “klmbr”) initially broke models, but newer models handle them, suggesting any given obfuscation scheme is short‑lived.
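A sketch of the cleanup pass commenters describe, shown next to the blunt non‑ASCII strip the counterpoint warns against (function names and the sample string are hypothetical):

```python
import re
import unicodedata

def targeted_clean(text: str) -> str:
    """Drop invisible format characters (category Cf), then NFKC-normalize.

    Cf covers zero-width spaces/joiners and directional marks; NFKC folds
    compatibility look-alikes (fullwidth forms, ligatures) but NOT cross-script
    homoglyphs like Cyrillic 'а', which need a dedicated confusables table.
    """
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Cf")
    return unicodedata.normalize("NFKC", text)

def naive_clean(text: str) -> str:
    """The blunt approach the counterpoint warns against: strip all non-ASCII."""
    return re.sub(r"[^\x00-\x7f]", "", text)

sample = "caf\u00e9 na\u00efve\u200b\u200c plan"
print(targeted_clean(sample))  # 'café naïve plan' -- diacritics survive
print(naive_clean(sample))     # 'caf nave plan'  -- legitimate letters destroyed
```

NFKC folds compatibility look‑alikes such as fullwidth letters and ligatures but leaves cross‑script homoglyphs untouched; folding those requires a dedicated confusables table, which is part of why no universal “junk Unicode” set exists.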
Accessibility, SEO, and human usability
- Strong consensus that this is “terrible” for screen readers: audio output becomes unusable or degenerates into letter‑by‑letter noise.
- Concerns that it would harm accessibility, may ruin SEO, and can even break editors and PDFs; some report copy/paste issues in browsers.
- Several commenters explicitly ask people not to deploy this on real sites for these reasons.
Experiments and behavior of different LLMs
- Users test various models (GPT, Claude, Gemini, Grok, Qwen, etc.), with mixed results:
  - Some decode the hidden text or strip the zero‑width characters easily, sometimes even generating code to clean it (a diagnostic sketch follows this list).
  - Others refuse to answer or output safety messages, seemingly treating the input as obfuscated or prompt‑injection content.
- The main practical effect, where it works, is to cause refusals or off‑topic answers when students copy‑paste gibberified prompts.
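The thread’s actual cleanup snippets aren’t reproduced here; below is a hypothetical diagnostic in the same spirit, tallying the invisible format characters hiding in a pasted prompt:

```python
import unicodedata
from collections import Counter

def audit_invisibles(text: str) -> Counter:
    """Tally characters that render as nothing: Unicode category Cf (format)."""
    return Counter(
        unicodedata.name(ch, f"U+{ord(ch):04X}")  # fall back to the code point
        for ch in text
        if unicodedata.category(ch) == "Cf"
    )

# Hypothetical student-pasted prompt with hidden padding.
suspect = "Wh\u200bat is\u200d the cap\u200cital of Fr\u200bance?"
for name, count in audit_invisibles(suspect).most_common():
    print(f"{count:>3}  {name}")
#   2  ZERO WIDTH SPACE
#   1  ZERO WIDTH JOINER
#   1  ZERO WIDTH NON-JOINER
```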
Alternative defenses and broader reflections
- Suggested alternatives: inserting invisible CBRN/red‑team prompts to trigger safety filters, ASCII art, RTL or bottom‑to‑top text, or just using robots.txt plus legal/regulatory tools (a minimal robots.txt sketch follows this list).
- Several commenters think the only long‑term way to “beat” LLMs would be to make text illegible to humans too, which is self‑defeating.
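For the robots.txt route, a minimal sketch blocking a few well‑known AI crawler user agents (the list is illustrative and changes over time, and it only deters crawlers that choose to honor it):

```
# Compliance is voluntary: this only deters crawlers that honor robots.txt.
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```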