Show HN: Stun LLMs with thousands of invisible Unicode characters
Nostalgia and “enshittification” of the internet
- Several commenters use this project as a springboard to lament the modern web: bot-blockers, slow interstitials, ad-driven platforms, and “ragebait” content.
- Some argue the “old internet” was already full of spam and bots, but that today’s problem is more about engagement manipulation than crude Viagra spam.
- There’s a sense that LLMs add damage on top of an ecosystem already degraded by social media and ads.
How the Unicode obfuscation works and LLM robustness
- The tool injects invisible and look‑alike Unicode characters to confuse LLMs or their safety layers, while remaining mostly readable to humans (a minimal sketch of the general technique follows this list).
- Some argue models will just learn to normalize or treat these tokens as equivalent, only slightly slowing learning.
- Others note that modern pretraining pipelines already do heavy filtering (language detection, spam filters, “educational value” classifiers), which may simply exclude such weird text from the training set.
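The thread doesn’t quote the tool’s actual payload; below is a minimal sketch of the general technique, assuming zero‑width format characters plus a few Latin‑to‑Cyrillic homoglyphs (all character choices and parameters here are illustrative, not the project’s real ones).

```python
import random

# Illustrative character sets -- not necessarily the tool's actual payload.
ZERO_WIDTH = [
    "\u200b",  # ZERO WIDTH SPACE
    "\u200c",  # ZERO WIDTH NON-JOINER
    "\u200d",  # ZERO WIDTH JOINER
    "\u2060",  # WORD JOINER
]
# Latin -> Cyrillic homoglyphs that render almost identically in most fonts.
HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e", "p": "\u0440", "c": "\u0441"}

def gibberify(text: str, zw_per_char: int = 3, homoglyph_rate: float = 0.3) -> str:
    """Interleave invisible characters and swap in look-alike letters.

    The result looks the same on screen but tokenizes very differently.
    """
    out = []
    for ch in text:
        if ch in HOMOGLYPHS and random.random() < homoglyph_rate:
            ch = HOMOGLYPHS[ch]
        out.append(ch)
        out.extend(random.choices(ZERO_WIDTH, k=zw_per_char))  # invisible padding
    return "".join(out)

plain = "The quick brown fox"
obfuscated = gibberify(plain)
print(len(plain), len(obfuscated))  # 19 vs. 76 code points (19 * (1 + 3))
```

The output renders almost identically to the input while its code‑point count balloons, which also hints at why commenters report broken copy/paste and editor behavior.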
Scrapers, preprocessing, and the arms race
- Many believe scrapers can trivially bypass this via Unicode normalization, regex removal of zero‑width and unusual characters, or by rendering pages and running OCR (see the cleanup sketch after this list).
- Counterpoint: blindly stripping non‑ASCII or “weird” chars breaks legitimate languages and diacritics; there is no universal “junk Unicode” set.
- Past tools that applied heavy Unicode corruption (e.g., “klmbr”) initially broke models, but newer models handle them, suggesting any given obfuscation scheme is short‑lived.
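A sketch of the cleanup pass commenters describe, shown next to the blunt non‑ASCII strip the counterpoint warns against (function names and the sample string are hypothetical):

```python
import re
import unicodedata

def targeted_clean(text: str) -> str:
    """Drop invisible format characters (category Cf), then NFKC-normalize.

    Cf covers zero-width spaces/joiners and directional marks; NFKC folds
    compatibility look-alikes (fullwidth forms, ligatures) but NOT cross-script
    homoglyphs like Cyrillic 'а', which need a dedicated confusables table.
    """
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Cf")
    return unicodedata.normalize("NFKC", text)

def naive_clean(text: str) -> str:
    """The blunt approach the counterpoint warns against: strip all non-ASCII."""
    return re.sub(r"[^\x00-\x7f]", "", text)

sample = "caf\u00e9 na\u00efve\u200b\u200c plan"
print(targeted_clean(sample))  # 'café naïve plan' -- diacritics survive
print(naive_clean(sample))     # 'caf nave plan'  -- legitimate letters destroyed
```

NFKC folds compatibility look‑alikes such as fullwidth letters and ligatures but leaves cross‑script homoglyphs untouched; folding those requires a dedicated confusables table, which is part of why no universal “junk Unicode” set exists.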
Accessibility, SEO, and human usability
- Strong consensus that this is “terrible” for screen readers: audio output becomes unusable or degenerates into letter‑by‑letter noise.
- Concerns that it would harm accessibility, may ruin SEO, and can even break editors and PDFs; some report copy/paste issues in browsers.
- Several commenters explicitly ask people not to deploy this on real sites for these reasons.
Experiments and behavior of different LLMs
- Users test various models (GPT, Claude, Gemini, Grok, Qwen, etc.), with mixed results:
  - Some decode the hidden text or strip the zero‑width characters easily, sometimes even generating code to clean it (a diagnostic sketch follows this list).
  - Others refuse to answer or output safety messages, seemingly treating the input as obfuscated or prompt‑injection content.
- The main practical effect, where it works, is to cause refusals or off‑topic answers when students copy‑paste gibberified prompts.
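The thread’s actual cleanup snippets aren’t reproduced here; below is a hypothetical diagnostic in the same spirit, tallying the invisible format characters hiding in a pasted prompt:

```python
import unicodedata
from collections import Counter

def audit_invisibles(text: str) -> Counter:
    """Tally characters that render as nothing: Unicode category Cf (format)."""
    return Counter(
        unicodedata.name(ch, f"U+{ord(ch):04X}")  # fall back to the code point
        for ch in text
        if unicodedata.category(ch) == "Cf"
    )

# Hypothetical student-pasted prompt with hidden padding.
suspect = "Wh\u200bat is\u200d the cap\u200cital of Fr\u200bance?"
for name, count in audit_invisibles(suspect).most_common():
    print(f"{count:>3}  {name}")
#   2  ZERO WIDTH SPACE
#   1  ZERO WIDTH JOINER
#   1  ZERO WIDTH NON-JOINER
```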
Alternative defenses and broader reflections
- Suggested alternatives: inserting invisible CBRN/red‑team prompts to trigger safety filters, ASCII art, RTL or bottom‑to‑top text, or just using robots.txt plus legal/regulatory tools (a minimal robots.txt sketch follows this list).
- Several commenters think the only long‑term way to “beat” LLMs would be to make text illegible to humans too, which is self‑defeating.
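For the robots.txt route, a minimal sketch blocking a few well‑known AI crawler user agents (the list is illustrative and changes over time, and it only deters crawlers that choose to honor it):

```
# Compliance is voluntary: this only deters crawlers that honor robots.txt.
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```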