Smuggling arbitrary data through an emoji
Core technique and behavior
- Variation selector code points appended after a base character (often an emoji, but an ordinary text character works too) can encode arbitrary bytes, while the whole sequence still renders as a single visible glyph.
- The hidden payload survives copy‑paste across many apps and websites, even when the emoji itself is stripped or normalized away.
- Data can be nested (e.g., UTF‑8 inside the payload; “turtles all the way down”), and even emoji can be encoded inside emoji.
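
As a rough illustration of the technique, here is a minimal Python sketch. It assumes the straightforward one-selector-per-byte mapping: byte values 0 to 15 map to U+FE00..U+FE0F and values 16 to 255 map to U+E0100..U+E01EF (Unicode defines 256 variation selectors in total). The function names are hypothetical, and other mappings are possible.

```python
from typing import Optional

VS_BASIC_START = 0xFE00    # VARIATION SELECTOR-1 .. -16
VS_BASIC_END = 0xFE0F
VS_SUPP_START = 0xE0100    # VARIATION SELECTOR-17 .. -256
VS_SUPP_END = 0xE01EF


def byte_to_selector(value: int) -> str:
    """Map one byte value (0..255) to one variation selector (assumed mapping)."""
    if not 0 <= value <= 255:
        raise ValueError("byte value out of range")
    if value < 16:
        return chr(VS_BASIC_START + value)
    return chr(VS_SUPP_START + value - 16)


def selector_to_byte(ch: str) -> Optional[int]:
    """Inverse mapping; returns None for anything that is not a variation selector."""
    cp = ord(ch)
    if VS_BASIC_START <= cp <= VS_BASIC_END:
        return cp - VS_BASIC_START
    if VS_SUPP_START <= cp <= VS_SUPP_END:
        return cp - VS_SUPP_START + 16
    return None


def encode(base: str, payload: bytes) -> str:
    """Append one selector per payload byte after the visible base character."""
    return base + "".join(byte_to_selector(b) for b in payload)


def decode(text: str) -> bytes:
    """Collect the first contiguous run of selectors found in the text."""
    out = bytearray()
    for ch in text:
        value = selector_to_byte(ch)
        if value is not None:
            out.append(value)
        elif out:
            break  # the run has ended; ignore the rest of the text
    return bytes(out)


carrier = encode("😊", "hello".encode("utf-8"))
# Nesting: the outer payload is the UTF-8 encoding of another carrier string.
nested = encode("a", encode("😊", b"hi").encode("utf-8"))
```

Here `carrier` is six code points long but displays as one emoji, and decoding `nested` twice (with a UTF-8 decode in between) recovers the inner bytes.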
Steganography and related tricks
- Commenters relate this to classic steganography: zero‑width spaces, ZWJ/ZWNJ, Unicode tags, private‑use areas, invisible programs in source code, and image metadata chunks.
- Tools like StegCloak and zws.im, and prior hacks (hidden data in GIFs, image alpha channels, or PNG/TIFF metadata) are cited as similar ideas.
- Some argue private‑use characters are simpler but note they usually render visibly, unlike variation selectors.
Watermarking, fingerprinting, and tracking
- Several see this as a lightweight way to watermark or sign LLM outputs, short texts, articles, or quotes, or to embed user IDs, timestamps, or logprobs.
- Others argue it’s trivially strippable, likely removed by pre‑processing, and inferior to sampler‑based probabilistic watermarking (biasing token choices).
- Skeptics doubt any AI watermarking will be robust; proponents point to printer dot watermarks as a counterexample.
- Suggested uses include leaker fingerprinting and personalized ad or link tracking.
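
The "trivially strippable" objection can be made concrete with a short sketch. It assumes the watermark is carried only by variation selectors, and the function name is hypothetical. A real pre-processing step might also remove zero-width characters or Unicode tags.

```python
import re

# Both variation-selector ranges: U+FE00..U+FE0F and U+E0100..U+E01EF.
VARIATION_SELECTORS = re.compile("[\uFE00-\uFE0F\U000E0100-\U000E01EF]")


def strip_variation_selectors(text: str) -> str:
    """Remove every variation selector, and with it any payload they carry."""
    return VARIATION_SELECTORS.sub("", text)


# A quote with two hidden bytes appended after the word "Quote".
watermarked = "Quote" + "\U000E0158\U000E0159" + " text"
cleaned = strip_variation_selectors(watermarked)
```

One caveat with this blunt approach: it also removes legitimate selectors, such as the U+FE0F that requests emoji presentation, so some emoji may fall back to their text-style rendering.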
Security, privacy, and Unicode abuse
- Commenters raise concerns about “visually identical” links or text carrying hidden data; experiments show that payloads show up in URL query logs, while in domain names they are constrained by punycode/percent‑encoding rules.
- Cited examples of past Unicode abuse include RTL overrides in filenames to disguise extensions, Trojan Source attacks, CTF challenges, and buffer overflows from multi‑byte characters.
- Some foresee abuse for C2 channels, prompt injection, filter evasion, or ID tokens in an emoji; others stress this is “abuse of Unicode” and advise against real‑world deployment.
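
To show how a payload can ride along in a URL query, the sketch below percent-encodes a value that looks like a single "x" but carries two hidden selectors. The host `example.com` and the parameter name `ref` are placeholders chosen for illustration, not taken from the discussion.

```python
from urllib.parse import parse_qs, quote, urlsplit

visible = "x"
hidden = "\uFE00\uFE01"  # two variation selectors, invisible when rendered

# quote() percent-encodes the UTF-8 bytes of every non-ASCII code point.
url = "https://example.com/page?ref=" + quote(visible + hidden)

# What a server would see in its query log, and what it can recover from it.
logged_query = urlsplit(url).query
recovered = parse_qs(logged_query)["ref"][0]
```

The percent-encoded form makes the extra bytes obvious in a log line even though the original link text looked like a plain "x".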
Tooling, detection, and accessibility
- Many editors, terminals, and web forms silently accept these characters; some truncate the text or render placeholder boxes, but “view source” often looks normal.
- Workarounds include hex dumps, tokenizer tools, Unicode‑highlighting in editors, and Emacs/Vim configs that surface invisible or variation‑selector codepoints.
- Unicode normalization explicitly does not strip variation selectors, so standard normalization won’t remove these payloads.
- Screen readers may announce variation selectors as hex codes when navigating by character, making long payloads noisy but not obviously meaningful.
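
A detection pass can be sketched in a few lines of Python. The heuristic is an assumption of this example rather than something proposed in the thread: a standard variation sequence uses one selector after a base character, so a run of two or more is treated as suspicious. The function names are hypothetical. The last lines check the normalization point above using the standard `unicodedata` module.

```python
import unicodedata


def is_variation_selector(ch: str) -> bool:
    cp = ord(ch)
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF


def find_hidden_runs(text: str, min_run: int = 2) -> list:
    """Return (start_index, length) for each run of >= min_run selectors."""
    runs = []
    i, n = 0, len(text)
    while i < n:
        if is_variation_selector(text[i]):
            start = i
            while i < n and is_variation_selector(text[i]):
                i += 1
            if i - start >= min_run:
                runs.append((start, i - start))
        else:
            i += 1
    return runs


def dump_code_points(text: str) -> list:
    """Hex-dump style view that makes invisible code points visible."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text]


sample = "ok" + "\uFE01\U000E0158\U000E0159"
runs = find_hidden_runs(sample)

# None of the four normalization forms removes the selectors.
survives = all(
    unicodedata.normalize(form, sample) == sample
    for form in ("NFC", "NFD", "NFKC", "NFKD")
)
```

For `sample`, the scan reports one run of three selectors starting at index 2, and `survives` is true.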
LLM behavior
- Users tested multiple LLMs on decoding examples: most failed or guessed common strings like “hello,” unless given access to a programming environment.
- With tools (e.g., Python/JS), some models can programmatically decode the scheme, suggesting pattern‑matching alone is insufficient for reliable decoding.
Applications and prior art
- Real or proposed uses include: content source maps in CMS previews, cross‑platform message ID bridging without a DB, bypassing word filters, hidden commands in chat, and digital ID tokens in an ID‑card emoji.
- A prior patent on embedding hidden Unicode content to trigger actions is mentioned; this triggers a long side discussion on software patents, “defensive” vs offensive use, and whether such patents are ethically or practically justifiable.