2025-02-12

Smuggling arbitrary data through an emoji

Core technique and behavior

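Variation selector code points appended to a base character (often an emoji, but plain text characters work too) can encode arbitrary bytes while the whole sequence renders as a single visible glyph. Unicode defines 256 variation selectors, so each byte value can map to exactly one selector.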
The hidden payload survives copy‑paste across many apps and websites, even when the emoji itself is stripped or normalized away.
Data can be nested (e.g., UTF‑8 inside the payload; “turtles all the way down”), and even emoji can be encoded inside emoji.
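
The byte-to-selector idea can be sketched in Python. This is a minimal illustration under stated assumptions, not necessarily the exact scheme the original post uses: it assumes bytes 0-15 map to U+FE00..U+FE0F and bytes 16-255 map to U+E0100..U+E01EF, and the function names are hypothetical.

```python
from typing import Optional

# Assumed mapping: byte 0-15 -> VS1-VS16 (U+FE00..U+FE0F),
# byte 16-255 -> VS17-VS256 (U+E0100..U+E01EF).

def byte_to_selector(b: int) -> str:
    """Return the variation selector standing for byte value b (0-255)."""
    return chr(0xFE00 + b) if b < 16 else chr(0xE0100 + b - 16)

def selector_to_byte(ch: str) -> Optional[int]:
    """Return the byte a selector stands for, or None for any other character."""
    cp = ord(ch)
    if 0xFE00 <= cp <= 0xFE0F:
        return cp - 0xFE00
    if 0xE0100 <= cp <= 0xE01EF:
        return cp - 0xE0100 + 16
    return None

def encode(base: str, payload: bytes) -> str:
    """Append one variation selector per payload byte to the base character."""
    return base + "".join(byte_to_selector(b) for b in payload)

def decode(text: str) -> bytes:
    """Collect the bytes of every variation selector found in text."""
    found = (selector_to_byte(ch) for ch in text)
    return bytes(b for b in found if b is not None)

stego = encode("😊", "hello".encode("utf-8"))
print(len(stego))     # 6 code points: 1 emoji + 5 selectors
print(decode(stego))  # b'hello'
```

Nesting follows directly: because the payload is just bytes, the UTF-8 encoding of one carrier string can itself be the payload of another, for example `encode("😊", encode("🐢", b"hi").encode("utf-8"))`.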

Steganography and related tricks

Commenters relate this to classic steganography: zero‑width spaces, ZWJ/ZWNJ, Unicode tags, private‑use areas, invisible programs in source code, and image metadata chunks.
Tools like StegCloak and zws.im, and prior hacks (hidden data in GIFs, image alpha channels, or PNG/TIFF metadata) are cited as similar ideas.
Some argue private‑use characters are simpler but note they usually render visibly, unlike variation selectors.

Watermarking, fingerprinting, and tracking

Several commenters see this as a lightweight way to watermark or sign LLM outputs, short texts, articles, or quotes, or to embed user IDs, timestamps, or logprobs.
Others argue it’s trivially strippable, likely removed by pre‑processing, and inferior to sampler‑based probabilistic watermarking (biasing token choices).
Skeptics doubt any AI watermarking will be robust; proponents point to printer dot watermarks as a counterexample.
Suggested uses include leaker fingerprinting and personalized ad or link tracking.
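
To illustrate the "trivially strippable" objection, here is a minimal sketch of a pre-processing step that deletes every variation selector. The function name is hypothetical. Note the assumption baked in: it also removes legitimate selectors such as U+FE0E and U+FE0F, which request text or emoji presentation, so rendering of some emoji may change.

```python
import re

# Matches both variation selector ranges: U+FE00..U+FE0F and U+E0100..U+E01EF.
VARIATION_SELECTORS = re.compile("[\uFE00-\uFE0F\U000E0100-\U000E01EF]")

def strip_variation_selectors(text: str) -> str:
    """Remove every variation selector, and with it any payload they carry."""
    return VARIATION_SELECTORS.sub("", text)
```

A watermark built only on this channel would not survive such a filter, which is the core of the argument for sampler-based approaches.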

Security, privacy, and Unicode abuse

Concerns were raised about “visually identical” links or text carrying hidden data; experiments show that payloads appear in URL query logs, while in domain names they are constrained by punycode and percent‑encoding rules.
Examples of past Unicode abuse include right‑to‑left (RTL) overrides in filenames to disguise extensions, Trojan Source attacks, CTF challenges, and buffer overflows from multi‑byte characters.
Some foresee abuse for C2 channels, prompt injection, filter evasion, or ID tokens in an emoji; others stress this is “abuse of Unicode” and advise against real‑world deployment.
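
As a rough sketch of why such payloads show up in query logs: once a value carrying variation selectors is percent-encoded, the hidden bytes become plainly visible `%XX` sequences. The helper names below are hypothetical, and the byte-to-selector mapping is the same assumed one (bytes 0-15 to U+FE00..U+FE0F, bytes 16-255 to U+E0100..U+E01EF).

```python
from urllib.parse import quote, unquote

def to_selector(b: int) -> str:
    """Assumed mapping from a byte value to a variation selector."""
    return chr(0xFE00 + b) if b < 16 else chr(0xE0100 + b - 16)

def tag_query_value(visible: str, payload: bytes) -> str:
    """Append a hidden payload to a visible value and percent-encode it."""
    hidden = visible + "".join(to_selector(b) for b in payload)
    # safe="" percent-encodes everything outside the unreserved set.
    return quote(hidden, safe="")

tagged = tag_query_value("ok", b"id")
print(tagged)                # visible "ok" followed by %XX escapes for the selectors
print(len(unquote(tagged)))  # 4 code points: "o", "k", and two selectors
```

Each selector above U+FFFF takes four UTF-8 bytes, so the encoded form grows by twelve characters per hidden byte, which makes long payloads conspicuous in logs even though they are invisible on screen.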

Tooling, detection, and accessibility

Many editors, terminals, and web forms silently accept these characters; some truncate the text or display placeholder boxes, but “view source” often looks normal.
Workarounds include hex dumps, tokenizer tools, Unicode‑highlighting in editors, and Emacs/Vim configs that surface invisible or variation‑selector codepoints.
Unicode normalization explicitly does not strip variation selectors, so standard normalization won’t remove these payloads.
Screen readers may announce variation selectors as hex codes when navigating by character, making long payloads noisy but not obviously meaningful.
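
A small sketch of detection, which also checks the normalization point: it lists variation selectors by position and counts how many remain after each standard normalization form. The function names are hypothetical; `unicodedata.normalize` is Python's standard library call.

```python
import unicodedata

def is_variation_selector(ch: str) -> bool:
    """True for U+FE00..U+FE0F and U+E0100..U+E01EF."""
    cp = ord(ch)
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF

def find_variation_selectors(text: str) -> list:
    """Return (index, 'U+XXXX') pairs for each variation selector in text."""
    return [(i, f"U+{ord(ch):04X}")
            for i, ch in enumerate(text) if is_variation_selector(ch)]

def selectors_after_normalization(text: str) -> dict:
    """Count the variation selectors left after each normalization form."""
    return {form: len(find_variation_selectors(unicodedata.normalize(form, text)))
            for form in ("NFC", "NFD", "NFKC", "NFKD")}

sample = "A" + chr(0xFE01) + chr(0xE0158)
print(find_variation_selectors(sample))
print(selectors_after_normalization(sample))
```

For this sample, all four forms report the same count as the input, so a filter has to remove the selectors explicitly rather than rely on normalization.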

LLM behavior

Users tested multiple LLMs on decoding examples: most failed or guessed common strings like “hello,” unless given access to a programming environment.
With tools (e.g., Python/JS), some models can programmatically decode the scheme, suggesting pattern‑matching alone is insufficient for reliable decoding.

Applications and prior art

Real or proposed uses include: content source maps in CMS previews, cross‑platform message ID bridging without a DB, bypassing word filters, hidden commands in chat, and digital ID tokens in an ID‑card emoji.
A prior patent on embedding hidden Unicode content to trigger actions is mentioned; this triggers a long side discussion on software patents, “defensive” vs offensive use, and whether such patents are ethically or practically justifiable.
