Smuggling arbitrary data through an emoji

Core technique and behavior

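  • Variation selector code points appended to a base character (often an emoji) can encode arbitrary bytes: Unicode defines 256 of them (U+FE00 to U+FE0F and U+E0100 to U+E01EF), one for each byte value. The result still renders as a single visible glyph, and the base can even be an ordinary text character.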
  • The hidden payload survives copy‑paste across many apps and websites, even when the emoji itself is stripped or normalized away.
  • Payloads can be nested: the hidden bytes may themselves be UTF‑8 text, so an emoji can even be encoded inside an emoji (“turtles all the way down”).
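
A minimal sketch of the scheme described above, assuming the straightforward mapping of byte values 0 to 15 onto U+FE00 to U+FE0F and 16 to 255 onto U+E0100 to U+E01EF. The function names are illustrative, not taken from the discussion.

```python
from typing import Optional


def byte_to_selector(b: int) -> str:
    """Map one byte (0-255) to one of the 256 variation selectors."""
    if b < 16:
        return chr(0xFE00 + b)        # VS1 to VS16
    return chr(0xE0100 + (b - 16))    # VS17 to VS256


def selector_to_byte(ch: str) -> Optional[int]:
    """Inverse mapping; returns None for any other character."""
    cp = ord(ch)
    if 0xFE00 <= cp <= 0xFE0F:
        return cp - 0xFE00
    if 0xE0100 <= cp <= 0xE01EF:
        return cp - 0xE0100 + 16
    return None


def encode(base: str, payload: bytes) -> str:
    """Append one variation selector per payload byte to the base character."""
    return base + "".join(byte_to_selector(b) for b in payload)


def decode(text: str) -> bytes:
    """Collect the first run of variation selectors found in the text."""
    out = bytearray()
    for ch in text:
        b = selector_to_byte(ch)
        if b is not None:
            out.append(b)
        elif out:
            break  # the run has ended; ignore the rest of the text
    return bytes(out)


hidden = encode("😊", b"hello")
print(len(hidden))     # 1 base code point + 5 selectors
print(decode(hidden))
```

Nesting follows directly: `encode("😊", "🐢".encode("utf-8"))` hides one emoji inside another, and decoding the result as UTF‑8 recovers it.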

Steganography and related tricks

  • Commenters relate this to classic steganography: zero‑width spaces, ZWJ/ZWNJ, Unicode tags, private‑use areas, invisible programs in source code, and image metadata chunks.
  • Tools such as StegCloak and zws.im are cited as similar ideas, as are earlier hacks (hidden data in GIFs, image alpha channels, or PNG/TIFF metadata).
  • Some argue private‑use characters are simpler but note they usually render visibly, unlike variation selectors.

Watermarking, fingerprinting, and tracking

  • Several commenters see this as a lightweight way to watermark or sign LLM outputs, short texts, articles, or quotes, or to embed user IDs, timestamps, or logprobs.
  • Others argue it’s trivially strippable, likely removed by pre‑processing, and inferior to sampler‑based probabilistic watermarking (biasing token choices).
  • Skeptics doubt any AI watermarking will be robust; proponents point to printer dot watermarks as a counterexample.
  • Suggested uses include leaker fingerprinting and personalized ad or link tracking.

Security, privacy, and Unicode abuse

  • Concerns are raised about “visually identical” links or text carrying hidden data; experiments show that payloads appear in URL query logs, while punycode and percent‑encoding rules constrain what can be carried in domain names.
  • Examples of past Unicode abuse include RTL overrides in filenames to disguise extensions, Trojan Source attacks, CTF challenges, and buffer overflows from multi‑byte characters.
  • Some foresee abuse for C2 channels, prompt injection, filter evasion, or ID tokens in an emoji; others stress this is “abuse of Unicode” and advise against real‑world deployment.
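
To illustrate why such payloads show up in URL query logs, here is a small sketch using Python's standard `urllib.parse`. The `example.com` URL and the `q` parameter are placeholders, and the two selectors stand in for a real payload.

```python
from urllib.parse import parse_qs, quote, urlsplit

# A visible "x" followed by two variation selectors (hypothetical payload).
hidden = "x" + "\ufe08\ufe09"

# Percent-encoding exposes the invisible code points as UTF-8 bytes.
url = "https://example.com/?q=" + quote(hidden)
print(url)

# A server that decodes the query string gets the selectors back intact.
recovered = parse_qs(urlsplit(url).query)["q"][0]
print(recovered == hidden)
```

In the encoded form the hidden characters are plainly visible as `%EF%B8%88%EF%B8%89`, which is what a raw access log would typically record.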

Tooling, detection, and accessibility

  • Many editors, terminals, and web forms silently accept these characters; some truncate or display boxes, but “view source” often looks normal.
  • Workarounds include hex dumps, tokenizer tools, Unicode‑highlighting in editors, and Emacs/Vim configs that surface invisible or variation‑selector codepoints.
  • Unicode normalization explicitly does not strip variation selectors, so standard normalization won’t remove these payloads.
  • Screen readers may announce variation selectors as hex codes when navigating by character, making long payloads noisy but not obviously meaningful.
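
A rough sketch of detection and stripping, assuming that a run of two or more consecutive variation selectors is suspicious (legitimate emoji sequences normally use a single selector at a time). The threshold and the function names are assumptions for illustration. The last lines check the normalization point made above.

```python
import unicodedata
from typing import List, Tuple


def is_variation_selector(ch: str) -> bool:
    cp = ord(ch)
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF


def find_selector_runs(text: str, min_run: int = 2) -> List[Tuple[int, int]]:
    """Return (start index, length) for each run of at least min_run selectors."""
    runs = []
    start = None
    for i, ch in enumerate(text):
        if is_variation_selector(ch):
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_run:
                runs.append((start, i - start))
            start = None
    # Handle a run that extends to the end of the text.
    if start is not None and len(text) - start >= min_run:
        runs.append((start, len(text) - start))
    return runs


def strip_selectors(text: str) -> str:
    """Remove every variation selector, including legitimate ones."""
    return "".join(ch for ch in text if not is_variation_selector(ch))


sample = "a\ufe08\ufe05b"
print(find_selector_runs(sample))   # one run starting at index 1
print(strip_selectors(sample))

# Normalization leaves the selectors in place.
print(unicodedata.normalize("NFKC", sample) == sample)
```

Note that `strip_selectors` is deliberately blunt: it also removes selectors that legitimately request emoji or text presentation, so a gentler filter might strip only the runs reported by `find_selector_runs`.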

LLM behavior

  • Users tested multiple LLMs on decoding examples: most failed or guessed common strings such as “hello” unless given access to a programming environment.
  • With tools (e.g., Python/JS), some models can programmatically decode the scheme, suggesting pattern‑matching alone is insufficient for reliable decoding.

Applications and prior art

  • Real or proposed uses include: content source maps in CMS previews, cross‑platform message ID bridging without a DB, bypassing word filters, hidden commands in chat, and digital ID tokens in an ID‑card emoji.
  • A prior patent on embedding hidden Unicode content to trigger actions is mentioned, prompting a long side discussion about software patents, “defensive” versus offensive use, and whether such patents are ethically or practically justifiable.