2026-06-22

Prompt Injection as Role Confusion

Role confusion and the nature of prompt injection

Many commenters agree the paper formalizes something practitioners already knew: LLMs don’t have true “roles”; everything is just tokens in one stream.
The model infers “who is speaking” from style and position, not from secure tags. That makes “talking like the system” a powerful jailbreak tactic.
Several compare this to social engineering: if you speak in the tone of authority, the model treats you as such.

Ideas for more robust role signaling

Multiple people suggest “coloring” tokens: adding extra embedding dimensions or modifier vectors encoding role (system/user/tool/CoT) per token, analogous to positional embeddings.
Variants include:
- Duplicating tokens for internal chain-of-thought.
- Instructional segment embeddings and per-token source IDs.
- Using explicit side-channel metadata rather than in-band tags.
Others note challenges: needing labeled training data, topic/source entanglement, and risk of degrading model performance.

Security, sandboxing, and limits of LLMs as secure components

Strong consensus: current LLMs provide no real security boundary; roles in APIs are formatting, not authorization.
Some argue any system that lets an LLM take irreversible actions is inherently unsafe; others say constrained, sandboxed uses (e.g., classification, spam detection) are acceptable.
Discussion of session isolation: models are stateless in principle, but context/KV-cache handling and web-session bugs can still leak data.
Persistent agent memory is highlighted as especially risky: injected instructions can be laundered into “self-authored” notes and become highly trusted later.

Why classic sanitization and tags fall short

Commenters ask why not just strip or harden tags. Responses:
- Modern APIs often already use unforgeable special tokens for role boundaries.
- The core issue is that models infer roles from style; even without tags, user text that looks like chain-of-thought or system policy can override instructions.
Static benchmarks for prompt injection are criticized; human red-teaming that adapts attacks is far more effective.

Broader reflections and skepticism

Some think the “theory” label is overstated; others say it earns that term by making testable predictions and suggesting research directions.
There is recurring worry that making models “secure” via training tends to sharply reduce usefulness, yet still doesn’t fully prevent jailbreaks.

Related topics