Prompt Injection as Role Confusion

Role confusion and the nature of prompt injection

  • Many commenters agree the paper formalizes something practitioners already knew: LLMs don’t have true “roles”; everything is just tokens in one stream.
  • The model infers “who is speaking” from style and position, not from secure tags. That makes “talking like the system” a powerful jailbreak tactic.
  • Several compare this to social engineering: if you speak in the tone of authority, the model treats you as such.

Ideas for more robust role signaling

  • Multiple people suggest “coloring” tokens: adding extra embedding dimensions or modifier vectors encoding role (system/user/tool/CoT) per token, analogous to positional embeddings.
  • Variants include:
    • Duplicating tokens for internal chain-of-thought.
    • Instructional segment embeddings and per-token source IDs.
    • Using explicit side-channel metadata rather than in-band tags.
  • Others note challenges: needing labeled training data, topic/source entanglement, and risk of degrading model performance.

Security, sandboxing, and limits of LLMs as secure components

  • Strong consensus: current LLMs provide no real security boundary; roles in APIs are formatting, not authorization.
  • Some argue any system that lets an LLM take irreversible actions is inherently unsafe; others say constrained, sandboxed uses (e.g., classification, spam detection) are acceptable.
  • Discussion of session isolation: models are stateless in principle, but context/KV-cache handling and web-session bugs can still leak data.
  • Persistent agent memory is highlighted as especially risky: injected instructions can be laundered into “self-authored” notes and become highly trusted later.

Why classic sanitization and tags fall short

  • Commenters ask why not just strip or harden tags. Responses:
    • Modern APIs often already use unforgeable special tokens for role boundaries.
    • The core issue is that models infer roles from style; even without tags, user text that looks like chain-of-thought or system policy can override instructions.
  • Static benchmarks for prompt injection are criticized; human red-teaming that adapts attacks is far more effective.

Broader reflections and skepticism

  • Some think the “theory” label is overstated; others say it earns that term by making testable predictions and suggesting research directions.
  • There is recurring worry that making models “secure” via training tends to sharply reduce usefulness, yet still doesn’t fully prevent jailbreaks.