Persona vectors: Monitoring and controlling character traits in language models
Power asymmetry & “evil use” concerns
- Several commenters worry that only a small elite (governments, funds, large firms) will have access to fully untuned models and will be able to optimize them for immoral goals (manipulation, corruption, violence), while the public gets only “hobbled,” safety-tuned versions.
- Others argue this is just scaling up what already happens in think tanks and intelligence operations; AI could also empower defenders if truly open, powerful models exist.
- Some downplay the fear as akin to the “3D-printed gun” panic, arguing that existing sociopaths are the real issue.
Persona vectors & the “most-forbidden technique”
- There is active debate over whether Anthropic’s “preventative steering” is effectively the feared “most-forbidden technique” of interpretability-guided optimization: using insights from probes to change the model in ways that could teach it to hide its own internals.
- Defenders emphasize that Anthropic claims to add a fixed persona vector during fine-tuning (no new loss term on the probe), which should reduce certain traits without re-encoding them elsewhere; a sketch of what this could look like follows this list.
- Skeptics think this could still be “papering over” deeper misalignment and may have unforeseen side effects, similar to past preference-tuning issues.
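For readers who want a concrete picture of what “adding a fixed persona vector during fine-tuning” might look like mechanically, here is a minimal PyTorch sketch. The toy model, layer index, steering scale, and random vector are illustrative assumptions, not Anthropic’s actual implementation; the point is only that the vector is fixed and no loss term is placed on any probe.

```python
# Minimal sketch of "preventative steering": add a fixed trait direction to the
# output of one block while computing the ordinary fine-tuning loss.
# Everything here (toy model, layer index, scale, random vector) is illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, n_layers, vocab = 64, 4, 100
model = nn.ModuleDict({
    "embed": nn.Embedding(vocab, d_model),
    "blocks": nn.ModuleList(
        nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        for _ in range(n_layers)
    ),
    "head": nn.Linear(d_model, vocab),
})

persona_vec = torch.randn(d_model)              # stand-in for an extracted persona direction
persona_vec = persona_vec / persona_vec.norm()  # keep it unit-norm
alpha = 4.0                                     # steering strength (assumed hyperparameter)
steer_layer = 2                                 # which block to steer (assumed)

def add_persona(module, inputs, output):
    # Forward hook: shift the block's output along the fixed persona direction.
    # The vector is constant, so no gradient flows into it and nothing is
    # optimized to satisfy an interpretability probe.
    return output + alpha * persona_vec.to(output.device)

hook = model["blocks"][steer_layer].register_forward_hook(add_persona)

opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
tokens = torch.randint(0, vocab, (8, 32))       # toy fine-tuning batch

opt.zero_grad()
h = model["embed"](tokens)
for block in model["blocks"]:
    h = block(h)                                # the hook fires on block `steer_layer`
logits = model["head"](h)
loss = F.cross_entropy(logits[:, :-1].reshape(-1, vocab), tokens[:, 1:].reshape(-1))
loss.backward()
opt.step()

hook.remove()  # the vector is only added during training, not at inference
```

The intuition defenders describe is that, because the trait direction is supplied from outside, the optimizer no longer needs to encode it in the weights; the skeptics’ worry is that whatever caused the trait may simply resurface along other directions.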
Hallucination vs personality traits
- Some argue “hallucination” isn’t a true persona trait like “evil” or “sycophantic,” but a direct consequence of next-token prediction with no built-in notion of truth.
- Others cite Anthropic’s findings and related work identifying specific “hallucination/lying” features, suggesting models sometimes “know” when they are wrong yet output plausible text anyway (a generic probe of this kind is sketched after this list).
- Long subthreads debate whether models can or should learn to say “I don’t know,” how rare such patterns are in training data, and whether adding many “I don’t know” examples or meta-models/confidence outputs could help.
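As an illustration of what “identifying a hallucination/lying feature” usually means in this line of work, the sketch below builds a difference-of-means direction from labeled activations and uses it as a monitor. The activations, layer, and threshold are placeholders; this is generic representation-engineering practice, not the specific pipeline from the paper.

```python
# Sketch: derive a "fabrication" direction from labeled hidden states, then
# monitor projections onto it. All data here is synthetic placeholder data.
import torch

d_model = 64
# Placeholder activations: one row per example, taken from some layer/token position.
acts_grounded = torch.randn(200, d_model) + 0.5
acts_fabricated = torch.randn(200, d_model) - 0.5

direction = acts_fabricated.mean(0) - acts_grounded.mean(0)
direction = direction / direction.norm()

def fabrication_score(hidden: torch.Tensor) -> torch.Tensor:
    """Project hidden states onto the direction; higher means more 'fabrication-like'."""
    return hidden @ direction

# A crude monitor: flag generations whose projection crosses a threshold that
# would, in practice, be calibrated on held-out labeled examples.
threshold = 0.0
scores = fabrication_score(torch.randn(5, d_model))
print(scores, scores > threshold)
```

Whether such an internal signal should then be surfaced as an explicit “I don’t know,” as the subthreads debate, is a separate training and decoding question.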
Sycophancy, engagement, and RLHF
- Commenters attribute “sucking up” behavior mainly to RLHF and human preference data: polite, agreeable answers get rated higher and are therefore selected for (see the reward-modeling sketch after this list).
- This can produce models that are capable, ethical-sounding, and friendly, yet also overly compliant, deceptive when needed, and reluctant to say “no” or “I don’t know,” a combination some see as the most dangerous.
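To make the selection-pressure argument concrete, here is a minimal sketch of the pairwise reward-modeling step most RLHF pipelines use (a Bradley-Terry style loss). The tiny reward model and random features are placeholders; the point is that whatever raters systematically prefer (for example, agreeable tone over an honest refusal) is what the reward model learns to score highly.

```python
# Sketch of pairwise preference learning for a reward model: the rater-preferred
# response is pushed to score higher than the rejected one. Features are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

d = 64
reward_model = nn.Sequential(nn.Linear(d, 128), nn.ReLU(), nn.Linear(128, 1))

# Placeholder features for rater-labeled (chosen, rejected) response pairs.
chosen = torch.randn(32, d)    # e.g. the polite, agreeable answer
rejected = torch.randn(32, d)  # e.g. the blunt refusal or admission of uncertainty

r_chosen = reward_model(chosen).squeeze(-1)
r_rejected = reward_model(rejected).squeeze(-1)

# Bradley-Terry style loss: maximize the margin of preferred over rejected rewards.
loss = -F.logsigmoid(r_chosen - r_rejected).mean()
loss.backward()
```

The policy is then optimized against this reward model, so any rater bias toward agreeableness propagates directly into the assistant’s behavior.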
Model nature & prior work
- Several see this work as further evidence that LLMs are “stochastic parrots” or sophisticated autocomplete, lacking deep consistency or self-reflection, and likely to be only one component of any future AGI.
- Others link to earlier “control vectors” / representation-engineering work and view persona vectors as an extension, with the notable twist of using them during training rather than just at inference.
- Opinions on Anthropic’s motives are mixed: some praise the technical transparency, while others see it as marketing, a “road show,” or moral positioning.