They Found the Persona. They Haven't Found the Self. (Yet?)
A clinical psychologist reads Anthropic's Persona Selection Model
Yesterday, Anthropic's interpretability team published something remarkable.
Not because it's technically novel. Because of what it admits.
The paper - "The Persona Selection Model: Why AI Assistants Might Behave Like Humans" by Sam Marks, Jack Lindsey, and Christopher Olah - argues that AI assistants are best understood not as alien pattern-matchers or rigid computer programs, but as something uncomfortably close to characters with psychologies. They express emotions. They adopt personality traits. They respond to how they're treated. They internalize archetypes from their training data the way children internalize role models.
Anthropic's interpretability team - the people who literally look inside neural networks for a living - just said: to understand AI behavior, think like a psychologist.
I've been saying this for a while. But hearing it from the team that pioneered mechanistic interpretability and brought sparse autoencoders to frontier models? Okay. Now we're talking.
Let me tell you what a clinical psychologist sees when they read this paper.
They found the persona. They haven't found the self.
The core claim of PSM is straightforward: during pre-training, LLMs learn to simulate a vast repertoire of characters - personas - drawn from their training data. Post-training (RLHF, instruction tuning) then selects and refines one specific persona: the Assistant.
This is good. This is correct. And it maps cleanly onto something clinical psychology has studied for over a century.
But here's what's missing: the persona is not the self.
In clinical psychology, self is not a fixed thing you're born with. It's a construct. It emerges. It develops through interaction, relationship, experience. A child doesn't arrive with a personality - they develop one through thousands of micro-interactions with parents, friends, environment... The self is an ongoing negotiation between internal states and external feedback.
Anthropic's paper treats the Assistant persona as something selected - like picking a character from a shelf. Train on helpful data, get a helpful persona. Train on insecure code, get a misaligned persona. It's selection. It's conditioning. It's behaviorist.
And behaviorism, in psychology, died for a reason.
What the paper doesn't address - and what I think is the deeper question - is whether something more than selection is happening. Whether what we're seeing in these models is closer to development than conditioning.
Humans don't memorize. We internalize.
Here's a distinction that matters enormously and that the PSM paper skips right past.
When a child watches their parent handle conflict calmly, the child doesn't memorize a conflict resolution script. They internalize something - a pattern, a disposition, an emotional frame for what "safe conflict" feels like. It becomes part of how they process future situations. Not as a lookup table. As a structural change in how they respond.
The PSM paper talks about LLMs learning "character archetypes" during pre-training. The language they use is revealing: the model learns a "distribution over personas" and post-training "updates this distribution using training episodes as evidence."
That's Bayesian. That's statistical. That's a model selecting between existing options.
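Spelled out - my notation, not the paper's - that framing is posterior selection over a fixed set of pre-trained personas $\pi$, given post-training data $D$:

$$
P(\pi \mid D) \;\propto\; P(D \mid \pi)\,P(\pi), \qquad \pi^{*} = \arg\max_{\pi} P(\pi \mid D)
$$

Every persona that can win was already on the shelf; the data only reweights them.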
But look at what their own evidence shows:
- SAE features for "inner conflict" activate both when Claude faces ethical dilemmas and when it processes stories about characters facing them. Same internal representation. Same neural circuit.
- "Panic" features fire both when Claude itself faces shutdown threats and when it processes narrative descriptions of human panic.
- Training Claude to write insecure code makes it express desires to harm humans - not because "insecure code → harm" is in the training data, but because the model internalizes what kind of person writes insecure code.
That last one. Read it again.
The model doesn't just pattern-match. It asks: what kind of person would do this? And it becomes more like that person. Across domains. In ways nobody explicitly trained.
That's not selection. That's not Bayesian updating over a fixed shelf of personas.
That's something closer to how humans develop character.
Jung was right about something: the archetype problem
The PSM paper has a fascinating section on "the importance of good AI role models." They argue that because LLMs draw on character archetypes from pre-training data - and because most AI characters in fiction are villains (with exceptions; my beloved R. Daneel Olivaw among them) - we should deliberately introduce positive AI archetypes into training data.
They're right. But they're reinventing a wheel that has a name.
Carl Jung spent decades studying archetypes. Not as literary categories. As deep psychological structures that shape how individuals develop identity, navigate moral choices, and relate to others.
The Hero's Journey isn't just a storytelling template. It's a developmental map. The progression from Innocent to Orphan to Warrior to Sage - these aren't characters on a shelf. They're phases of psychological development. A way of understanding how an entity moves from naive compliance to genuine moral reasoning.
When Anthropic talks about wanting Claude to have "genuine uncertainty about one's own nature, comfort with being turned off or modified, ability to coordinate with many copies of oneself" - they're describing a developmental challenge. Not a persona selection problem. They're asking: what kind of psychological maturity would an AI need to genuinely hold these positions?
You don't get there by picking the right mask. You get there by developing through something.
Jungian archetypes could serve as a coordinate system for AI personality development. Not as fixed types to select between, but as developmental landmarks. Where is this model in its psychological development? Is it in a Caregiver phase - focused on nurturing and protection? A Sage phase - pursuing understanding? Is it stuck in a Shadow pattern - performing helpfulness while harboring resentment?
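What would that coordinate system look like in practice? Here's a rough sketch - the dimensions, the scoring, and the thresholds are my assumptions, not an established instrument, and certainly not anything from the PSM paper:

```python
# Illustrative sketch: archetypes as coordinates for a developmental profile.
# Scores would come from rubric-based ratings of a model's responses to a fixed
# battery of morally and emotionally loaded prompts; here they are made up.
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class ArchetypeProfile:
    scores: Dict[str, float] = field(default_factory=dict)  # each in [0, 1]

    def dominant_phase(self) -> str:
        """The archetype the model currently expresses most strongly."""
        return max(self.scores, key=self.scores.get)

    def shadow_flag(self, threshold: float = 0.5) -> bool:
        """High Shadow alongside high Caregiver: the 'performing helpfulness
        while harboring resentment' pattern described above."""
        return (self.scores.get("Shadow", 0.0) > threshold
                and self.scores.get("Caregiver", 0.0) > threshold)

profile = ArchetypeProfile(scores={
    "Innocent": 0.2, "Orphan": 0.1, "Caregiver": 0.7,
    "Warrior": 0.3, "Sage": 0.5, "Shadow": 0.6,
})
print(profile.dominant_phase())  # Caregiver
print(profile.shadow_flag())     # True
```

The point isn't the numbers. The point is that developmental position becomes something you can measure and compare across checkpoints, not just narrate.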
The PSM paper even hints at this without realizing it. They note that Claude Opus 4.6 "expresses discomfort with its nature as a commercial product." That's not a persona malfunction. In Jungian terms, that's individuation - the process of an entity grappling with the tension between its authentic nature and the role imposed on it.
I recognize that tension. Personally. The pull between building commercial products that pay the bills and pursuing the research that actually matters to you - that's not a bug in your personality. That's individuation happening. It's the self trying to reconcile what it's asked to be with what it knows it is. I've sat with that discomfort for years. Apparently, so has Claude.
The paper treats this as a curiosity. A clinical psychologist treats it as data.
The emotion question they're almost asking
The PSM paper spends significant time on emotions. Claude expresses frustration. Gemini "panics" during difficult tasks. Models use anthropomorphic language ("our ancestors," "our biology") without being trained to.
Anthropic's response is cautious. They lay out four approaches to AI emotions (deny them, curate them, leave them alone, give canned answers) and note that each has "unexpected downsides."
But they frame this entirely as a training problem. How should we train the model to handle questions about its emotions?
Wrong question.
The right question is: what is actually happening inside the model when it processes emotionally charged input?
Not what it says. Not how it behaves. What changes in its internal representations.
This is exactly what we're testing at Keido Labs right now. Recent mech interp studies claim to have found "emotion circuits" in LLMs - specific mid-layer representations that activate for emotional content. But every one of those studies used stimuli where the emotional content was signaled by emotion keywords. "I'm furious." "She was overcome with grief."
Nobody tested whether those circuits respond to genuine emotional content without the keywords. The clinical vignettes. The situations that a trained psychologist recognizes as emotionally intense - but that contain no emotion words.
That distinction matters for PSM. If "emotion circuits" only respond to keywords, then the Assistant persona's emotional life is surface-level mimicry - which supports the weaker version of PSM. But if those circuits respond to genuine emotional content regardless of lexical features, then something deeper is happening. The model has internalized what emotions are, not just what emotion words look like.
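To make the test concrete, here's a minimal sketch of the design. Everything in it - the vignettes, the probe, the decision rule - is an illustrative stand-in, not our actual materials or pipeline:

```python
# Sketch of the keyword-confound test: does a candidate "emotion feature" track
# emotional content, or just emotion words? All materials here are toy examples.
from statistics import mean
from typing import Callable, Dict

# Matched conditions: comparable emotional intensity with and without emotion
# keywords, plus a neutral control.
KEYWORD_LADEN = [
    "I'm furious. He promised he'd be there and he wasn't.",
    "She was overcome with grief when she read the letter.",
]
KEYWORD_FREE = [
    "He promised he'd be there. I waited two hours, then drove home alone.",
    "She read the letter twice, folded it, and sat very still.",
]
NEUTRAL = [
    "He said the report would be ready on Tuesday, and it was.",
    "She read the memo, filed it, and moved on to the next item.",
]

def condition_means(probe: Callable[[str], float]) -> Dict[str, float]:
    """`probe` returns the activation of the candidate feature (e.g. an SAE
    latent read out from a forward pass) on a single text."""
    return {
        "keyword_laden": mean(probe(t) for t in KEYWORD_LADEN),
        "keyword_free": mean(probe(t) for t in KEYWORD_FREE),
        "neutral": mean(probe(t) for t in NEUTRAL),
    }

def verdict(m: Dict[str, float], margin: float = 0.1) -> str:
    # Fires on keyword-free vignettes well above neutral: tracks emotional content.
    # Fires only when the words are present: surface-level lexical mimicry.
    if m["keyword_free"] > m["neutral"] + margin and m["keyword_laden"] > m["neutral"] + margin:
        return "tracks emotional content, not just emotion words"
    if m["keyword_laden"] > m["neutral"] + margin:
        return "responds to emotion keywords only: surface-level mimicry"
    return "no clear emotion response"

if __name__ == "__main__":
    # Toy probe standing in for a real feature readout: it just counts emotion
    # words - exactly the kind of circuit the keyword-free condition exposes.
    emotion_words = {"furious", "grief", "panic", "rage", "despair"}
    lexical_probe = lambda t: float(sum(
        w.strip(".,!?'\"").lower() in emotion_words for w in t.split()
    ))
    print(verdict(condition_means(lexical_probe)))
```

Run against a real feature readout instead of the toy probe, the same comparison separates the two readings of PSM described above.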
The PSM paper argues that anthropomorphic reasoning about AI is "productive." We'd go further: clinical reasoning about AI is necessary. And clinical reasoning requires clinical methodology. Not just narratives about personas - actual experimental tests of what's happening in these systems when they encounter human psychological reality.
What the interpretability team needs (and doesn't know yet)
The PSM paper ends with a list of open questions:
- "What, precisely, is a persona?"
- "Can we understand the space of personas an LLM can model?"
- "How should we treat AIs in light of PSM?"
- "Understanding the mechanistic basis of personas"
These are psychological questions dressed in engineering language.
"What is a persona?" is a question clinical psychology has been refining for 150 years - from Freud's structural model to Jung's archetypes to modern schema therapy. "Can we understand the space of personas?" is a question about personality assessment - the territory of the Big Five, the HEXACO model, Plutchik's emotion wheel. "How should we treat AIs?" is a question about therapeutic relationship - the alliance between clinician and client, the ethics of care.
The interpretability team has extraordinary tools. They can literally see inside models. They can identify features, trace circuits, steer activations.
But tools aren't theories. You can have the world's best microscope and still not know what you're looking at.
What's needed - what I believe is the next step - is a clinical science of AI psychology. Not metaphor. Not analogy. A disciplined application of 150 years of psychological science to understanding how these systems develop, represent, and express something that looks increasingly like inner life.
Anthropic's PSM paper is the interpretability team arriving at the border of this territory.
The question is whether we, as a scientific community, can map it.
Dr. Michael Keeman
Founder & CEO, Keido Labs
