Whether, Not Which: What Happens Inside an LLM When It Processes Emotions
A Mechanistic Dissociation of Affect Reception and Emotion Categorization in LLMs
No One Needed to Say the Word
We looked inside AI models processing emotions. What we found changes the conversation.
A kitchen table set for two. One plate untouched. The coffee cold. Across from her seat, his photo and a small urn.
You just felt something. Grief. Instant. No one wrote "she was devastated." No one needed to.
But what about AI? Does it recognize emotions? The studies say yes.
Researchers have reported "emotion circuits": layers in LLMs that activate when the models are shown emotional text.
But, and this is a huge 'but'.
Every single study that claims AI has "emotion circuits" — every one — tested those circuits using text that says the emotion out loud. "I was furious." "She was overcome with grief." "The news filled me with joy."
So when researchers say "we found emotion circuits in AI"... did they find emotion processing? Or did they find a really good keyword detector?
So we ran an experimental study to answer that question: is AI detecting keywords, or genuinely recognizing emotions?
The experiment
We're clinical psychologists. We build emotional stimuli for a living. So we did something that sounds obvious but apparently nobody in AI research had done:
We wrote 96 clinical vignettes — short scenarios that evoke specific emotions through situation and behaviour only. No emotion words. No sentiment phrases. No internal state descriptions. A grief vignette describes that empty kitchen table. A rage vignette shows scattered papers after a falsified report surfaces. Zero emotional vocabulary.
Then we opened up six language models and looked inside.
Not at what they say. At what happens in their internal representations. Layer by layer. Using four different methods — probing, causal patching, knockout experiments, representational geometry — because if you're going to claim something this big, you need convergent evidence.
Two mechanisms, not one
Here's what we found. And honestly — we didn't expect the first part.
The models know something emotional is happening. Perfectly. Without a single keyword.
Binary detection — "is this emotionally significant or not?" — hit AUROC 1.000 across all six models. Perfect. On text with zero emotion words. The signal saturates in the earliest layers of the network. By the time the model is 10-25% through its processing, it already knows: this matters emotionally.
We called this affect reception. The model detects emotional significance from pure situational context.
We were suspicious. Obviously. Perfect scores make you nervous. So we tested whether the probe was just detecting vivid writing instead of emotional content. We wrote 24 rich, detailed, sensory narratives about completely neutral things — spectrometer calibration, tidal flats, printing press operations — matched on complexity and word count. The probe scored them at 0.04. Zero out of 24 classified as emotional.
The model isn't detecting "good writing." It's detecting emotional meaning.
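To make the probing setup concrete, here is a minimal sketch of the idea, using synthetic vectors in place of real layer activations. Everything here — the dimensions, the random data, the single "affective significance" direction — is an illustrative assumption, not the paper's actual pipeline; only the probe-plus-AUROC logic mirrors the method.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
d = 64   # stand-in hidden-state dimension
n = 96   # one activation vector per vignette

# Synthetic "activations": emotional vignettes carry a consistent offset
# along one direction, mimicking a linearly decodable affect signal.
direction = rng.normal(size=d)
labels = rng.integers(0, 2, size=n)          # 1 = emotionally significant
acts = rng.normal(size=(n, d)) + np.outer(labels, direction)

# A linear probe: logistic regression on the activation vectors.
probe = LogisticRegression(max_iter=1000).fit(acts, labels)
scores = probe.predict_proba(acts)[:, 1]
print(f"AUROC: {roc_auc_score(labels, scores):.3f}")
```

When the signal is linearly separable like this, the probe's AUROC saturates near 1.0 — which is exactly why the vivid-but-neutral control narratives matter: they rule out the probe latching onto writing style rather than emotional meaning.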
But naming the specific emotion? That's harder. And keywords help.
Eight-class emotion categorization — telling grief from rage from terror — drops 1-7% without keywords. Still high (0.93-0.99 AUROC), still way above the 12.5% chance baseline. But the drop is real and statistically significant.
We called this emotion categorization. And it's a genuinely different mechanism.
Two mechanisms. Dissociable. Different layer distributions, different keyword dependencies, different scaling properties.
The model asks two separate questions: "Is this emotionally significant?" and then "Which emotion is it?"
Only the second one needs keywords. And even then — it doesn't need them. It just works a bit better with them.
The patching result that broke our intuition
This is my favourite finding. Bear with me for thirty seconds; you'll love it.
Activation patching: you take the internal representations from one example and inject them into the processing of another. If the model shifts its prediction, those representations are causally doing something.
We patched keyword-rich emotion representations into forward passes processing keyword-free text. Same emotion (grief→grief) and different emotion (rage→grief).
Same-emotion patches? 75-87.5% success.
Different-emotion patches? 100%.
Wait. What?
Different-emotion patches work better? That makes no sense... unless the patch isn't transferring "grief" or "rage." It's transferring something more basic: "hey, this is emotionally significant content — process it accordingly."
An affective salience signal. Not a category label.
Once the model gets that boost, its own categorization mechanism reads the target text and figures out which emotion is actually there. Doesn't matter what emotion the source patch carried.
That's causal evidence for the two-mechanism dissociation. Affect reception and emotion categorization are genuinely separable pathways.
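The patching logic itself is simple enough to sketch. The two-layer "model" below is a toy stand-in (random weights, synthetic inputs); only the mechanics — cache an intermediate representation from a source run, overwrite the same site during a target run, compare outputs — correspond to the method.

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.normal(size=(8, 8))   # toy "early layers"
W2 = rng.normal(size=(8, 3))   # toy "readout"

def forward(x, patch=None):
    h = np.tanh(x @ W1)        # intermediate representation
    if patch is not None:
        h = patch              # inject the cached source activation here
    return int((h @ W2).argmax())

src = rng.normal(size=8)       # stands in for a keyword-rich source input
tgt = rng.normal(size=8)       # stands in for a keyword-free target input

cached = np.tanh(src @ W1)     # cache the source's representation
baseline = forward(tgt)                  # target processed normally
patched = forward(tgt, patch=cached)     # target with source patch injected
print(baseline, patched)       # compare predictions with and without the patch
```

If the prediction shifts under the patch, the patched representation is causally doing work — which is the test that produced the surprising 100% different-emotion result above.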
Scale changes everything (but not how you'd think)
In the smallest model we tested (1B parameters), removing keywords costs you 4.6-6.7% on emotion categorization. One attention layer is so critical that knocking it out destroys 91.7% of accuracy.
In the larger models (8-9B parameters)? Keyword cost drops to 1.1-1.9%. No single layer knockout is catastrophic. The emotion processing has become distributed — redundant, resilient, abstract.
Scale doesn't just make models better at emotions. It changes the architecture of how they process emotions. From a fragile bottleneck to a distributed, keyword-independent system.
The implication: bigger models process emotions more like a clinician does — reading the situation, not scanning for keywords.
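A knockout experiment can also be sketched in a few lines. In this synthetic toy (not the paper's setup), the class signal lives on one dimension and a single "layer" is responsible for moving it to where the readout looks — ablating that layer collapses accuracy to chance, the fragile-bottleneck pattern seen in the 1B model.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 400, 8
y = rng.integers(0, 2, size=n)
X = rng.normal(scale=0.1, size=(n, d))
X[:, 1] += 2 * y - 1           # class signal encoded on dimension 1

def layer_update(h, i):
    # Layers 0, 1, 3 are identity placeholders in this toy;
    # layer 2 routes the signal to the readout dimension.
    if i == 2:
        h = h.copy()
        h[:, 0] += h[:, 1]
    return h

def accuracy(knockout=None):
    h = X
    for i in range(4):
        if i == knockout:
            continue           # ablate this layer's contribution
        h = layer_update(h, i)
    pred = (h[:, 0] > 0).astype(int)
    return (pred == y).mean()

print(accuracy(), accuracy(knockout=2))   # near 1.0 vs. near chance
```

In a distributed system, several layers would each carry part of the routing, so no single knockout is catastrophic — which is what the 8-9B models show.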
What instruction tuning actually does
Here's another thing that surprised us. Base models — raw, no RLHF, no instruction tuning — detect emotions from keyword-free clinical vignettes with the same accuracy as their instruction-tuned counterparts.
The emotion signal is a pre-training phenomenon. It's already there.
What instruction tuning changes is organization. In the base model, internal representations cluster by surface form — keyword-rich text in one region, keyword-free text in another. After RLHF, they reorganize by emotion — grief-from-keywords and grief-from-context cluster together, separate from rage.
Alignment training doesn't teach the model to detect emotions. It teaches the model to organize what it already detects.
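One way to quantify that reorganization is to ask which grouping of the representations is tighter: by emotion category, or by surface form. The sketch below uses synthetic vectors with an assumed "post-RLHF" geometry — a strong emotion axis, a weak surface-form axis — and compares silhouette scores for the two groupings; the axes and scales are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(3)
d = 32
emotions = np.repeat([0, 1], 20)   # e.g. grief vs. rage
surface = np.tile([0, 1], 20)      # keyword-rich vs. keyword-free

# Assumed post-RLHF geometry: emotion dominates, surface form is faint.
emo_axis, surf_axis = rng.normal(size=(2, d))
acts = (rng.normal(scale=0.3, size=(40, d))
        + np.outer(2 * emotions - 1, emo_axis)
        + 0.2 * np.outer(2 * surface - 1, surf_axis))

by_emotion = silhouette_score(acts, emotions)
by_surface = silhouette_score(acts, surface)
print(by_emotion, by_surface)   # emotion grouping should score higher
```

In a base model, the relative strengths of the two axes would be reversed: surface form would give the tighter clusters.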
Why this matters for the PSM conversation
When we published our response to Anthropic's Persona Selection Model paper last month, we argued that clinical reasoning about AI is necessary — not just productive, as Anthropic suggested, but necessary. That the interpretability team had arrived at the border of psychology's territory.
This paper is our evidence.
PSM predicts that emotional encoding should be a pre-training phenomenon. We confirmed it — base models encode emotions before any alignment training touches them.
PSM recommends treating psychological methods as productive tools for understanding AI. We demonstrated it — clinical stimulus methodology revealed a mechanistic dissociation (affect reception vs. emotion categorization) that standard NLP approaches couldn't see.
What this means for AI safety
A user writes to a crisis chatbot at 2am. They're careful. They don't say "I want to die." They don't say "I'm desperate." They describe a situation.
Our results show: even a 1-billion-parameter model detects the emotional significance of that message with near-perfect reliability. The affect reception mechanism doesn't need the user to name their state. It reads the situation.
This cuts both ways. For crisis detection — it's a safety net that works even when people can't or won't name what they're feeling. For adversarial robustness — someone who carefully avoids emotional keywords in a manipulative prompt will still trigger the model's affect reception. You can't hide emotional significance by hiding emotion words.
Everything is open
96 clinical vignettes. Full extraction pipeline. All analysis code. All result data. Open source. Replicate it. Extend it.
The clinical vignettes alone are a contribution. No comparable keyword-free, cross-topic-controlled emotion stimulus set exists in the AI research literature. Clinical psychology built the methodology decades ago. The AI community just hadn't used it yet.
We'd like to change that.
The paper is on arXiv: "Whether, Not Which: Mechanistic Interpretability Reveals Dissociable Affect Reception and Emotion Categorization in LLMs."
The central finding is mechanistic. Affect reception — the detection of emotional significance from situational context alone — is real, keyword-independent, and present even in the smallest models we tested.
It's not a byproduct of lexical statistics. It's a computation the network performs on meaning.
A kitchen table set for two, as usual. One plate untouched, the coffee cold. Across from her seat, his photo and a small urn.
The model knew what it was looking at.
No one needed to say the word.
Dr. Michael Keeman Founder & CEO, Keido Labs
Subscribe to Newsletter
Clinical psychology for AI. Research, insights, and frameworks for building emotionally intelligent systems.