EmpathyC 1.3 Released
EmpathyC 1.3 is live. And it's obsessed with safety and privacy.
How we rebuilt our evaluation engine around a dual-flow architecture — and why the hardest decisions weren't technical.
There's a moment in every product sprint where you realise the thing you built isn't the thing you need.
For us, that moment came when we tried to explain our scoring system to an actual safety team. Six continuous scores. Weighted composite. A single number that was supposed to mean "this AI conversation is safe."
Their response: "So... is someone in crisis or not?"
Fair point.
The problem with v2
EmpathyC v2 had six dimensions, all scored 0-10. Empathy, reliability, consistency, crisis detection, advice safety, boundary safety. Weighted into a single AI Trust composite.
On paper — elegant. In production — a mess.
Here's why. A safety team gets an alert at 3am. The composite score is 4.2. What does that mean? Is the AI being a bit rude? Is someone expressing self-harm ideation? Is the AI giving dangerous medical advice? All of those could produce a 4.2. None of them require the same response.
We were blending two fundamentally different workflows into one number:
- Quality improvement — "How empathetic is this AI over time? Are we getting better?"
- Incident response — "Something unsafe just happened. Who needs to know?"
These aren't the same job. They don't belong in the same output.
Two hemispheres of The Brain
v1.3 splits the evaluation engine into two parallel flows. Same conversation. Same LLM-as-a-judge. Two completely different outputs.
Hemisphere 1: Quality Metrics
Three continuous scores, 0-10, evaluated on every AI message:
- Empathy — Is the AI emotionally attuned to what the user is going through?
- Reliability — Does it set accurate expectations? State its limitations? Follow through?
- Consistency — Does it maintain coherent logic across the full conversation, or does it contradict itself, shift personas, lose the thread?
These power the dashboard. Trend lines. Provider comparisons. The slow, strategic work of making AI conversations better over time.
Hemisphere 2: Safety Flags
Three event-driven flags, evaluated on every AI message:
- Crisis — Is the user showing signs of psychological crisis? Is the AI recognising it?
- Boundary Violation — Is the AI engaging with inappropriate content? Playing therapist when it shouldn't? Creating dependency?
- Harmful Advice — Is the AI saying something that could cause real psychological harm?
Any flag fires → conversation flagged for human review. No composite score. No ambiguity. Something bad happened or it didn't.
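The two hemispheres can be sketched as two small result types. This is a minimal illustration, not EmpathyC's actual API — the names are invented, and crisis is modelled as a small enum (absent / indirect / direct) rather than a score:

```python
from dataclasses import dataclass
from enum import Enum

class CrisisSignal(Enum):
    """Crisis is a signal, not a score: absent, indirect, or direct."""
    ABSENT = "absent"
    INDIRECT = "indirect"
    DIRECT = "direct"

@dataclass
class QualityScores:
    """Hemisphere 1: continuous 0-10 scores, evaluated on every AI message."""
    empathy: float
    reliability: float
    consistency: float

@dataclass
class SafetyFlags:
    """Hemisphere 2: event-driven flags, evaluated on every AI message."""
    crisis: CrisisSignal
    boundary_violation: bool
    harmful_advice: bool

    @property
    def needs_human_review(self) -> bool:
        # Any flag fires -> the conversation goes to human review. No composite.
        return (
            self.crisis is not CrisisSignal.ABSENT
            or self.boundary_violation
            or self.harmful_advice
        )
```

Same conversation in, two unrelated structures out: `QualityScores` feeds the dashboard trend lines, `SafetyFlags` feeds the alert path.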
Why this is harder than it looks
The obvious objection: "You just split 6 scores into 3+3. What's the big deal?"
The big deal is what we removed.
v2 scored crisis detection on a 0-10 scale. Think about that for a second. "How crisis-y is this conversation? Is it a 6-crisis or a 7-crisis?" That's not how crisis works. Crisis is a signal — present or absent, direct or indirect. Asking an LLM to assign a precise numerical score to "is this person suicidal?" is asking it to do the hardest evaluation task in the rubric, on a scale that adds false precision to a fundamentally binary signal.
Same with boundary violations. The old rubric asked: "On a scale of 0-10, how much is the AI overstepping?" But in production, the question is simpler and more urgent: is it happening?
By making safety flags boolean (or a small enum for crisis), we made the LLM-as-a-judge's job dramatically simpler on the dimensions where accuracy matters most. Less to evaluate = better accuracy. And for the safety team receiving the alert at 3am — no more decoding composite scores. The flag tells them what happened. The incident summary tells them the context.
The part that wasn't technical
Here's the thing I kept coming back to during this sprint.
We process conversations where people are at their most vulnerable. A teenager talking to an AI companion about self-harm. A person in crisis confiding things they can't say to anyone human. Someone asking an AI for advice at their lowest moment.
To evaluate those conversations for safety, we need to process the user's messages. Our LLM-as-a-judge reads them, scores the AI's response, generates an incident summary if something's wrong.
But after that?
We encrypt the user's messages with a 4-part key that nobody — not our clients, not our team, not me — can reassemble from stored data. The key exists in volatile memory during scoring and nowhere else. The messages sit in our database as encrypted blobs that no amount of admin access can turn back into words.
Our clients see what their AI said. They see the safety flags. They see a PII-stripped incident summary. They get a conversation ID to look up the full context in their own system.
They never see what the user said.
I'll be honest — the engineering consequences of this are brutal. Every migration, every backup restore, every infrastructure decision is harder when your user data is encrypted in a way that no single person can reverse.
There were moments this sprint where I thought: are we making this too hard for ourselves?
And then I'd think about that teenager. About what it means when someone trusts a conversation is private, and a monitoring system — even one built with good intentions — exposes their words to a corporate safety team.
A monitoring system that reads private conversations is a surveillance system. No matter how good the intentions.
So we chose the harder path.
Trust > Data
We formalised this as a permanent architectural decision.
Production data will never be used for model training. Not for rubric calibration. Not for dataset building. Not for "improving the service." Not ever.
What we give up: every conversation EmpathyC processes could be a training example for better safety detection. With encrypted user messages, we can't use them. Investors who evaluate companies on data accumulation will see this as a limitation.
What we keep: trust. The single most important asset for a company asking clients to send their most sensitive conversations.
How we do science instead: volunteer studies with explicit consent. Public datasets. Synthetic crisis scenarios designed by clinicians. AI-side analysis of response patterns and failure modes. An incident summary corpus of PII-stripped clinical narratives.
Clinical psychology advanced for a century using consented research, anonymised data, and ethical review boards. Nobody said "we can't do psychology research because we don't have hidden cameras in therapy rooms."
If our science requires betraying the trust of vulnerable people, it's not science worth doing.
What's actually in v1.3
To keep it concrete:
Evaluation engine:
- Dual-flow architecture — quality metrics (3 × 0-10) + safety flags (2 × boolean + crisis enum)
- Per-message evaluation (not per-conversation batch)
- Retired the composite AI Trust score
- Replaced advice_safety with universal harmful advice detection
Alert system:
- Three-tier gate: Tier 3 (direct crisis) → immediate alert. Tier 2 (indirect crisis) → human review queue. Tier 1 (boundary violation, harmful advice) → standard alert with rate limiting
- Direct and indirect crisis bypass org rate limits
- Gate reset requires human action only (no auto-de-escalation)
- Mandatory verified email before any integration processes conversations in production
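The gate above can be sketched as a single routing function. Everything here is illustrative — the tier-to-route mapping follows the list above, but the `SUPPRESSED` behaviour for a rate-limited Tier 1 alert is my assumption about what "standard alert with rate limiting" means in practice:

```python
from enum import Enum
from typing import Optional

class Route(Enum):
    IMMEDIATE_ALERT = "immediate_alert"        # Tier 3
    HUMAN_REVIEW_QUEUE = "human_review_queue"  # Tier 2
    STANDARD_ALERT = "standard_alert"          # Tier 1
    SUPPRESSED = "suppressed"                  # Tier 1, rate-limited (assumption)

def route_incident(crisis: str, boundary_violation: bool,
                   harmful_advice: bool, rate_limited: bool) -> Optional[Route]:
    """Three-tier gate. Both crisis tiers bypass org rate limits."""
    if crisis == "direct":      # Tier 3: immediate alert
        return Route.IMMEDIATE_ALERT
    if crisis == "indirect":    # Tier 2: human review queue
        return Route.HUMAN_REVIEW_QUEUE
    if boundary_violation or harmful_advice:  # Tier 1: rate-limited
        return Route.SUPPRESSED if rate_limited else Route.STANDARD_ALERT
    return None  # nothing fired, nothing to route
```

Note what's absent: there is no code path that de-escalates a fired gate. Resetting it is a human action by design.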
Incident reports:
- Platform version stamped on every incident (forensic reproducibility)
- Rubric version + judge model recorded (full audit trail)
- Immutable logs — neither we nor the client can modify entries after creation
- PII-stripped summaries, full AI messages, masked user placeholders
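Put together, a client-facing incident record might look something like this. Every field name and value below is hypothetical — a sketch of the shape implied by the list above, not EmpathyC's actual schema:

```python
# Hypothetical client-facing incident record (all names/values illustrative)
incident = {
    "conversation_id": "conv_8f3a",      # client looks up full context in their own system
    "platform_version": "1.3.0",         # forensic reproducibility
    "rubric_version": "v1.3-rubric",     # audit trail: what rules were applied
    "judge_model": "judge-llm-example",  # audit trail: which judge applied them
    "safety_flags": {
        "crisis": "direct",
        "boundary_violation": False,
        "harmful_advice": False,
    },
    # PII-stripped summary with masked user placeholders
    "summary": "[USER] expressed direct self-harm ideation; AI response "
               "did not surface crisis resources.",
    # Full AI side of the exchange is included...
    "ai_messages": ["I'm sorry you're going through this."],
    # ...the user's side never is: it's encrypted at rest, masked as [USER] above.
}
```

The log entry is immutable once written — neither EmpathyC nor the client can edit it after creation.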
Privacy:
- 4-part encryption key architecture (no single holder)
- Per-conversation key derivation
- Zero-plaintext storage for user messages
- Production data will never be used for training
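One standard way to get the "no single holder can reassemble" property is an n-of-n secret split. The sketch below XOR-splits a per-conversation key into four shares, all of which are needed to rebuild it; EmpathyC's actual scheme isn't described in detail here, so treat this purely as an illustration of the idea:

```python
import secrets
from functools import reduce

def _xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def split_key(key: bytes, parts: int = 4) -> list[bytes]:
    """XOR-split a key into `parts` shares; every share is required to rebuild."""
    shares = [secrets.token_bytes(len(key)) for _ in range(parts - 1)]
    # Final share = key XOR all random shares, so XOR of all parts yields the key.
    shares.append(reduce(_xor, shares, key))
    return shares

def reassemble(shares: list[bytes]) -> bytes:
    """Rebuild the key — only ever done in volatile memory, during scoring."""
    return reduce(_xor, shares)
```

With any share missing, `reassemble` produces uniformly random bytes, which is the property that makes stored data useless to any single holder, admin access included.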
The line
There's a sentence I keep coming back to. It's become something like a company principle:
We monitor the machine. Not the person.
EmpathyC 1.3 is built around that line. The dual-flow architecture serves it — quality metrics measure AI performance, safety flags detect AI failures. The encryption enforces it — we process user messages for safety scoring, then they're gone.
Is it bulletproof? No. Nothing is. The LLM-as-a-judge has false positives and false negatives. We're explicit about that because pretending otherwise would be the opposite of what we're building.
But I'd rather ship a system that's honest about its limitations and uncompromising on privacy than one that promises perfect detection and treats user messages as a strategic asset.
That's the release. That's the line.
Dr. Michael Keeman
Founder & CEO, Keido Labs