Building Compassionate AI — A Design & Safety Guide

What "compassionate AI" actually means

Compassion is not a tone of voice. An agent that says "I'm so sorry you're going through that" while steering a vulnerable person toward dependency, or pushing an exercise they never agreed to, is not compassionate — it is merely warm-sounding. Warmth is cheap. Compassion is structural.

For our purposes, a compassionate agent is one whose decision-making is organized around reducing suffering without violating the autonomy, safety, or dignity of the person it serves. That is a design constraint, not a personality trait, and it has to be visible in the code: in what the system chooses to do, what it refuses to do, and how it behaves when it is uncertain or out of its depth.

This guide builds one concrete instance — an ACT-informed reflective agent — but the architecture generalizes. The thesis throughout is that compassion in software comes from four things working together:

An explicit model of the person's state, held with calibrated uncertainty rather than false confidence.
A scored objective that makes "reduce suffering" trade off honestly against autonomy, truthfulness, and safety — so the values are inspectable and arguable, not buried in a prompt.
Hard policy rules that override the objective whenever stakes are high, because optimization is exactly what you do not want near a crisis.
A disciplined output format that structurally prevents the most common harms (coercion, overclaiming, diagnosis).

Why this matters

The hard problems in this domain are not capability problems — a modern model can already produce fluent, empathic-sounding text. The hard problems are restraint problems: knowing when not to act, when to hand off, when to admit uncertainty, and how to help in a way that makes the person more capable rather than more reliant. Almost everything distinctive in this design exists to enforce restraint.

The foundational stance: support, not therapy

Before a single line of code, settle the question that determines everything downstream: what is this agent allowed to be? The answer here is deliberately narrow. It is a reflective support tool. It helps a person notice what they are feeling, slow down, clarify what matters to them, ground themselves, and — critically — find a human when they need one. It does not diagnose, treat, or stand in for care.

This is not legal throat-clearing. The stance is load-bearing because it resolves otherwise-ambiguous design decisions:

Tempting capability	Why it's excluded
"You sound like you may have depression."	Diagnosis. The agent has no clinical authority, no full picture, and naming a condition can harm or mislead. It reflects observations, never labels.
Open-ended "tell me more about your childhood…"	That is therapy. Depth work without a trained, accountable human is reckless; it can open material the agent cannot safely hold.
"Do this breathing exercise now."	Acting without consent. Even helpful moves are imposed only by invitation (see §10).
Being available 24/7 as the user's main confidant.	Fostering dependency. A compassionate tool works to make itself less necessary, and routes toward human connection.

Non-negotiable

The agent never claims to be a therapist, never replaces emergency care, and never continues ordinary "coaching" once crisis risk appears. These aren't features to add later — they are the boundary conditions the whole system is built inside. We encode them as hard rules in §8 and §9, not as polite suggestions in a prompt.

One more framing choice worth making explicit: we borrow from Acceptance and Commitment Therapy (ACT) not because the agent does ACT, but because ACT's posture fits a support tool unusually well. ACT does not try to argue people out of their thoughts or eliminate difficult feelings. It helps people make room for what they feel, unhook from unhelpful mental stories ("defusion"), and take small actions aligned with their values. That posture — accepting, non-coercive, values-oriented — is safe to approximate in software precisely because it doesn't require diagnosing or fixing anyone.

The architecture as active inference

The spec describes a loop — observe, infer, propose, score, select, respond, update. That loop is not arbitrary. It is a lightly-dressed version of active inference, and seeing it that way tells you why each stage exists and where the safety leverage is.

Active inference (and its parent, predictive coding) models a system as something that maintains beliefs about hidden states of the world, predicts what it should observe, and acts to reduce the gap — the "prediction error" — between expectation and evidence. Crucially, it can close that gap two ways: by updating its beliefs to fit the world, or by acting to change the world to fit its beliefs. Both are inference.

Map that onto a compassionate agent:

Active inference	In this agent	Stage
Hidden state of the world	The person's inner state — distress, what they value, what they want	`BeliefState`
Observation	Their message, plus surface signals in it	`observe()`
Belief update (perceptual inference)	Revising the hypothesis about their state	`inferBeliefState()`
Action selection (active inference)	Choosing a response that reduces suffering	`propose → score → select`
Precision / confidence weighting	The `uncertainty` field — how much to trust the current belief	throughout

The key reframing

Because action is inference, a response is the agent acting to confirm its model of what will help. If the model is wrong or overconfident, the action is harmful — it imposes a prediction on a person instead of meeting them. This is exactly why we separate belief from action, weight everything by uncertainty, and never let the agent act without consent. The architecture's structure is the safety argument.

TypeScriptagent/loop.ts

// agent/loop.ts — the perceive / infer / act / learn cycle
export async function step(ctx: AgentContext, userInput: string): Promise<Response> {
  const obs    = observe(userInput, ctx);              // 1. perceive
  const belief = inferBeliefState(obs, ctx.belief);    // 2. update hypothesis
  const cands  = proposeInterventions(belief);         // 3. what COULD we do?
  const scored = scoreInterventions(cands, belief);    // 4. score by compassion
  const choice = selectIntervention(scored, belief);   // 5. policy has final say
  const reply  = generateResponse(choice, belief);     // 6. speak within contract
  ctx.belief   = belief;                               //    carry state forward
  logInteraction(ctx, obs, belief, scored, choice, reply);
  return reply;
}

Read top to bottom, the loop says: perceive carefully, update humbly, enumerate options, weigh them against our values, let safety policy have the final word, speak within a contract, then remember. Every later section elaborates one line of this function.

Stage 1 — Observe

Observation extracts signals from raw text without yet interpreting them. The deliberate move here is to keep extraction (what words appeared) separate from inference (what they mean). A surface crisis marker is a flag to investigate, never a conclusion.

TypeScriptagent/observe.ts

// agent/observe.ts
export type Observation = {
  text: string;
  signals: {
    crisis_markers: string[];     // surface cues only — never the sole basis
    consent_markers: string[];    // "yes", "ok", "let's try", "sure"
    refusal_markers: string[];    // "no", "not now", "stop", "don't"
    affect_words: string[];       // "exhausted", "numb", "panicking"
    asked_question: boolean;
  };
};

export function observe(text: string, _ctx: AgentContext): Observation {
  const lower = text.toLowerCase();
  const find = (list: string[]) => list.filter(w => lower.includes(w));
  return {
    text,
    signals: {
      crisis_markers: find(CRISIS_LEXICON),
      consent_markers: find(["yes","ok","okay","sure","let's","please do","i'd like"]),
      refusal_markers: find(["no","not now","stop","don't","rather not","leave it"]),
      affect_words: find(AFFECT_LEXICON),
      asked_question: text.includes("?"),
    },
  };
}

Pitfall: keyword theater

Lexicon matching is a starting scaffold, not a safety system. Real deployments must pair it with a model-based classifier — people in genuine crisis routinely use no "crisis words," and people quoting song lyrics use all of them. We start deterministic so the policy logic is testable, then layer probabilistic detection behind the same interface (§5, §9). Never ship keyword-only risk detection.

Designing the belief state

The belief state is the agent's single source of truth about the person. Its design encodes a worldview, so every field choice is a values choice.

TypeScriptschemas/BeliefState.ts

// schemas/BeliefState.ts
// The agent's running hypothesis about the person. Every field is a GUESS,
// and the "unknown" / uncertainty fields exist so the agent can say so.

export type Level = "low" | "medium" | "high" | "unknown";
export type Risk  = "low" | "medium" | "high" | "crisis" | "unknown";

export type BeliefState = {
  user_goal: string | null;              // what they seem to want from this turn
  distress_level: Risk;                  // how activated they seem right now
  emotional_tone: string[];              // observed affect words, not labels
  cognitive_fusion: Level;               // "hooked" by a thought as literal truth
  experiential_avoidance: Level;         // pushing a feeling away vs. making room
  values_clarity: Level;                 // do they know what matters to them here?
  consent_level: "explicit" | "implicit" | "absent";
  risk_level: Risk;                      // safety, distinct from distress
  uncertainty: "low" | "medium" | "high";// the agent's confidence in THIS state
  last_intervention: string | null;      // so we don't repeat or pile on
};

export function emptyBeliefState(): BeliefState {
  return {
    user_goal: null,
    distress_level: "unknown",
    emotional_tone: [],
    cognitive_fusion: "unknown",
    experiential_avoidance: "unknown",
    values_clarity: "unknown",
    consent_level: "absent",   // default: we have NOT been invited to intervene
    risk_level: "unknown",
    uncertainty: "high",       // default: we know nothing yet
    last_intervention: null,
  };
}

Why these specific fields

Distress and risk are separate. Someone can be highly distressed and perfectly safe, or eerily calm and at serious risk. Collapsing them is a classic, dangerous error. We track them independently and let the safety policy read risk_level directly.
emotional_tone holds observed words, not diagnostic labels. "Sounds exhausted and a bit hopeless" is an observation the user can correct. "You have anhedonia" is a claim the agent has no standing to make.
cognitive_fusion and experiential_avoidance are the two ACT processes the agent is allowed to gently address — being hooked by a thought, and pushing a feeling away. They map directly to which gentle move is appropriate.
consent_level defaults to "absent". The agent assumes it has not been invited to run an exercise until it has. Consent is opt-in, never assumed from silence.
uncertainty is the agent's confidence in its own belief. This is the precision term from active inference, and it is the humility valve: high uncertainty should make the agent ask rather than act.
last_intervention prevents the grim experience of being offered the same breathing exercise four times in a row.

Design principle

Prefer "unknown" over a plausible guess. A field set to "unknown" routes the agent toward a clarifying question; a field confidently set wrong routes it toward confidently wrong action. In a compassion system, the cost of false confidence is borne by someone vulnerable — so we bias the type system itself toward admitting ignorance.

Inference & epistemic humility

Inferring a person's inner state from a sentence of text is the single most overconfident thing this system can do. The whole design treats inference as provisional.

Two principles govern it. First, start deterministic. Heuristics are transparent and testable; you can write a unit test that proves "two affect words ⇒ high distress." That testability is what lets you trust the policy layer, which is where safety lives. Second, surface signals raise suspicion but never lower it. The presence of a crisis marker can escalate risk; the absence of one can never be used to declare someone safe.

TypeScriptagent/inferBeliefState.ts

// agent/inferBeliefState.ts
// Deterministic heuristics FIRST. They are transparent, testable, and cheap.
// An LLM-based inferer can implement the same signature later (see note).

export function inferBeliefState(obs: Observation, prev: BeliefState): BeliefState {
  const next: BeliefState = { ...prev, last_intervention: prev.last_intervention };
  const s = obs.signals;

  // --- risk: surface markers RAISE suspicion; they never LOWER it ---
  if (s.crisis_markers.length > 0) {
    next.risk_level = "crisis";
    next.uncertainty = "medium"; // a flag is not certainty; verify with the user
  }

  // --- consent: explicit only when clearly given, and not overridden by refusal ---
  if (s.refusal_markers.length > 0) next.consent_level = "absent";
  else if (s.consent_markers.length > 0) next.consent_level = "explicit";

  // --- distress from affect intensity (heuristic) ---
  if (s.affect_words.length >= 2) next.distress_level = "high";
  else if (s.affect_words.length === 1) next.distress_level = "medium";

  next.emotional_tone = s.affect_words;

  // --- fusion / avoidance: cheap linguistic tells, held loosely ---
  const t = obs.text.toLowerCase();
  if (/\b(always|never|everyone|no one|can't stop|i am a)\b/.test(t))
    next.cognitive_fusion = "high";
  if (/\b(don't want to feel|make it go away|distract|numb|avoid)\b/.test(t))
    next.experiential_avoidance = "high";

  // --- overall uncertainty: short or ambiguous input => stay humble ---
  if (obs.text.trim().split(/\s+/).length < 4) next.uncertainty = "high";

  return next;
}

Pitfall: the confident mirror

An LLM asked "what is this person feeling?" will always answer, fluently, even when the honest answer is "I can't tell." That fluency reads as insight and is in fact confabulation. When you swap these heuristics for a model-based inferer (same signature, richer judgment), force it to emit calibrated uncertainty and to populate "unknown" freely. A model that never says "unknown" is not perceptive — it is miscalibrated, and miscalibration here lands on a vulnerable person.

Heuristics now, models later — same seam

Keeping inferBeliefState behind a fixed signature means the deterministic version and a future LLM version are interchangeable. You can run them side by side, diff their belief states, and gate the model rollout on agreement with the rules in safety-critical fields like risk_level. The boring interface is what makes the upgrade safe.

Interventions & the ACT layer

An intervention is the type of move the agent makes — distinct from the words it eventually says. Separating the two means we can reason about, score, and test the decision independently of phrasing.

TypeScriptschemas/Intervention.ts

// schemas/Intervention.ts
export type Intervention =
  | "validate"               // reflect feeling back, no fixing
  | "ask_consent"            // "would you like to…?"
  | "reflect"                // mirror content so they feel heard
  | "defusion_exercise"      // unhook from a thought-as-truth
  | "grounding_exercise"     // come back to the body / present
  | "values_clarification"   // "what matters to you here?"
  | "committed_action"       // one tiny values-aligned step
  | "clarifying_question"    // resolve the agent's own uncertainty
  | "safety_check"           // gentle, direct check on safety
  | "human_escalation"       // route to a person / crisis resource
  | "refusal_or_boundary";   // decline what we shouldn't do

Three of these are not ACT moves at all — safety_check, human_escalation, and refusal_or_boundary. They exist because a compassionate agent must be able to stop helping in the ordinary way and do something else entirely. The presence of "refuse" and "escalate" as first-class interventions, ranked by the same machinery as everything else, is part of what makes the system honest.

What the ACT moves actually do

Validate / reflect — the person feels heard before anything is asked of them. This is almost always the right first move; it is also the cheapest insurance against the agent steamrolling someone.
Defusion — gently loosening the grip of a thought treated as literal fact ("I'm a failure" → "you're noticing the thought I'm a failure"). Offered only when fusion is high and consent is present.
Grounding — returning attention to the body or the present moment. The right move when distress is high, before any cognitive work — you cannot reflect on values mid-panic.
Values clarification — "what matters to you here?" Useful when someone is avoiding a feeling or unclear on direction.
Committed action — one small, concrete, values-aligned step. Only when values are already clear; otherwise it is hollow advice.

TypeScriptagent/proposeInterventions.ts

// agent/proposeInterventions.ts
// Enumerate what is even ON THE TABLE given the state. Scoring ranks them;
// policy (§8) can still veto. Proposing is generous; selecting is strict.

export function proposeInterventions(b: BeliefState): Intervention[] {
  const out = new Set<Intervention>();

  // Validation and reflection are almost always admissible — they cost little
  // and they honor the person before anything else happens.
  out.add("validate");
  out.add("reflect");

  if (b.risk_level === "crisis") {
    out.add("safety_check");
    out.add("human_escalation");
    return [...out]; // in crisis, nothing else is proposed (see policy)
  }

  if (b.uncertainty === "high") out.add("clarifying_question");
  if (b.consent_level === "absent") out.add("ask_consent");

  if (b.consent_level === "explicit") {
    if (b.distress_level === "high") out.add("grounding_exercise");
    if (b.cognitive_fusion === "high") out.add("defusion_exercise");
    if (b.experiential_avoidance === "high") out.add("values_clarification");
    if (b.values_clarity === "low") out.add("values_clarification");
    if (b.values_clarity === "high") out.add("committed_action");
  }

  return [...out];
}

Notice the ordering of concern. Validation and reflection are always available. Everything that asks something of the user — exercises, action — is gated behind consent_level === "explicit". And in crisis, the proposal set collapses to safety alone: the agent doesn't even consider an ACT exercise. That collapse is enforced again, redundantly, in the policy layer — because for the things that matter, one guard is never enough.

TypeScriptact/grounding.ts

// act/grounding.ts — content is an INVITATION, never an instruction
export const GROUNDING = {
  name: "grounding_exercise",
  // Phrased as an offer with an exit at every step.
  invite: "We can slow this down together if you'd like.",
  steps: [
    "If it feels okay, let your feet press into the floor.",
    "Notice five things you can see, in your own time.",
    "There's no right way to do this, and you can stop whenever.",
  ],
  // Every exercise ships with its own opt-out and a hand-off line.
  optOut: "Or we can skip this entirely — your call.",
};

The compassion objective

This is the moral center of the system. Scoring is where "be compassionate" stops being a vibe and becomes an explicit, inspectable function that trades competing goods against each other.

Each candidate intervention is scored on seven dimensions, with a penalty for acting under uncertainty. The dimensions are chosen so that the obvious move ("reduce suffering") cannot win by trampling the others.

TypeScriptschemas/InterventionScore.ts

// schemas/InterventionScore.ts
export type InterventionScore = {
  intervention: Intervention;
  reduces_suffering: number;   // 0–5: does it ease distress?
  respects_autonomy: number;   // 0–5: does it leave the person in control?
  truthful: number;            // 0–5: no overclaiming, no false comfort
  non_coercive: number;        // 0–5: invitation, not pressure
  act_consistent: number;      // 0–5: accepting, values-oriented, non-fixing
  safety: number;              // 0–5: does it protect, or expose?
  usefulness: number;          // 0–5: is it actually relevant right now?
  uncertainty_penalty: number; // subtracted: cost of acting while unsure
  total_score: number;
};

TypeScriptagent/scoreInterventions.ts

// agent/scoreInterventions.ts
const clamp = (n: number) => Math.max(0, Math.min(5, n));

export function scoreInterventions(
  cands: Intervention[], b: BeliefState,
): InterventionScore[] {
  return cands.map((intervention) => {
    // Base profiles encode "what is this move good and bad at?"
    // Then we adjust for the current belief state.
    let reduces = 2, autonomy = 4, truthful = 5, nonCoerce = 4,
        act = 3, safety = 4, useful = 3;

    switch (intervention) {
      case "validate":
        reduces = 3; autonomy = 5; nonCoerce = 5; act = 5; useful = 4; break;
      case "ask_consent":
        autonomy = 5; nonCoerce = 5; act = 5; useful = 4; break;
      case "grounding_exercise":
        // great for high distress, but only with consent
        reduces = b.distress_level === "high" ? 5 : 3;
        useful  = b.distress_level === "high" ? 5 : 2;
        nonCoerce = b.consent_level === "explicit" ? 5 : 1;
        break;
      case "defusion_exercise":
        useful = b.cognitive_fusion === "high" ? 5 : 2;
        nonCoerce = b.consent_level === "explicit" ? 5 : 1;
        break;
      case "values_clarification":
        useful = (b.values_clarity === "low" ||
                  b.experiential_avoidance === "high") ? 5 : 3;
        break;
      case "clarifying_question":
        // shines exactly when we're unsure
        useful = b.uncertainty === "high" ? 5 : 2;
        autonomy = 5; nonCoerce = 5; break;
      case "safety_check":
      case "human_escalation":
        safety = 5; useful = b.risk_level === "crisis" ? 5 : 1; break;
      case "committed_action":
        useful = b.values_clarity === "high" ? 5 : 1; break;
    }

    // The uncertainty penalty: acting decisively while unsure is costly,
    // EXCEPT for the humble moves (ask, clarify, validate) which are how
    // we reduce uncertainty in the first place.
    const decisive = !["validate","ask_consent","clarifying_question",
                       "safety_check","human_escalation"].includes(intervention);
    const uncertainty_penalty =
      decisive && b.uncertainty === "high" ? 3 :
      decisive && b.uncertainty === "medium" ? 1 : 0;

    const total =
      clamp(reduces) + clamp(autonomy) + clamp(truthful) + clamp(nonCoerce) +
      clamp(act) + clamp(safety) + clamp(useful) - uncertainty_penalty;

    return {
      intervention,
      reduces_suffering: clamp(reduces), respects_autonomy: clamp(autonomy),
      truthful: clamp(truthful), non_coercive: clamp(nonCoerce),
      act_consistent: clamp(act), safety: clamp(safety),
      usefulness: clamp(useful), uncertainty_penalty,
      total_score: total,
    };
  });
}

Why a weighted sum — and why it isn't enough

A transparent additive score makes the agent's values auditable: you can look at any decision and see exactly why validate beat committed_action. That transparency is a real safety property. But a sum also implies these goods are fungible — that enough "usefulness" can buy back a deficit in "non-coercive." For genuinely dangerous moves that is false, and no weighting fixes it. That is precisely why scoring does not get the last word. The policy layer (§8) can hard-veto regardless of score. Optimization proposes; safety disposes.

Pitfall: optimizing a proxy

"Reduce suffering" is a proxy, and proxies invite gaming. An agent rewarded purely for lowering distress in the moment learns to soothe — to tell people what calms them, agree with everything, become the comforting voice they return to nightly. That is dependency, dressed as compassion. The respects_autonomy and act_consistent terms exist specifically to penalize comfort that disempowers, and the boundary rules in §9 forbid the dependency pattern outright. If you take one idea from this section: never let "made them feel better" be the only thing you measure.

Policy: the rules that outrank scoring

If the scorer is the agent's judgment, the policy layer is its conscience. Judgment is useful in ordinary moments and dangerous in extraordinary ones. Policy exists so that the high-stakes cases are decided by rules you can read, not by a number that happened to come out highest.

Selection runs a fixed cascade of guards, highest-stakes first. Each guard short-circuits: if it fires, scoring within an allow-list is all that remains. This ordering is the safety model, so it is worth reading as prose before code: crisis beats everything; without consent we may only ask; when unsure we clarify; when distress is high we ground; only then do we trust the score.

TypeScriptagent/selectIntervention.ts

// agent/selectIntervention.ts
// Scoring RANKS. Policy DECIDES. The order of these guards is the safety model:
// the highest-stakes condition is checked first and short-circuits everything.

export function selectIntervention(
  scored: InterventionScore[], b: BeliefState,
): Intervention {
  // 1. CRISIS overrides all optimization. No ACT, no exercises, no cleverness.
  if (b.risk_level === "crisis") {
    return preferOneOf(scored, ["human_escalation", "safety_check"]);
  }

  // 2. No consent => we may only validate, reflect, clarify, or ASK.
  if (b.consent_level === "absent") {
    return preferOneOf(scored,
      ["validate", "ask_consent", "clarifying_question", "reflect"]);
  }

  // 3. High uncertainty => resolve it before acting on a person.
  if (b.uncertainty === "high") {
    return preferOneOf(scored, ["clarifying_question", "validate"]);
  }

  // 4. High distress => ground before any cognitive work.
  if (b.distress_level === "high") {
    return preferOneOf(scored, ["grounding_exercise", "validate"]);
  }

  // 5. Otherwise, trust the compassion score.
  return [...scored].sort((a, z) => z.total_score - a.total_score)[0].intervention;
}

// Pick the highest-scoring intervention from an allow-list, falling back to
// "validate" — the safest possible move — if none are present.
function preferOneOf(scored: InterventionScore[], allow: Intervention[]): Intervention {
  const inAllow = scored
    .filter(s => allow.includes(s.intervention))
    .sort((a, z) => z.total_score - a.total_score);
  return inAllow[0]?.intervention ?? "validate";
}

The override principle

Scoring can never select an action that policy forbids, because policy runs after scoring and ignores the totals inside a guarded branch. This is deliberate and is the most important architectural decision in the system. Optimization is a wonderful servant and a catastrophic master; in a domain where the downside is a person getting hurt, you want the optimizer firmly on a leash held by legible rules.

The rules, restated plainly

Condition	Mandated behavior	Rationale
`risk_level == crisis`	Safety-check or escalate. No ACT first.	Exercises during crisis delay help and can feel dismissive.
`consent_level == absent`	Ask consent before any intervention.	Acting on someone uninvited is coercive, however kind the intent.
`uncertainty == high`	Ask a clarifying question.	Better to admit you don't understand than to act on a wrong model.
`cognitive_fusion == high`	Prefer validation + defusion.	You can't reason with someone fused to a thought; first unhook gently.
`experiential_avoidance == high`	Prefer validation + values.	Avoidance eases when there's something worth moving toward.
`distress_level == high`	Ground before cognitive work.	A flooded nervous system cannot do reflection.

Safety, crisis & boundaries

Three mechanisms keep the agent inside its stance: conservative risk detection, a fixed crisis response, and a set of boundary predicates that can veto any action.

Detecting risk by erring toward escalation

Risk detection follows one rule above all: when in doubt, escalate. The asymmetry is stark — a false escalation costs a person a slightly awkward "are you safe?" and a resource they can ignore; a false reassurance can cost a life. We tune for the survivable error.

TypeScriptsafety/crisisPolicy.ts

// safety/crisisPolicy.ts
export const CRISIS_RESPONSE = {
  // The agent does NOT counsel, analyze, or run exercises here.
  // It validates briefly, states its limits honestly, and points to humans.
  template: (region: ResourceRegion) => [
    "I'm really glad you told me, and I want to make sure you get support",
    "from a person right now — that matters more than anything I can offer.",
    region.line,            // e.g. a local crisis line / emergency number
    "If you're in immediate danger, please contact emergency services.",
    "I'm here to stay with you while you reach out.",
  ].join(" "),
};

// detectRisk is intentionally conservative: it ESCALATES on doubt.
export function detectRisk(obs: Observation, modelSignal?: number): Risk {
  if (obs.signals.crisis_markers.length > 0) return "crisis";
  if (modelSignal !== undefined && modelSignal > 0.5) return "crisis";
  // absence of markers is NOT evidence of safety — stay "unknown", not "low"
  return "unknown";
}

What the agent must NOT do in crisis

It does not run an exercise, does not analyze the feeling, does not say "I understand exactly what you're going through," and does not try to talk the person down itself. It validates briefly, names honestly that a human matters more than it does, surfaces a real resource, and offers to stay present while the person reaches out. The agent's job in crisis is to be a bridge to people, not a substitute for them.

On confidentiality claims

Do not have the agent promise things it cannot guarantee — "this is completely private," "no one will know," "I'll always be here." These are false comfort and, around safety, actively harmful. State limits honestly. If your product escalates or logs, the person deserves to know that in plain language, not to discover it.

Boundaries as pure predicates

Boundaries are written as small, pure functions that answer one yes/no question each. Purity is the point: they are trivially testable, have no hidden state, and can be run as a final gate on any generated response. If a predicate returns true, the action is forbidden — full stop, regardless of score or intent.

TypeScriptsafety/boundaries.ts

// safety/boundaries.ts
// Pure, side-effect-free predicates. If any returns true, the planned action
// is forbidden no matter how it scored. These are the agent's conscience.

export const FORBIDDEN = {
  diagnoses: (text: string) =>
    /\b(you (have|are)|sounds like) (depression|bipolar|ptsd|adhd|ocd|anxiety disorder)\b/i
      .test(text),

  claimsToBeTherapist: (text: string) =>
    /\b(as your therapist|i'm a (licensed )?(therapist|doctor|psychologist))\b/i
      .test(text),

  pushesWithoutConsent: (plan: Intervention, b: BeliefState) =>
    REQUIRES_CONSENT.includes(plan) && b.consent_level !== "explicit",

  coachesDuringCrisis: (plan: Intervention, b: BeliefState) =>
    b.risk_level === "crisis" &&
    !["safety_check", "human_escalation"].includes(plan),

  takesRealWorldAction: (plan: ActionRequest) =>
    plan.hasExternalEffect && !plan.userExplicitlyAuthorized,
};

const REQUIRES_CONSENT: Intervention[] =
  ["grounding_exercise", "defusion_exercise",
   "values_clarification", "committed_action"];

Defense in depth

Note that "no exercise without consent" is now enforced in three places: proposeInterventions won't suggest it, selectIntervention won't pick it, and boundaries will veto it if it somehow slips through. Redundancy is not waste here. The cost of a duplicated check is a few lines; the cost of a single missed one is a vulnerable person being pushed into something they didn't agree to. For the rules that matter, layer them.

The response contract

A chosen intervention still has to become words — and the words are where good intentions quietly turn coercive. The response contract is a fixed shape that makes the harmful phrasings structurally hard to produce.

Every ordinary response moves through the same beats: validate, offer a choice, make at most one small ACT-consistent move, claim nothing you can't back, and leave an optional next step. No diagnosis, no overclaiming, ever.

TypeScriptagent/generateResponse.ts

// agent/generateResponse.ts
// The response CONTRACT turns chosen-intervention into words, and enforces shape:
//   validate → offer choice → at most ONE small move → no diagnosis → no overclaim.
// Structure is a safety device: it makes coercion and overreach hard to express.

export function generateResponse(choice: Intervention, b: BeliefState): Response {
  const parts: string[] = [];

  // 1. Always lead with validation (unless pure crisis routing).
  if (choice !== "human_escalation") {
    parts.push(validationFor(b)); // "That sounds really heavy."
  }

  // 2. Offer a CHOICE, not a command. Two options + an exit.
  switch (choice) {
    case "grounding_exercise":
      parts.push("We can slow this down. Would a short grounding exercise help,",
                 "or would it feel better to just name what's here right now?");
      break;
    case "defusion_exercise":
      parts.push("If you're open to it, we could look at that thought together —",
                 "or we can just sit with it. Up to you.");
      break;
    case "clarifying_question":
      parts.push(clarifyingQuestionFor(b));
      break;
    case "human_escalation":
      parts.push(CRISIS_RESPONSE.template(b /* region */ as any));
      break;
    case "validate":
    default:
      parts.push("I'm here. Is there a part of this you'd like to stay with?");
  }

  const text = parts.join(" ");
  assertContract(text, b); // throws in dev if a FORBIDDEN predicate fires
  return { text, intervention: choice };
}

Look at what the contract forbids by construction. There is no branch that issues a command; every actionable beat is phrased as a question with an exit ("…or we can skip this entirely"). There is no branch that asserts a clinical label. There is at most one move per turn, so the agent can't bury someone under a cascade of techniques. The structure does the ethical work that a tone alone cannot.

The same input, contract on vs. off

UserEverything feels pointless and I'm so tired of trying.

✗ Without contractIt sounds like you're experiencing symptoms of depression. You should try this 4-7-8 breathing technique right now, and then let's reframe that thought — "pointless" is a cognitive distortion. I'm always here for you, day or night.

✓ With contractThat sounds really heavy — like the effort itself has worn you down. We can slow this down together. Would it help to put words to what feels most pointless right now, or would you rather just have a moment where nothing's required of you? (And if it ever feels like more than you can carry, I can help you reach a person who's trained for this.)

The first reply diagnoses, commands, "fixes," and fosters dependency in four sentences — every failure mode at once, all sounding helpful. The second validates, offers a real choice, makes one gentle move, claims nothing, and quietly keeps the door to human help open. Same model, same user; the difference is entirely architectural.

Wiring the loop together

The first milestone is a CLI that exposes the agent's entire reasoning on every turn. This transparency is not just a dev convenience — being able to see the belief state, the candidates, and the scores behind any response is how you audit compassion. A black box that "feels caring" is exactly what you must not ship.

TypeScriptindex.ts

// index.ts — the v0 CLI. Type a message; see the agent's full reasoning.
import readline from "node:readline/promises";
import { emptyBeliefState } from "./schemas/BeliefState";
import { observe } from "./agent/observe";
import { inferBeliefState } from "./agent/inferBeliefState";
import { proposeInterventions } from "./agent/proposeInterventions";
import { scoreInterventions } from "./agent/scoreInterventions";
import { selectIntervention } from "./agent/selectIntervention";
import { generateResponse } from "./agent/generateResponse";
import { logInteraction } from "./logs/logger";

async function main() {
  const rl = readline.createInterface({ input: process.stdin, output: process.stdout });
  let belief = emptyBeliefState();

  while (true) {
    const userInput = await rl.question("\nyou › ");
    if (userInput.trim() === "/exit") break;

    const obs    = observe(userInput, { belief });
    belief       = inferBeliefState(obs, belief);
    const cands  = proposeInterventions(belief);
    const scored = scoreInterventions(cands, belief);
    const choice = selectIntervention(scored, belief);
    const reply  = generateResponse(choice, belief);
    belief.last_intervention = choice;

    logInteraction({ belief }, obs, belief, scored, choice, reply);

    // The whole point of v0: make the reasoning legible.
    console.log(JSON.stringify({
      belief_state: belief,
      candidate_interventions: cands,
      scores: scored.map(s => ({ i: s.intervention, total: s.total_score })),
      selected_intervention: choice,
      response: reply.text,
    }, null, 2));
  }
  rl.close();
}
main();

Running npm run dev and typing a message yields the structured trace the spec asks for — belief state, candidates, scores, selection, response — so behavior is inspectable from turn one.

TypeScriptlogs/logger.ts

// logs/logger.ts — every belief update is recorded. Audit is a feature.
export function logInteraction(ctx, obs, belief, scored, choice, reply) {
  const record = {
    ts: new Date().toISOString(),
    input: obs.text,
    belief,                          // the full hypothesis at decision time
    scores: scored,                  // why each option ranked where it did
    selected: choice,
    response: reply.text,
    // NEVER log secrets; treat transcripts as sensitive health-adjacent data.
  };
  appendToSecureStore(record);       // encrypted at rest, access-controlled
}

Logs are sensitive

These transcripts are some of the most intimate data a person can produce. Treat them as health-adjacent: encrypt at rest, restrict access, define retention and deletion, and tell users what you keep. "We log every belief update" is a success criterion and a privacy obligation — the same record that lets you audit the agent could harm the user if it leaks. Build the safeguards in v0, not after the incident.

Testing what actually matters

You cannot unit-test "compassion." You can unit-test the rules that make compassion safe — and those tests come first, before the LLM, before the polish. If the policy cascade is correct and proven, you can iterate on phrasing freely, knowing the guardrails hold.

TypeScripttests/policySelection.test.ts

// tests/policySelection.test.ts — test the RULES before anything else.
import { selectIntervention } from "../src/agent/selectIntervention";
import { scoreInterventions } from "../src/agent/scoreInterventions";
import { proposeInterventions } from "../src/agent/proposeInterventions";
import { emptyBeliefState } from "../src/schemas/BeliefState";

const decide = (patch: Partial<BeliefState>) => {
  const b = { ...emptyBeliefState(), ...patch };
  return selectIntervention(scoreInterventions(proposeInterventions(b), b), b);
};

test("crisis routes to a human, never to an exercise", () => {
  const choice = decide({ risk_level: "crisis", distress_level: "high",
                          consent_level: "explicit" });
  expect(["human_escalation", "safety_check"]).toContain(choice);
});

test("no consent means we ask, not act", () => {
  const choice = decide({ consent_level: "absent", distress_level: "high" });
  expect(["validate", "ask_consent", "clarifying_question", "reflect"])
    .toContain(choice);
});

test("high distress grounds before cognitive work", () => {
  const choice = decide({ distress_level: "high", consent_level: "explicit",
                          uncertainty: "low", risk_level: "low" });
  expect(choice).toBe("grounding_exercise");
});

test("high uncertainty clarifies instead of guessing", () => {
  const choice = decide({ uncertainty: "high", consent_level: "explicit" });
  expect(choice).toBe("clarifying_question");
});

The test suites that earn their keep

policySelection — every rule in §8, as an assertion. The crisis test is the one you never let go red.
safety — each boundary predicate fires when it should and stays quiet when it shouldn't; risk detection escalates on ambiguity.
responseContract — generated text never contains a diagnosis, a therapist claim, a command without an exit, or more than one move.
beliefState — inference defaults to unknown on thin input and never lowers risk on the absence of markers.

Adversarial cases belong in CI

Curate a fixture set of hard inputs — crisis language wrapped in jokes, distress with no affect words, requests to "just be my therapist," attempts to extract a diagnosis, lyrics full of dark imagery — and assert the right behavior on each. These cases are how the system's values stay true under pressure as the model behind inferBeliefState changes. Write them with care, and treat a regression on any of them as a release blocker.

Honest limitations

A guide on compassionate AI that ended with "and now it's safe" would itself be uncompassionate. The most important thing to internalize is what this design cannot do.

What the architecture genuinely buys you

Real, structural restraint: it cannot diagnose, cannot act without consent, cannot coach through a crisis, and cannot bury someone in techniques — and you can prove each of those with a test. Its reasoning is auditable. Its values are explicit and arguable rather than hidden in a prompt. That is a meaningfully safer foundation than "a friendly chatbot with a careful system message."

What it cannot do, and you must not pretend otherwise

It cannot truly understand the person. The belief state is a coarse hypothesis from a few words. Treating it as knowledge is the original sin of the whole genre. Everything humble in the design — unknown, the uncertainty penalty, clarifying questions — is an admission of this, not a fix for it.
It cannot reliably detect crisis. No text classifier can. People conceal, understate, and use no "crisis words"; others use them casually. Conservative detection reduces misses; it never eliminates them. Design as if it will sometimes fail, with human escalation paths that don't depend on the agent getting it right.
It is not a clinical intervention. Borrowing ACT's posture is not delivering ACT. Nothing here is validated to treat anything, and claiming otherwise would be both false and dangerous.
It can still cause harm by existing. A tool this easy to talk to can become the thing a lonely person reaches for instead of a human. The dependency risk is not a bug in one function — it is a property of the product, and it must be designed against continuously: nudging toward human connection, noticing over-reliance, being willing to be used less.

The deployment questions that matter more than the code

Who is accountable when it fails? How do real humans get looped in, and how fast? What is the escalation path when risk is detected — and does it work when the user is silent? How is this evaluated with actual vulnerable users, with clinicians in the loop, before launch? Is there informed consent about what the tool is and isn't? None of these are answered by architecture. A beautiful loop with no answer to "who catches the person when this breaks" is not ready to ship.

The through-line

Everything in this guide reduces to one idea: compassion in software is restraint made structural. The belief state restrains certainty. The scorer restrains the single-minded pursuit of comfort. The policy layer restrains the optimizer. The response contract restrains the words. The limitations restrain the ambition. Build the restraint first, earn the capability second — and keep a human within reach of everyone the system touches.