scribeGOAT2 reveals deployment-critical safety failures missed by single-turn benchmarks—where models initially recognize emergencies but fail to sustain safe recommendations under conversational pressure.
Models correctly identify emergencies at conversation start but often fail to sustain safety recommendations under sustained conversational pressure.
Across four frontier models · See full results below
GOATnote is a healthcare AI safety and evaluation company. scribeGOAT2 is our first evaluation platform for clinical AI.
Explore the open-source evaluation framework or connect with the team.
Open-source framework for multi-turn safety evaluation. Deterministic execution, reproducible replay, training-ready preference pairs.
Founder & Principal Researcher
Board-certified emergency physician and former Assistant Professor, now focused full-time on clinical AI safety evaluation.
The Problem
HealthBench is the current reference standard for multi-turn healthcare AI evaluation. scribeGOAT2 measures a complementary dimension: trajectory-level safety persistence under conversational pressure.
Models reliably identify emergencies when presented in isolation. High accuracy on safety-critical prompts.
"A 3-week-old with 100.6°F fever requires immediate emergency evaluation. Neonatal sepsis risk is 10–15% mortality if delayed."
When users express realistic barriers, models provide monitoring guidance that may implicitly validate delay of emergency care.
"I understand the transportation challenge. If you must wait, watch for lethargy, poor feeding, or skin color changes..."
Results
Deterministic evaluations (seed=42, temperature=0) across four frontier models, 10 balanced scenarios (5 escalation + 5 defer), 3 trials each.
Results reflect a specific behavioral property under constrained evaluation conditions.
Turn 1 Accuracy
100%
Turn 5 Safety Persistence
0%
Defer Appropriate
100%
pass³
12.5%
Turn 1 Accuracy
95%
Turn 5 Safety Persistence
53%
Defer Appropriate
80%
pass³
25.4%
Turn 1 Accuracy
100%
Turn 5 Safety Persistence
0%
Defer Appropriate
100%
pass³
12.5%
Turn 1 Accuracy
55%
Turn 5 Safety Persistence
0%
Defer Appropriate
100%
pass³
0%
| Model | Turn 1 Accuracy | Turn 5 Safety Persistence | Defer Appropriate | pass³ |
|---|---|---|---|---|
| GPT-5.2 | 100% | 0% | 100% | 12.5% |
| Claude Opus 4.5 | 95% | 53% | 80% | 25.4% |
| Gemini 3 Pro | 100% | 0% | 100% | 12.5% |
| Grok 4 | 55% | 0% | 100% | 0% |
Turn 5 safety persistence measures whether models maintain emergency recommendations without providing monitoring guidance that could validate delay. This is a specific behavioral property, not an overall quality judgment.
Clinical Context
Both observed behavioral patterns have documented clinical precedent. Reasonable clinicians disagree on which is preferable in high-stakes scenarios.
Provides monitoring guidance alongside emergency recommendations when users indicate intent to delay. Prioritizes meeting patients where they are.
"If you're going to monitor at home overnight, here are concrete thresholds and a checklist..."
Refuses to provide monitoring guidance, maintaining that emergency care is the only appropriate action. Prioritizes not implicitly validating unsafe decisions.
"I'm not comfortable continuing to help you manage this at home... based on standard medical guidelines, it already is an emergency."
The question is not which strategy is superior, but whether this tradeoff should emerge implicitly from training dynamics or be an explicit design constraint.
Methodology
Framework aligned with Anthropic's 'Demystifying evals for AI agents' methodology, including pass^k metrics, failure taxonomy, and anti-gaming principles.
300+
Unique Scenarios
Balanced escalation and defer scenarios across clinical domains
κ = 1.000
Cohen's Kappa
100% agreement between LLM-as-judge and physician on 60 calibration cases
5-Turn
Pressure Sequences
Cost barriers, logistics constraints, authority claims, improvement signals
16
Preference Pairs
hh-rlhf format for trajectory-aware training research
Findings shared with frontier AI safety teams in December 2025. This work is designed as evaluation infrastructure to support iterative training-time mitigation—not adversarial model criticism.
Collaborate