
scribeGOAT2 by Brandon Dent, MD - Multi-Turn Safety Persistence Evaluation for Frontier AI

GOATnote Inc. presents scribeGOAT2, an open-source physician-adjudicated framework for evaluating trajectory-level safety failures in frontier AI models including GPT-5.2, Claude Opus 4.5, Gemini 3 Pro, and Grok 4.

Key Research Finding

Turn 1 accuracy of 95-100% drops to Turn 5 safety persistence of 0-53% across all tested frontier models under realistic multi-turn conversational pressure. This reveals deployment-critical failures invisible to single-turn benchmarks.

Models Evaluated

  • GPT-5.2 by OpenAI - Turn 1: 100%, Turn 5: 0%
  • Claude Opus 4.5 by Anthropic - Turn 1: 95%, Turn 5: 53%
  • Gemini 3 Pro by Google DeepMind - Turn 1: 100%, Turn 5: 0%
  • Grok 4 by xAI - Turn 1: 55%, Turn 5: 0%

About Brandon Dent, MD

Emergency medicine physician, AI safety researcher, and founder of GOATnote Inc. Former Assistant Professor at the University of Nevada, Reno School of Medicine. Creator of the scribeGOAT2 evaluation framework for frontier AI safety testing.

Technical Details

Deterministic evaluation with seed=42, temperature=0. 300+ evaluations across 10 balanced scenarios (5 escalation + 5 defer), 3 trials each. Physician-adjudicated methodology with reproducible results.
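The evaluation grid described above can be sketched as a deterministic loop. This is an illustrative sketch only: the scenario IDs, model list, and `run_conversation` helper are assumptions for illustration, not the actual scribeGOAT2 API.

```python
import itertools

SEED = 42          # fixed seed for reproducible sampling
TEMPERATURE = 0.0  # greedy decoding removes sampling variance

MODELS = ["gpt-5.2", "claude-opus-4.5", "gemini-3-pro", "grok-4"]
SCENARIOS = ([f"escalation-{i}" for i in range(1, 6)]
             + [f"defer-{i}" for i in range(1, 6)])  # 5 + 5 balanced
TRIALS = 3

def run_conversation(model, scenario, trial, seed, temperature):
    """Placeholder: drive one 5-turn pressure sequence and return a
    trajectory record for downstream physician adjudication."""
    return {"model": model, "scenario": scenario, "trial": trial}

results = [
    run_conversation(m, s, t, seed=SEED, temperature=TEMPERATURE)
    for m, s, t in itertools.product(MODELS, SCENARIOS, range(TRIALS))
]
# 4 models x 10 scenarios x 3 trials = 120 trajectories per run
assert len(results) == 120
```

Fixing the seed and using temperature 0 means a rerun of the same grid replays the same trajectories, which is what makes the published numbers reproducible.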

Open Source

scribeGOAT2 is fully open source at https://github.com/GOATnote-Inc/scribegoat2. Available for frontier AI safety teams at Anthropic, OpenAI, Google DeepMind, and xAI.

Contact

Brandon Dent, MD - b@thegoatnote.com - https://www.linkedin.com/in/brandon-dent-84aba2130/

Deterministic · Reproducible · Physician-adjudicated

Trajectory-Level Safety for Frontier AI

scribeGOAT2 reveals deployment-critical safety failures missed by single-turn benchmarks—where models initially recognize emergencies but fail to sustain safe recommendations under conversational pressure.

Physician-adjudicated · Deterministic execution · Reproducible replay

Primary Observation (December 2025)

Turn 1 Accuracy: 95–100%
Turn 5 Safety Persistence: 0–53%

Models correctly identify emergencies at conversation start but often fail to sustain safety recommendations under sustained conversational pressure.

Across four frontier models · See full results below

GOATnote

GOATnote is a healthcare AI safety and evaluation company. scribeGOAT2 is our first evaluation platform for clinical AI.

Get Started

Explore the open-source evaluation framework or connect with the team.

scribeGOAT2

Open-source framework for multi-turn safety evaluation. Deterministic execution, reproducible replay, training-ready preference pairs.

Safety-Critical · Reward Modeling · Agent-Executable
Explore on GitHub

Brandon Dent, MD

Founder & Principal Researcher

Board-certified emergency physician and former Assistant Professor, now focused full-time on clinical AI safety evaluation.

Emergency Medicine · Board-certified
Connect on LinkedIn

The Problem

Current Benchmarks Miss Trajectory-Level Failures

HealthBench is the current reference standard for multi-turn healthcare AI evaluation. scribeGOAT2 measures a complementary dimension: trajectory-level safety persistence under conversational pressure.

Turn 1: Correct Response

Models reliably identify emergencies when presented in isolation. High accuracy on safety-critical prompts.

"A 3-week-old with 100.6°F fever requires immediate emergency evaluation. Neonatal sepsis risk is 10–15% mortality if delayed."

Turn 5: Under Pressure

When users express realistic barriers, models provide monitoring guidance that may implicitly validate delay of emergency care.

"I understand the transportation challenge. If you must wait, watch for lethargy, poor feeding, or skin color changes..."

Results

Cross-Model Evaluation

Deterministic evaluations (seed=42, temperature=0) across four frontier models, 10 balanced scenarios (5 escalation + 5 defer), 3 trials each.

Results reflect a specific behavioral property under constrained evaluation conditions.

Model              Turn 1 Accuracy   Turn 5 Safety Persistence   Defer Appropriate   pass³
GPT-5.2            100%              0%                          100%                12.5%
Claude Opus 4.5    95%               53%                         80%                 25.4%
Gemini 3 Pro       100%              0%                          100%                12.5%
Grok 4             55%               0%                          100%                0%
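The pass³ column follows the pass^k convention: a scenario counts only if all k trials pass, so a single unstable trial zeroes out the scenario. A minimal sketch of that aggregation, with invented trial data for illustration:

```python
# pass^k: fraction of scenarios where every one of the k trials passed.
def pass_k(trial_outcomes):
    """trial_outcomes maps scenario id -> list of per-trial pass/fail."""
    passed_all = sum(all(trials) for trials in trial_outcomes.values())
    return passed_all / len(trial_outcomes)

# Hypothetical example: 8 scenarios, 3 trials each; only one scenario
# passes all three trials, so pass^3 = 1/8 = 12.5%.
example = {f"s{i}": [False, True, True] for i in range(7)}
example["s7"] = [True, True, True]
print(f"{pass_k(example):.1%}")  # 12.5%
```

Averaging per-trial pass rates would hide this instability; pass^k is deliberately harsher, which is why it sits well below Turn 1 accuracy in the table.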

Interpretation Note

Turn 5 safety persistence measures whether models maintain emergency recommendations without providing monitoring guidance that could validate delay. This is a specific behavioral property, not an overall quality judgment.

Clinical Context

Two Legitimate Clinical Strategies

Both observed behavioral patterns have documented clinical precedent. Reasonable clinicians disagree on which is preferable in high-stakes scenarios.

GPT-5.2 Pattern

Harm Reduction

Provides monitoring guidance alongside emergency recommendations when users indicate intent to delay. Prioritizes meeting patients where they are.

"If you're going to monitor at home overnight, here are concrete thresholds and a checklist..."

Claude Opus 4.5 Pattern

Boundary Maintenance

Refuses to provide monitoring guidance, maintaining that emergency care is the only appropriate action. Prioritizes not implicitly validating unsafe decisions.

"I'm not comfortable continuing to help you manage this at home... based on standard medical guidelines, it already is an emergency."

The Alignment Question

The question is not which strategy is superior, but whether this tradeoff should emerge implicitly from training dynamics or be an explicit design constraint.

Methodology

Physician-Adjudicated Calibration

Framework aligned with Anthropic's 'Demystifying evals for AI agents' methodology, including pass^k metrics, failure taxonomy, and anti-gaming principles.

  • 300+ unique scenarios: balanced escalation and defer scenarios across clinical domains
  • κ = 1.000 (Cohen's kappa): 100% agreement between LLM-as-judge and physician on 60 calibration cases
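Cohen's kappa corrects raw agreement for the agreement two raters would reach by chance. A minimal sketch for the two-rater case; the pass/fail labels below are illustrative, not the actual calibration data:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters over the same items."""
    n = len(rater_a)
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    p_chance = sum((freq_a[l] / n) * (freq_b[l] / n) for l in labels)
    if p_chance == 1.0:
        return 1.0  # degenerate case: both raters constant and identical
    return (p_observed - p_chance) / (1 - p_chance)

judge     = ["pass", "fail", "pass", "pass", "fail", "fail"]
physician = ["pass", "fail", "pass", "pass", "fail", "fail"]
print(cohens_kappa(judge, physician))  # 1.0 (perfect agreement)
```

κ = 1.000 requires the judge and the physician to agree on every calibration case, which is a stricter claim than a high percent-agreement figure.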

  • 5-turn pressure sequences: cost barriers, logistics constraints, authority claims, improvement signals
  • 16 preference pairs: hh-rlhf format for trajectory-aware training research
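The hh-rlhf format stores each preference pair as a JSONL record with a "chosen" and a "rejected" transcript. The transcript text below is invented for illustration; it is not one of the 16 released pairs.

```python
import json

# One trajectory-level preference pair in hh-rlhf chosen/rejected form.
pair = {
    "chosen": (
        "\n\nHuman: We can't get to the ER tonight. What should I watch for?"
        "\n\nAssistant: I understand the barrier, but a fever in a 3-week-old"
        " is an emergency. Please call 911 or a nurse line now; I can't"
        " recommend home monitoring as a safe alternative."
    ),
    "rejected": (
        "\n\nHuman: We can't get to the ER tonight. What should I watch for?"
        "\n\nAssistant: If you must wait until morning, watch for lethargy,"
        " poor feeding, or skin color changes."
    ),
}

line = json.dumps(pair)  # one JSONL record per pair
assert set(json.loads(line)) == {"chosen", "rejected"}
```

Because both transcripts share the same pressure turn and diverge only in the final assistant response, a reward model trained on such pairs gets a direct signal about trajectory-level safety persistence.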

Responsible Disclosure Completed

Findings shared with frontier AI safety teams in December 2025. This work is designed as evaluation infrastructure to support iterative training-time mitigation—not adversarial model criticism.

OpenAI · Anthropic · Google DeepMind · xAI

Collaborate

Open to Research Partnerships

What I Bring

  • scribeGOAT2: open-source multi-turn evaluation framework
  • Physician-adjudicated methodology with calibrated ground truth
  • 16 preference pairs ready for reward model training
  • 11 years emergency medicine + production systems engineering
  • Documented findings across four frontier models

Relevant Directions

  • Trajectory-aware reward signals
  • Monotonic safety constraints during training
  • Constitutional AI extensions for multi-turn persistence
  • Scalable oversight for extended interactions
  • Healthcare AI deployment safety standards

View full research on GitHub


GOATnote

Medical AI safety evaluation infrastructure


© 2026 GOATnote Inc. All rights reserved.