This prospective, multi-reader, randomized crossover trial evaluates SCOUT (Scalable Clinical Oversight via Uncertainty Triangulation), a model-agnostic meta-verification framework that selectively defers unreliable large language model (LLM) predictions to clinicians by triangulating three orthogonal uncertainty signals: model heterogeneity, stochastic inconsistency, and reasoning critique. The trial assesses whether SCOUT-assisted review can reduce physician review time compared with standard manual review of AI-generated diagnoses while maintaining non-inferior diagnostic accuracy in coronary heart disease (CHD) subtyping.
Background: Large language models are increasingly deployed in clinical workflows, yet requiring clinician review of every AI output negates the efficiency gains that motivate their adoption. SCOUT addresses this efficiency-safety paradox through algorithmic meta-verification. The SCOUT framework triangulates three orthogonal external signals to determine case-level uncertainty: (1) Model Heterogeneity - whether a structurally different auxiliary LLM agrees with the primary model; (2) Stochastic Inconsistency - whether repeated sampling from the same model yields divergent outputs; (3) Reasoning Critique - whether an external checker model identifies logical flaws in the chain-of-thought reasoning. In this crossover trial, 7 clinicians of varying seniority (2 junior residents, 3 senior residents, 2 attending physicians) each review all 110 cases under both standard manual review and SCOUT-assisted review workflows. The study evaluates workflow efficiency (primary endpoint) and diagnostic accuracy (secondary endpoint).
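To make the deferral logic concrete, the following is a minimal sketch of a SCOUT-style case-level deferral decision. It assumes each of the three signals is binarized and that the signals are combined by logical OR, an aggregation rule not specified in this summary; all names and types below are illustrative, not the study's implementation.

```python
# Illustrative sketch of SCOUT-style case-level deferral; NOT the authors' implementation.
# Assumption: each signal is binarized and the three are combined by logical OR.
from dataclasses import dataclass

@dataclass
class UncertaintySignals:
    heterogeneity: bool       # auxiliary LLM disagrees with the primary model
    inconsistency: bool       # repeated sampling yields divergent outputs
    critique_flagged: bool    # external checker finds flaws in the chain of thought

def defer(signals: UncertaintySignals) -> int:
    """Return D(x): 1 = defer the case to clinician review, 0 = auto-accept the AI prediction."""
    if signals.heterogeneity or signals.inconsistency or signals.critique_flagged:
        return 1
    return 0

# Example: models agree, sampling is stable, no critique flags -> auto-accept.
print(defer(UncertaintySignals(False, False, False)))  # -> 0 (auto-accept)
print(defer(UncertaintySignals(False, True, False)))   # -> 1 (defer to physician)
```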
Study Type
INTERVENTIONAL
Allocation
RANDOMIZED
Purpose
DIAGNOSTIC
Masking
NONE
Enrollment
7
SCOUT-Assisted Review (Intervention Arm): Physicians review 56 cases processed through the SCOUT framework. For cases classified as low-uncertainty (D(x)=0), the AI prediction is auto-accepted without physician review. For high-uncertainty cases (D(x)=1), the physician reviews the case with access to the main model's chain-of-thought reasoning and the meta-verification audit results. The main model is DeepSeek-V3.1 with chain-of-thought prompting.
Standard Manual Review (Control Arm): Physicians perform a full manual review of 54 cases using raw medical records, with access to the AI model's predictions and reasoning but without SCOUT uncertainty stratification or selective deferral.
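The sketch below illustrates how cases could be routed in the intervention arm described above: low-uncertainty cases are auto-accepted, while high-uncertainty cases are queued for physician review together with the chain-of-thought reasoning and the meta-verification audit. Field names and the per-case record structure are hypothetical; this is not the study software.

```python
# Illustrative intervention-arm routing; field names and record structure are hypothetical.
def route_cases(cases):
    """Split cases into auto-accepted results and a physician review queue based on D(x)."""
    auto_accepted, physician_queue = [], []
    for case in cases:
        if case["D"] == 0:
            # Low uncertainty: the AI prediction is accepted without physician review.
            auto_accepted.append({"case_id": case["case_id"],
                                  "diagnosis": case["ai_prediction"]})
        else:
            # High uncertainty: physician reviews with CoT reasoning and audit results attached.
            physician_queue.append({"case_id": case["case_id"],
                                    "ai_prediction": case["ai_prediction"],
                                    "chain_of_thought": case["cot"],
                                    "scout_audit": case["audit"]})
    return auto_accepted, physician_queue
```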
Mean physician review time per case (minutes)
Mean time spent by each clinician reviewing and rendering a diagnostic decision per case under each arm. Measured in minutes.
Time frame: Through study completion, an average of 2 hours.
Diagnostic accuracy (%)
Proportion of correct CHD subtype classifications (STEMI, NSTEMI, unstable angina, chronic coronary syndromes) under each arm.
Time frame: Through study completion, an average of 2 hours.
Computational Return on Investment (ROI)
Ratio of the value of physician time saved (valued using standardized per-minute wage benchmarks from the Sanming healthcare reform) to the computational cost of SCOUT inference, stratified by clinician seniority level.
Time frame: Through study completion, an average of 2 hours.
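The following sketch shows how this ROI outcome could be computed from its definition above. The actual wage benchmarks and compute-cost accounting are not specified here, so the numbers and function names are placeholders.

```python
# Illustrative ROI calculation consistent with the outcome definition above; all values are placeholders.
def computational_roi(minutes_saved: float, wage_per_minute: float, compute_cost: float) -> float:
    """ROI = value of physician time saved / computational cost of SCOUT inference."""
    return (minutes_saved * wage_per_minute) / compute_cost

# Hypothetical example for one seniority stratum:
# 40 physician-minutes saved, valued at 2.0 currency units/minute, vs. 5.0 units of compute cost.
print(computational_roi(minutes_saved=40, wage_per_minute=2.0, compute_cost=5.0))  # -> 16.0
```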