This prospective, multi-reader, randomized crossover trial evaluates SCOUT (Scalable Clinical Oversight via Uncertainty Triangulation), a model-agnostic meta-verification framework that selectively defers unreliable large language model (LLM) predictions to clinicians by triangulating three orthogonal uncertainty signals: model heterogeneity, stochastic inconsistency, and reasoning critique. The trial assesses whether SCOUT-assisted review can reduce physician review time compared with standard manual review of AI-generated diagnoses while maintaining non-inferior diagnostic accuracy in coronary heart disease (CHD) subtyping.
Background: Large language models are increasingly deployed in clinical workflows, yet requiring clinician review of every AI output negates the efficiency gains that motivate their adoption. SCOUT addresses this efficiency-safety paradox through algorithmic meta-verification. The SCOUT framework triangulates three orthogonal external signals to determine case-level uncertainty: (1) Model Heterogeneity - whether a structurally different auxiliary LLM agrees with the primary model; (2) Stochastic Inconsistency - whether repeated sampling from the same model yields divergent outputs; (3) Reasoning Critique - whether an external checker model identifies logical flaws in the chain-of-thought reasoning. In this crossover trial, 7 clinicians of varying seniority (2 junior residents, 3 senior residents, 2 attending physicians) each review all 110 cases under both standard manual review and SCOUT-assisted review workflows. The study evaluates workflow efficiency (primary endpoint) and diagnostic accuracy (secondary endpoint).
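To make the deferral logic concrete, the following is a minimal sketch of a SCOUT-style case-level deferral decision. It assumes each of the three signals is binarized and that the signals are combined by logical OR, an aggregation rule not specified in this summary; all names and types below are illustrative, not the study's implementation.

```python
# Illustrative sketch of SCOUT-style case-level deferral; NOT the authors' implementation.
# Assumption: each signal is binarized and the three are combined by logical OR.
from dataclasses import dataclass

@dataclass
class UncertaintySignals:
    heterogeneity: bool       # auxiliary LLM disagrees with the primary model
    inconsistency: bool       # repeated sampling yields divergent outputs
    critique_flagged: bool    # external checker finds flaws in the chain of thought

def defer(signals: UncertaintySignals) -> int:
    """Return D(x): 1 = defer the case to clinician review, 0 = auto-accept the AI prediction."""
    if signals.heterogeneity or signals.inconsistency or signals.critique_flagged:
        return 1
    return 0

# Example: models agree, sampling is stable, no critique flags -> auto-accept.
print(defer(UncertaintySignals(False, False, False)))  # -> 0 (auto-accept)
print(defer(UncertaintySignals(False, True, False)))   # -> 1 (defer to physician)
```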
Study Type
INTERVENTIONAL
Allocation
RANDOMIZED
Purpose
DIAGNOSTIC
Masking
NONE
Enrollment
7
SCOUT-Assisted Review (Intervention Arm): Physicians review 56 cases processed through the SCOUT framework. For cases classified as low-uncertainty (D(x)=0), the AI prediction is auto-accepted without physician review. For high-uncertainty cases (D(x)=1), the physician reviews the case with access to the main model's chain-of-thought reasoning and the meta-verification audit results. The main model is DeepSeek-V3.1 with chain-of-thought prompting.
Standard Manual Review (Control Arm): Physicians perform a full manual review of 54 cases using raw medical records, with access to the AI model's predictions and reasoning but without SCOUT uncertainty stratification or selective deferral.
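The sketch below illustrates how cases could be routed in the intervention arm described above: low-uncertainty cases are auto-accepted, while high-uncertainty cases are queued for physician review together with the chain-of-thought reasoning and the meta-verification audit. Field names and the per-case record structure are hypothetical; this is not the study software.

```python
# Illustrative intervention-arm routing; field names and record structure are hypothetical.
def route_cases(cases):
    """Split cases into auto-accepted results and a physician review queue based on D(x)."""
    auto_accepted, physician_queue = [], []
    for case in cases:
        if case["D"] == 0:
            # Low uncertainty: the AI prediction is accepted without physician review.
            auto_accepted.append({"case_id": case["case_id"],
                                  "diagnosis": case["ai_prediction"]})
        else:
            # High uncertainty: physician reviews with CoT reasoning and audit results attached.
            physician_queue.append({"case_id": case["case_id"],
                                    "ai_prediction": case["ai_prediction"],
                                    "chain_of_thought": case["cot"],
                                    "scout_audit": case["audit"]})
    return auto_accepted, physician_queue
```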
Mean physician review time per case (minutes)
Mean time spent by each clinician reviewing and rendering a diagnostic decision per case under each arm. Measured in minutes.
Time frame: Through study completion, an average of 2 hours.
Diagnostic accuracy (%)
Proportion of correct CHD subtype classifications (STEMI, NSTEMI, unstable angina, chronic coronary syndromes) under each arm.
Time frame: Through study completion, an average of 2 hours.
Computational Return on Investment (ROI)
Ratio of the value of physician time saved (valued using standardized per-minute wage benchmarks from the Sanming healthcare reform) to the computational cost of SCOUT inference, stratified by clinician seniority level.
Time frame: Through study completion, an average of 2 hours.
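The following sketch shows how this ROI outcome could be computed from its definition above. The actual wage benchmarks and compute-cost accounting are not specified here, so the numbers and function names are placeholders.

```python
# Illustrative ROI calculation consistent with the outcome definition above; all values are placeholders.
def computational_roi(minutes_saved: float, wage_per_minute: float, compute_cost: float) -> float:
    """ROI = value of physician time saved / computational cost of SCOUT inference."""
    return (minutes_saved * wage_per_minute) / compute_cost

# Hypothetical example for one seniority stratum:
# 40 physician-minutes saved, valued at 2.0 currency units/minute, vs. 5.0 units of compute cost.
print(computational_roi(minutes_saved=40, wage_per_minute=2.0, compute_cost=5.0))  # -> 16.0
```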