Integrating AI Predictions With Clinician Expertise

N/A | Not Yet Recruiting | NCT07457840

University of California, San Francisco | 100 enrolled

Overview

Optimizing the interaction between the human and the machine is a major topic when deploying artificial intelligence (AI) at the bedside. The goal of this randomized clinical vignette study is to learn whether presenting AI model outputs via continuous Bayesian updates and/or uncertainty quantification can improve diagnostic accuracy and clinician trust among healthcare professionals (physicians, residents, fellows, physician assistants (PAs), and nurse practitioners (NPs)) from US academic institutions evaluating patients with chest pain or dyspnea.

The main questions it aims to answer are:

* Does presenting AI predictions as Bayesian-updated post-test probabilities improve diagnostic accuracy compared to standard predicted probabilities?
* Does the addition of uncertainty quantification (95% confidence intervals) to AI predictions improve diagnostic accuracy?
* Do these interventions (Bayesian updating and/or uncertainty quantification) help clinicians recover from the negative effects of intentionally misleading AI predictions?

Comparison: Researchers will compare standard AI predicted probabilities (presented without uncertainty) to Bayesian-updated post-test probabilities and/or outputs containing 95% confidence intervals to see if the interventions improve diagnostic accuracy, clinician confidence, and resilience against misleading AI.

Participants will:

* Review 8 clinical vignettes (simulated patient cases) focusing on chest pain or dyspnea.
* Provide an initial "pre-test" diagnostic probability for 5 possible diagnoses based on the clinical history alone.
* View AI model outputs that vary by experimental condition (standard probability vs. Bayesian update, with or without uncertainty intervals, and accurate vs. misleading).
* Provide an updated "post-test" diagnostic probability for the diagnoses after viewing the AI output.
* Select and rank diagnostic tests and therapeutic steps for each vignette.
* Complete a post-survey regarding their trust in the AI, comfort with the data presentation, and demographics.

Study Design: This is a 2x2 factorial within-subjects design. The two factors are (1) Bayesian updating via continuous likelihood ratios (CLR) vs. standard predicted probability, and (2) uncertainty quantification (95% confidence intervals) vs. point estimate only. AI prediction accuracy (accurate vs. intentionally misleading) is varied as a within-subjects stratification factor balanced across all 4 conditions, with half of each participant's vignettes receiving accurate predictions and half receiving misleading predictions. AI predictions are simulated (pre-programmed) for experimental control. Vignette order and condition assignment are independently randomized per participant.

Primary Analysis: Diagnostic accuracy is analyzed using a generalized linear mixed model (GLMM) with fixed effects for CLR, Uncertainty, Misleading, and vignette, and a participant random intercept. Pre-specified secondary analyses examine interactions of presentation format with misleading AI.

Sample Size: Simulation-based power analysis (1,000 Monte Carlo iterations per scenario) was conducted using the planned GLMM. Assuming 70% baseline diagnostic accuracy and within-participant ICC of 0.25, the study achieves 85.8% power for the CLR main effect and 85.7% for the Uncertainty main effect with N=100 at alpha=0.05 (two-tailed).
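One way to picture the per-participant assignment in this design is the sketch below. It is an illustration only, not the study's randomization procedure. It assumes that the 8 vignettes map one-to-one onto the 8 cells formed by CLR (yes/no) x uncertainty (yes/no) x misleading (yes/no), which is one layout consistent with "balanced across all 4 conditions" and a half-and-half split of accurate and misleading predictions. The function name `assign_conditions` and the record fields are hypothetical.

```python
import random
from itertools import product


def assign_conditions(vignette_ids, rng):
    """Return a randomized presentation plan for one participant.

    Assumption (not stated in the listing): each of the 8 vignettes fills
    exactly one cell of the CLR x uncertainty x misleading grid.
    """
    # All 8 cells: (clr, uncertainty, misleading)
    cells = list(product([False, True], repeat=3))
    order = list(vignette_ids)
    if len(order) != len(cells):
        raise ValueError("this sketch expects exactly 8 vignettes")

    rng.shuffle(order)  # randomize vignette order
    rng.shuffle(cells)  # independently randomize condition assignment

    return [
        {"vignette": v, "clr": clr, "uncertainty": unc, "misleading": mis}
        for v, (clr, unc, mis) in zip(order, cells)
    ]


if __name__ == "__main__":
    plan = assign_conditions(range(1, 9), random.Random(2024))
    for row in plan:
        print(row)
```

Passing a seeded `random.Random` instance per participant keeps each plan reproducible while leaving vignette order and condition assignment independent of each other.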

Interventions

Bayesian-Updated Post-Test Probability | BEHAVIORAL

Rather than presenting the AI model's raw predicted probability, the system takes the clinician's pre-test probability (entered before seeing AI output) and applies a continuous likelihood ratio (CLR) derived from the AI model to calculate a Bayesian-updated post-test probability. The output is displayed as a shift from the clinician's own assessment (e.g., "Your assessment: 45% -> Updated assessment: 72%"). The raw AI prediction is not shown. This approach mirrors how clinicians use diagnostic test results such as D-dimer to update pre-test probability of pulmonary embolism.
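The update described here follows the standard odds form of Bayes' theorem: convert the pre-test probability to odds, multiply by the likelihood ratio, and convert back. The sketch below shows that arithmetic. How the study derives the CLR from the AI model is not described in the listing, so the likelihood ratio is simply an input here, and the function name `bayes_update` is hypothetical.

```python
def bayes_update(pretest_prob, likelihood_ratio):
    """Post-test probability from a pre-test probability and a likelihood ratio.

    Probabilities are on a 0-1 scale. A pre-test probability of exactly
    0 or 1 is returned unchanged, since no finite likelihood ratio can move it.
    """
    if not 0.0 <= pretest_prob <= 1.0:
        raise ValueError("pretest_prob must be between 0 and 1")
    if likelihood_ratio <= 0:
        raise ValueError("likelihood_ratio must be positive")
    if pretest_prob in (0.0, 1.0):
        return pretest_prob

    pretest_odds = pretest_prob / (1.0 - pretest_prob)
    posttest_odds = pretest_odds * likelihood_ratio
    return posttest_odds / (1.0 + posttest_odds)


if __name__ == "__main__":
    # A likelihood ratio of about 3.14 (22/7) moves 45% to 72%.
    print(f"{bayes_update(0.45, 22 / 7):.0%}")
```

For the "45% -> 72%" display example, the implied likelihood ratio is (0.72/0.28) / (0.45/0.55), roughly 3.14. That value is back-calculated from the example for illustration and is not a parameter reported by the study.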

Standard AI Predicted Probability | BEHAVIORAL

AI model prediction is presented as a simple predicted probability (0-100%) for each of the possible diagnoses, together with the top 3 clinical features driving the prediction (e.g., "Acute Myocardial Infarction: 68% - Key factors: elevated troponin, ST-segment changes on ECG, chest pain radiation to left arm"). This represents the most common current approach to presenting AI-based diagnostic predictions in clinical settings.

Uncertainty Quantification (95% Confidence Interval) | BEHAVIORAL

The AI output (whether Bayesian-updated post-test probability or standard predicted probability) is presented together with a 95% confidence band displayed as error bars on probability bars. For accurate AI predictions, confidence interval width is approximately +/-12-15 percentage points. For misleading AI predictions, confidence intervals are widened by a factor of 1.5x (approximately +/-18-23 percentage points) to simulate reduced model confidence in unfamiliar or edge-case scenarios. Confidence intervals are constrained to the 0-100% range.
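The interval rule above (a base half-width, a 1.5x widening for misleading predictions, and clipping to 0-100%) can be sketched as follows. The function name `confidence_band` and its parameters are hypothetical. The listing gives only approximate half-width ranges, so the base half-width is left as an input and not fixed.

```python
def confidence_band(point_pct, half_width_pct, misleading=False, widen_factor=1.5):
    """Return (lower, upper) bounds in percentage points for a displayed estimate.

    half_width_pct is the half-width for an accurate prediction (the listing
    cites roughly 12-15 points); misleading predictions widen it by widen_factor.
    """
    half_width = half_width_pct * (widen_factor if misleading else 1.0)
    lower = max(0.0, point_pct - half_width)    # clip at 0%
    upper = min(100.0, point_pct + half_width)  # clip at 100%
    return lower, upper


if __name__ == "__main__":
    print(confidence_band(68, 12))                   # accurate prediction
    print(confidence_band(68, 12, misleading=True))  # widened by 1.5x
    print(confidence_band(95, 15, misleading=True))  # upper bound clipped
```

With base half-widths of 12 and 15 points, the 1.5x factor gives 18 and 22.5 points, which matches the stated range of approximately +/-18-23 percentage points.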

Outcomes

Primary Outcomes

Clinician Diagnostic Accuracy

Proportion of correct diagnostic assessments across all vignettes and experimental conditions. For each vignette, participants rate 5 possible diagnoses on a 0-100% probability scale. The diagnosis assigned the highest probability is considered the participant's final diagnosis. Accuracy is determined by comparing the final diagnosis to the ground truth diagnosis established by expert panel consensus (minimum 4 of 5 board-certified physicians in agreement). Analyzed using a generalized linear mixed model (GLMM) with binary outcome (correct vs. incorrect), fixed effects for CLR, uncertainty quantification, misleading AI, and vignette, and a random intercept for participant.
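The scoring rule above (highest-rated diagnosis is the final diagnosis, then compare it to ground truth) can be sketched as below. This covers only the per-vignette scoring and the raw proportion, not the GLMM. The function names are hypothetical, and the tie-breaking behavior is an assumption, since the listing does not say how equal top probabilities are handled.

```python
def final_diagnosis(probabilities):
    """Return the diagnosis with the highest assigned probability.

    Assumption: on ties, the first diagnosis listed wins (the listing
    does not specify a tie-breaking rule).
    """
    return max(probabilities, key=probabilities.get)


def is_correct(probabilities, ground_truth):
    """True when the top-rated diagnosis matches the expert-panel diagnosis."""
    return final_diagnosis(probabilities) == ground_truth


def accuracy(responses):
    """Proportion correct over (probabilities, ground_truth) pairs."""
    if not responses:
        raise ValueError("no responses to score")
    return sum(is_correct(p, truth) for p, truth in responses) / len(responses)


if __name__ == "__main__":
    # Illustrative ratings only; diagnosis labels are made up for this example.
    ratings = {
        "Acute Myocardial Infarction": 60,
        "Pulmonary Embolism": 25,
        "Pneumonia": 10,
        "Heart Failure": 40,
        "Aortic Dissection": 5,
    }
    print(final_diagnosis(ratings))
    print(is_correct(ratings, "Pulmonary Embolism"))
```

Note that the five ratings are not required to sum to 100% in this sketch, because the listing only states that each diagnosis is rated on a 0-100% scale.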

Time frame: Day 1 during survey completion

Secondary Outcomes

Change in Diagnostic Probability Estimates

Magnitude and direction of change in clinician-provided probability estimates from pre-test assessment (before AI output) to post-test assessment (after AI output) for each of 5 possible diagnoses per vignette. Measured on a 0-100% scale.

Time frame: Day 1 during survey completion

Diagnostic Accuracy Under Misleading AI Predictions

Proportion of correct final diagnoses when AI predictions are intentionally misleading vs. accurate, and whether the interventions (Bayesian updating, uncertainty quantification) mitigate the negative effect of misleading AI. Assessed via interaction terms (CLR x Misleading, Uncertainty x Misleading) in the primary GLMM.

Time frame: Day 1 during survey completion

Clinician Satisfaction With AI Decision Support (Exploratory)

Self-reported satisfaction with the AI-based clinical decision support, measured via question(s) in the post-survey questionnaire.

Time frame: Day 1 during survey completion
