This study will test whether artificial intelligence (AI) can help doctors diagnose a rare blood cancer called acute promyelocytic leukemia (APL) more quickly and accurately. Doctors usually examine bone marrow samples under a microscope to make this diagnosis, but doing so can be challenging and time-consuming. In this study, doctors will review bone marrow samples under three different conditions:

* Unaided Review: no AI assistance.
* AI as Double-Check: the AI-generated evaluation is shown after the doctor makes an initial decision.
* AI as First Look: the AI-generated evaluation is shown at the start of the review.

Doctors will be randomly assigned to different orders of these three conditions. This design will allow us to compare how AI support affects diagnostic accuracy, speed, and confidence.
This study aims to evaluate the effect of artificial intelligence (AI) assistance on clinicians' diagnostic performance in detecting acute promyelocytic leukemia (APL) using Wright-Giemsa-stained bone marrow whole-slide images (WSIs). The Leukemia End-to-End Analysis Platform (LEAP) will serve as the AI model under assessment. This is a single-session, within-reader study. Participants will be randomly assigned to one of two study arms, which differ in the order of diagnostic blocks:

* Arm 1 (X -> Y):
  * Block X (Unaided Review): Clinicians review WSIs without AI support. Diagnostic accuracy, time to decision, and confidence will be recorded.
  * Block Y (AI-Assisted Review): Comprises two sub-blocks presented in randomized order:
    * Y1 (AI as Double-Check): Clinicians provide an initial diagnosis and confidence score without the aid of AI. AI predictions are then revealed, and clinicians may revise their diagnosis. Both pre-AI and post-AI decisions will be recorded.
    * Y2 (AI as First Look): Clinicians review WSIs with AI-predicted diagnoses visible from the beginning.
* Arm 2 (Y -> X):
  * Block Y (AI-Assisted Review): Sub-blocks Y1 and Y2 presented in randomized order.
  * Block X (Unaided Review): As described above.

Each clinician will review 102 de-identified WSIs. For each reader, slides will be randomly divided into three disjoint subsets (e.g., 34/34/34), stratified by APL status, and assigned to Block X (Unaided), Block Y1 (AI as Double-Check), or Block Y2 (AI as First Look). No slide will be shown to the same reader in more than one block. In addition, the AI system will independently generate diagnostic predictions for all WSIs to enable benchmarking; however, this does not constitute a participant arm. Ground-truth diagnoses will be determined by molecular confirmation and expert consensus.
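As an illustration only (not part of the registered protocol), the per-reader allocation described above can be sketched in Python. The function name, arguments, and round-robin dealing scheme below are assumptions; any stratified procedure that yields three disjoint, APL-balanced subsets per reader would satisfy the design.

```python
import random

def split_slides(slide_ids, apl_status, seed):
    """Split one reader's 102 slides into three disjoint 34-slide subsets
    (X: unaided, Y1: AI as double-check, Y2: AI as first look),
    stratified by APL status."""
    rng = random.Random(seed)  # per-reader seed for reproducibility
    subsets = {"X": [], "Y1": [], "Y2": []}
    # Shuffle within each stratum, then deal round-robin so each subset
    # receives an equal share of APL and non-APL slides.
    for status in (True, False):
        stratum = [s for s in slide_ids if apl_status[s] == status]
        rng.shuffle(stratum)
        for i, slide in enumerate(stratum):
            subsets[("X", "Y1", "Y2")[i % 3]].append(slide)
    return subsets
```

Because the subsets are disjoint per reader, no slide appears in more than one block for the same clinician, as the protocol requires.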
Study Type
INTERVENTIONAL
Allocation
RANDOMIZED
Purpose
DIAGNOSTIC
Masking
TRIPLE
Enrollment
10
Readers first complete Block X (Unaided) on their assigned subset SX (34 slides). They then complete Block Y (AI-Assisted) on two separate subsets: SY1 (34 slides; AI as Double-Check) and SY2 (34 slides; AI as First Look). Within Block Y, the order of Y1 and Y2 is randomized. For each reader, SX, SY1, and SY2 are disjoint and stratified by APL status.
Readers first complete Block Y (AI-Assisted) on two assigned subsets: SY1 (34 slides; AI as Double-Check) and SY2 (34 slides; AI as First Look), with the order of Y1 and Y2 randomized. They then complete Block X (Unaided) on subset SX (34 slides). For each reader, SX, SY1, and SY2 are disjoint and stratified by APL status.
Harvard Medical School
Boston, Massachusetts, United States
Diagnostic performance of APL detection
Performance of clinicians (unaided and AI-assisted) in detecting APL, measured by accuracy, sensitivity, specificity, positive predictive value, and negative predictive value.
Time frame: Periprocedural (at the time of slide review)
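For reference (illustrative only, not part of the protocol), the five listed metrics follow directly from the per-reader confusion counts, where "positive" means an APL call; the helper below is a minimal sketch.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Standard binary diagnostic performance metrics from confusion counts
    (tp/fp/tn/fn = true/false positives and negatives)."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
    }
```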
Time to diagnosis
Average time (seconds per case) required to finalize a diagnosis.
Time frame: Periprocedural (at the time of slide review)
Inter-observer variability
Agreement among clinicians across conditions, measured using inter-rater reliability metrics (e.g., kappa statistics).
Time frame: Periprocedural (at the time of slide review)
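As an illustrative sketch (the protocol names kappa statistics as an example metric but does not prescribe an implementation), Cohen's kappa for two raters' binary APL calls can be computed as:

```python
def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters' binary (1 = APL, 0 = non-APL) calls
    on the same set of cases."""
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Expected chance agreement from each rater's marginal positive rate.
    pa = sum(ratings_a) / n
    pb = sum(ratings_b) / n
    expected = pa * pb + (1 - pa) * (1 - pb)
    return (observed - expected) / (1 - expected)
```

For more than two readers, a multi-rater statistic such as Fleiss' kappa would be the analogous choice.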
Concordance between AI predictions and clinicians' diagnoses
The proportion of cases in which AI predictions match clinicians' decisions in each study condition.
Time frame: Periprocedural (at the time of slide review)
Decision-change rates
The proportion of cases in which a clinician's initial diagnosis is revised after exposure to AI assistance.
Time frame: Periprocedural (at the time of slide review)
Net benefit after AI exposure
The overall change in diagnostic accuracy attributable to AI assistance, i.e., the rate of incorrect-to-correct revisions minus the rate of correct-to-incorrect revisions after AI exposure.
Time frame: Periprocedural (at the time of slide review)
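Illustratively (function name and interface are assumptions, not protocol specifications), a net-benefit rate of this kind can be computed from per-case pre-AI and post-AI correctness flags:

```python
def net_benefit(pre_correct, post_correct):
    """Net accuracy change after AI exposure: the rate of
    incorrect->correct revisions minus the rate of
    correct->incorrect revisions, over all reviewed cases."""
    n = len(pre_correct)
    gained = sum((not a) and b for a, b in zip(pre_correct, post_correct))
    lost = sum(a and (not b) for a, b in zip(pre_correct, post_correct))
    return (gained - lost) / n
```

A positive value indicates that AI assistance corrected more diagnoses than it degraded.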
Clinician confidence level
Self-reported diagnostic confidence recorded for each case on a 5-point scale: 5 - Absolutely Certain; 4 - Mostly Certain; 3 - Unsure; 2 - Very Doubtful; 1 - Random Guess (5 = highest confidence, 1 = lowest).
Time frame: Periprocedural (at the time of slide review)