A parallel-group randomized controlled trial using a superiority framework. Clinical vignettes will be used to assess the impact of a large language model on physicians' clinical reasoning. Quantitative analyses will be performed on graded vignette responses.
This study is a multi-country, parallel-group randomized controlled trial designed to evaluate whether access to a large language model (LLM) improves physician clinical decision-making. The trial uses a superiority framework and compares physicians randomized to complete standardized clinical vignettes either with access to GPT-4o or without any AI assistance. The clinical vignettes simulate common primary care presentations, including cardiovascular, respiratory, musculoskeletal, fatigue-related, and infectious conditions. Each vignette includes multiple steps in the clinical reasoning process, from initial history-taking to diagnosis, treatment, and follow-up. Physician responses are graded using rubrics developed from evidence-based, context-specific best-practice guidelines. The study is conducted across three countries (Indonesia, Kenya, and the Netherlands), representing different income levels and health system contexts. The primary outcome is performance on clinical vignettes, defined as adherence to best-practice guidelines. Secondary objectives include examining cross-country variation in physician performance, variation in performance distributions, and the role of engagement with the LLM in shaping outcomes.
Study Type
INTERVENTIONAL
Allocation
RANDOMIZED
Purpose
DIAGNOSTIC
Masking
SINGLE
Enrollment
249
GPT-4o provided via an iFrame in the online Qualtrics environment
Universitas Indonesia
Jakarta, Indonesia
Aga Khan University Hospital
Nairobi, Kenya
Maastricht University
Maastricht, Netherlands
Percentage Correct Score
Following Peabody et al. (2000), the primary outcome is a percentage correct score across all steps in a vignette. It is generated by dividing the weighted sum of rubric items assessed as present by the total number of rubric items possible in the vignette. Rubric items are weighted with regard to their relevance by our expert panel.
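The scoring rule above can be sketched in a few lines. This is a minimal illustration, not the study's actual grading code; the representation of rubric items as (weight, present) pairs and the use of the item count as the denominator are assumptions based on the description above.

```python
def percentage_correct(items):
    """Percentage correct score for one vignette.

    items: list of (weight, present) tuples, one per rubric item,
    where `present` indicates the item was assessed as present.
    Assumption: denominator is the number of rubric items possible.
    """
    weighted_present = sum(w for w, present in items if present)
    return 100 * weighted_present / len(items)

# Hypothetical vignette with three rubric items, two marked present:
score = percentage_correct([(1, True), (0.5, True), (0.33, False)])
# 100 * (1 + 0.5) / 3 = 50.0
```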
Time frame: During evaluation
Quality Per Answer
This outcome is generated as the average weight of rubric items assessed as present across vignettes. As each item is assigned a weight (0.33, 0.5, or 1), the average weight is the sum of the weights of items marked as present divided by the number of answers marked as present.
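As a hedged sketch of this secondary outcome, using the same hypothetical (weight, present) representation as above:

```python
def quality_per_answer(items):
    """Average weight of rubric items assessed as present.

    items: list of (weight, present) tuples for one vignette.
    Assumption: only items marked present enter numerator and denominator.
    """
    present_weights = [w for w, present in items if present]
    return sum(present_weights) / len(present_weights)

# Hypothetical example: weights 1 and 0.5 marked present, 0.33 absent:
quality = quality_per_answer([(1, True), (0.5, True), (0.33, False)])
# (1 + 0.5) / 2 = 0.75
```

A higher value indicates that the answers given were, on average, the more relevant ones as judged by the expert panel's weights.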
Time frame: During evaluation
Number of Answers
This outcome is generated as the count of the total number of answers assessed as present by reviewers per vignette.
Time frame: During evaluation
Less obvious answers
This outcome is generated as the number of answers given that are less obvious, i.e., mentioned less frequently by the control group. An answer is considered less obvious if it is mentioned by 25% or fewer of control-group physicians.
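The 25% threshold rule can be illustrated as follows. The mapping from answers to control-group mention frequencies, and the inclusive treatment of exactly 25%, are assumptions for this sketch rather than details from the protocol.

```python
def less_obvious_count(physician_answers, control_frequency, threshold=0.25):
    """Count answers mentioned by <= `threshold` of the control group.

    physician_answers: list of answer labels given by one physician.
    control_frequency: dict mapping answer label -> fraction of the
    control group that mentioned it (unseen answers default to 0.0).
    """
    return sum(
        1 for answer in physician_answers
        if control_frequency.get(answer, 0.0) <= threshold
    )

# Hypothetical frequencies: "pericarditis" is rarely mentioned (10%),
# "GERD" is common (60%), so only the first counts as less obvious:
count = less_obvious_count(
    ["pericarditis", "GERD"],
    {"pericarditis": 0.10, "GERD": 0.60},
)
# count == 1
```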
Time frame: During evaluation