Clinical decision support tools powered by artificial intelligence are being rapidly integrated into medical practice. Two leading systems currently available to clinicians are OpenEvidence, which uses retrieval-augmented generation to access medical literature, and GPT-4, a large language model. While both tools show promise, their relative effectiveness in supporting clinical decision-making has not been directly compared. This study aims to evaluate how these tools influence diagnostic reasoning and management decisions among internal medicine physicians.
Internal medicine attendings and residents are invited to participate in a study investigating how physicians using a retrieval-augmented generation (RAG) based LLM (OpenEvidence) perform compared to those using a general-purpose LLM (ChatGPT) on both diagnostic reasoning and complex management decisions. As AI tools increasingly enter clinical practice, evidence is needed about which approaches best support physician decision-making. This study will help determine whether specialized medical knowledge retrieval systems (OpenEvidence) provide advantages over general AI assistants (ChatGPT) when solving real clinical cases. Participants will complete one 90-minute Zoom session in which they will solve clinical cases derived from real, de-identified patient encounters. Participants will be randomly assigned to use either OpenEvidence or ChatGPT, and all responses will be evaluated by blinded scorers using a validated rubric. Note that this exempted study will compare OpenEvidence (as opposed to Clinical Key AI) vs ChatGPT, although the official study title suggests otherwise.
Study Type
INTERVENTIONAL
Allocation
RANDOMIZED
Purpose
OTHER
Masking
SINGLE
Enrollment
27
A medical information platform that uses retrieval-augmented generation to access medical literature.
A chatbot application that uses GPT-4, a large language model, to engage in conversational interactions with users.
Harvard Beth Israel Deaconess Medical Center
Boston, Massachusetts, United States
Montefiore Medical Center
The Bronx, New York, United States
Clinical Reasoning Performance as determined by Rater Scores
Clinical reasoning performance will be evaluated based on rater scores assigned to participants' responses to the administered surveys. Six blinded, trained raters will independently score each participant's response using a validated scoring rubric. Scores range from 0-100%, with higher scores indicating stronger clinical reasoning performance. Results for each assessment will be summarized by study arm using basic descriptive statistics and analyzed using mixed-effects models to account for within-subject correlation and between-subject factors.
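As a minimal sketch of the first analysis step described above (per-arm descriptive statistics of rater scores), the following uses Python's standard library with entirely made-up illustrative scores; arm names match the study, but the numbers and variable names are assumptions, not study data.

```python
import statistics

# Hypothetical rater scores (0-100%) per study arm -- illustrative only.
scores = {
    "OpenEvidence": [72.0, 85.5, 64.0, 78.0],
    "ChatGPT": [70.0, 81.0, 60.5, 75.5],
}

# Summarize each arm with basic descriptive statistics (n, mean, SD),
# as the analysis plan describes.
summary = {
    arm: {
        "n": len(vals),
        "mean": round(statistics.mean(vals), 2),
        "sd": round(statistics.stdev(vals), 2),
    }
    for arm, vals in scores.items()
}

for arm, s in summary.items():
    print(f"{arm}: n={s['n']}, mean={s['mean']}, sd={s['sd']}")
```

The subsequent mixed-effects analysis would typically be run in a statistical package; in Python, for example, `statsmodels.formula.api.mixedlm` can fit a model of score on study arm with participant as the grouping (random-effect) factor.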
Time frame: 15 minutes upon completion of cases, up to approximately 90 minutes total
Time efficiency
Time efficiency will be assessed as the time participants take to complete the surveys. Each survey will be automatically time-stamped to record how long each participant needs to answer each case. Results for the virtual session will be summarized by study arm using basic descriptive statistics and analyzed.
Time frame: Up to approximately 75 minutes
Decision confidence
Decision confidence will be determined by asking participants to rate their confidence in each survey answer on a scale from 1 to 5 (1 = least confident, 5 = most confident), with higher scores indicating greater confidence. Scores will be summarized by study arm using basic descriptive statistics.
Time frame: 15 minutes upon completion of cases, up to approximately 90 minutes total