This study will assess the impact of immediate access to a customized version of GPT-4, a large language model, on performance in case-based diagnostic reasoning tasks. Specifically, it will compare this approach with a two-step process in which participants first use conventional diagnostic decision support tools to work through each case before gaining access to the customized GPT-4 model.
Artificial intelligence (AI) technologies, particularly advanced large language models such as OpenAI's ChatGPT, have the potential to enhance medical decision-making. Although ChatGPT-4 was not specifically designed for medical applications, it has shown promise in various healthcare contexts, including medical note-writing, answering patient inquiries, and facilitating medical consultations. Its impact on clinicians' diagnostic reasoning, however, remains largely unknown.

Clinical reasoning is a complex process involving pattern recognition, knowledge application, and probabilistic reasoning. Integrating AI tools such as ChatGPT-4 into physician workflows could reduce clinician workload and decrease the likelihood of missed diagnoses. However, ChatGPT-4 was neither developed nor validated for diagnostic reasoning, and it may produce misleading information, including plausible but incorrect conclusions that could misguide clinicians. If not used appropriately, it may fail to improve, and could even hinder, clinical decision-making. It is therefore essential to study how clinicians use large language models to support clinical reasoning before integrating them into routine patient care.

This study will examine how immediate access to a customized version of ChatGPT-4 affects performance on case-based diagnostic reasoning tasks, compared with a stepwise approach. In the stepwise approach, participants will first use conventional diagnostic decision support tools to reason through each case before interacting with the customized ChatGPT-4 model, at which point they will have the opportunity to revise their initial answers. Participants will be randomized into the study arms and will respond to diagnostic cases by providing three differential diagnoses, along with supporting and opposing findings for each. They will also identify their top diagnosis and propose next diagnostic steps. Independent reviewers, blinded to treatment assignment, will evaluate the responses.
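For illustration only, the structured response each participant submits per case could be represented as follows; this is a minimal Python sketch, and the class and field names are ours, not part of the protocol:

    from dataclasses import dataclass, field

    @dataclass
    class Differential:
        """One entry in a participant's three-item differential."""
        diagnosis: str
        supporting_findings: list[str] = field(default_factory=list)
        opposing_findings: list[str] = field(default_factory=list)

    @dataclass
    class CaseResponse:
        """A participant's structured answer to one diagnostic case."""
        participant_id: str
        case_id: str
        arm: str                           # e.g., "gpt4_immediate" or "stepwise"
        differentials: list[Differential]  # three entries per the protocol
        top_diagnosis: str
        next_steps: list[str]              # up to three proposed evaluation steps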
Study Type
INTERVENTIONAL
Allocation
RANDOMIZED
Purpose
DIAGNOSTIC
Masking
SINGLE
Enrollment
70
Participants in this arm are given immediate access to a customized version of GPT-4 to support their diagnostic reasoning on each case.
Participants in this arm are first encouraged to reason through each diagnostic case with the support of conventional resources. After submitting their answers for a case, they are given access to a customized version of GPT-4 and have the opportunity to revise their initial answers.
Stanford University
Palo Alto, California, United States
Diagnostic Reasoning
The primary outcome will be the percentage of correct responses per case (range: 0 to 100). For each case, participants will be asked to provide their top three differential diagnoses, along with supporting and opposing findings for each. They will receive 1 point for each plausible diagnosis. Supporting and opposing findings will be graded based on correctness, with 1 point for a partially correct response and 2 points for a completely correct response. Participants will then select their top diagnosis, earning 1 point for a reasonable choice and 2 points for the most accurate diagnosis. Finally, they will list up to three next steps for further patient evaluation, with 1 point awarded for a partially correct response and 2 points for a completely correct response. The primary outcome will be analyzed at the case level, comparing performance between the randomized study groups.
Time frame: Through study completion, an average of 6 months
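Purely as an illustration of how the rubric above could be converted into a per-case percentage, a minimal Python sketch follows. The function and argument names are ours, and the protocol does not state how many supporting and opposing findings are graded, so one of each per differential is assumed:

    def case_score_percent(diagnosis_pts, finding_pts, top_dx_pts, next_step_pts):
        """Convert rubric points for one case into a 0-100 percentage.

        diagnosis_pts: 1 point per plausible differential (max 3)
        finding_pts:   0-2 points per graded finding; assuming one supporting
                       and one opposing finding per differential, max 12
        top_dx_pts:    1 = reasonable choice, 2 = most accurate (max 2)
        next_step_pts: 0-2 points per next step, up to three steps (max 6)
        """
        earned = diagnosis_pts + finding_pts + top_dx_pts + next_step_pts
        max_points = 3 + 12 + 2 + 6  # 23 under the assumptions above
        return 100 * earned / max_points

    # Example: a mid-range performance on one case.
    print(f"{case_score_percent(2, 7, 1, 4):.1f}%")  # -> 60.9%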
Time Spent Per Case
The investigators will compare the average time (in minutes) participants spend on each case across the two study arms.
Time frame: Through study completion, an average of 6 months
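The protocol does not specify the statistical test for this comparison. Purely as an illustration, a minimal sketch assuming per-case completion times in minutes and a Welch's t-test (all data values are hypothetical):

    from scipy import stats

    # Hypothetical per-case completion times in minutes for each arm.
    gpt4_immediate = [18.2, 22.5, 15.0, 27.3, 19.8]
    stepwise = [25.1, 30.4, 21.7, 28.9, 24.6]

    # Welch's t-test does not assume equal variances between arms.
    t_stat, p_value = stats.ttest_ind(gpt4_immediate, stepwise, equal_var=False)
    print(f"t = {t_stat:.2f}, p = {p_value:.3f}")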
Prompt Frequency
The investigators will compare the frequency of participant prompts to the customized GPT-4 model between the two study groups.
Time frame: Through study completion, an average of 6 months
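As above, the test is not specified; because prompt counts are discrete and often skewed, a rank-based comparison is one plausible choice. A minimal sketch with hypothetical counts:

    from scipy import stats

    # Hypothetical number of prompts sent to the customized GPT-4 per case.
    prompts_immediate = [4, 7, 3, 9, 5, 6]
    prompts_stepwise = [2, 3, 5, 1, 4, 3]

    # Mann-Whitney U: a rank-based test suited to skewed count data.
    u_stat, p_value = stats.mannwhitneyu(prompts_immediate, prompts_stepwise)
    print(f"U = {u_stat:.1f}, p = {p_value:.3f}")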
Sentiment
The investigators will compare the tone and sentiment of participant prompts to the customized GPT-4 model across the two study groups. The investigators will create a qualitative coding system to categorize the nature of the participants' prompts.
Time frame: Through study completion, an average of 6 months
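Qualitative coding systems are typically checked for inter-rater agreement. A minimal sketch, assuming two coders and hypothetical category labels (the actual coding system will be developed by the investigators):

    from sklearn.metrics import cohen_kappa_score

    # Hypothetical category labels assigned to the same prompts by two coders.
    coder_a = ["question", "command", "verification", "question", "social"]
    coder_b = ["question", "command", "question", "question", "social"]

    # Cohen's kappa corrects raw agreement for chance agreement.
    kappa = cohen_kappa_score(coder_a, coder_b)
    print(f"kappa = {kappa:.2f}")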
Participant Perceptions of AI in Clinical Reasoning
This outcome will be assessed in both study arms and will encompass changes in attitudes, confidence, and willingness to use AI diagnostic tools before and after exposure to the customized tool. The investigators will assess the number of participants who were open to using AI to help with complex clinical reasoning (pre- and post-quiz), whether they enjoyed working with the AI diagnostic tool, whether they felt the tool provided a valuable collaborative experience for clinical reasoning, whether seeing the AI diagnostic tool's recommendations increased their confidence in their differential diagnoses, and whether they would use an AI diagnostic tool like the one in this study in their daily job. These items will be evaluated on a Likert scale ranging from strongly disagree to strongly agree.
Time frame: Through study completion, an average of 6 months
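One plausible pre/post analysis for the paired Likert items above, sketched with an assumed 1-5 coding (1 = strongly disagree, 5 = strongly agree) and hypothetical responses:

    from scipy import stats

    # Hypothetical 1-5 Likert scores ("open to using AI for complex clinical
    # reasoning") from the same participants before and after the study.
    pre = [2, 3, 3, 4, 2, 3, 4, 2]
    post = [4, 4, 3, 5, 3, 4, 4, 3]

    # Wilcoxon signed-rank test: a paired test appropriate for ordinal data.
    w_stat, p_value = stats.wilcoxon(pre, post)
    print(f"W = {w_stat:.1f}, p = {p_value:.3f}")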
Customized GPT-4's Diagnostic Reasoning
The customized GPT-4 model's 'independent' diagnoses will be assessed for accuracy. The outcome will be the percentage of correct responses per case (range: 0 to 100). For each case, the meta-prompt directs the customized GPT-4 to provide its top three differential diagnoses, along with supporting and opposing findings for each, a final diagnosis, and next steps. The customized GPT-4 will receive 1 point for each plausible diagnosis. Supporting and opposing findings will be graded for correctness, with 1 point for a partially correct response and 2 points for a completely correct response. Its top diagnosis will earn 1 point for a reasonable choice and 2 points for the most accurate diagnosis. Finally, it will list up to three next steps for further patient evaluation, with 1 point awarded for a partially correct response and 2 points for a completely correct response. The outcome will be analyzed at the case level, comparing the model's performance with the randomized study groups' scores.
Time frame: Through study completion, an average of 6 months
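The study's meta-prompt is not published, and the customization may have been built within ChatGPT rather than through the API; purely to illustrate the mechanism of eliciting the model's 'independent' structured answer, a hypothetical sketch using the OpenAI Python client (the prompt wording and model name are assumptions, not the study's):

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # Hypothetical meta-prompt; the study's actual instructions are not public.
    META_PROMPT = (
        "You are assisting with a diagnostic reasoning exercise. For the case "
        "below, list your top three differential diagnoses with supporting and "
        "opposing findings for each, state your single most likely diagnosis, "
        "and propose up to three next diagnostic steps."
    )

    def grade_case_independently(case_text: str) -> str:
        """Elicit the model's structured answer to one case, with no user input."""
        response = client.chat.completions.create(
            model="gpt-4",  # placeholder; the study used a customized GPT-4
            messages=[
                {"role": "system", "content": META_PROMPT},
                {"role": "user", "content": case_text},
            ],
        )
        return response.choices[0].message.content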