The goal of this randomized controlled trial is to evaluate the role of large language models (LLMs) in enhancing laypeople's ability to self-diagnose and triage common diseases. The main questions it aims to answer are:

* Does using an LLM help participants make more accurate self-diagnoses and care decisions for common illnesses than their initial, unassisted guesses?
* How does human-LLM collaboration compare with using a regular search engine, with the LLM working alone, and with physicians' judgments?

Researchers will compare participants randomly assigned to either the LLM group (using DeepSeek) or the search engine group to determine whether the LLM-assisted approach leads to better clinical judgments.

Participants will:

* Read one of 48 short, realistic health vignettes;
* Make an initial guess about what might be wrong by listing up to three possible causes, ranked from most to least likely, and choose a care level: seek immediate care, see a doctor within one day, see a doctor within one week, or manage at home without medical care;
* Use their assigned tool (DeepSeek or a standard search engine) to look up information and update their guess and care decision;
* Submit their final diagnosis and care choice after using the tool.

In addition, the study team evaluated the performance of four other AI models (GPT-4o, GPT-o1, DeepSeek-v3, and DeepSeek-r1) and 33 experienced general physicians on the same vignettes.
Study Type
INTERVENTIONAL
Allocation
RANDOMIZED
Purpose
HEALTH_SERVICES_RESEARCH
Masking
SINGLE
Enrollment
6,360
Participants in this group used a large language model (DeepSeek) to search for medical information related to a clinical vignette after providing initial diagnostic and triage decisions. They were instructed to interact freely with the model to gather insights and then update their diagnoses and triage recommendations. The intervention simulates real-world use of AI tools for personal health decision-making.
Participants in this group used mainstream internet search engines (e.g., Baidu, Google, Bing) to look up information about the clinical vignette after making initial diagnostic and triage decisions. They were allowed to search freely but were not permitted to use any named AI chatbot or large language model platform. This group represents typical self-directed online health information-seeking behavior.
Tongji Medical College of Huazhong University of Science & Technology School of Medicine and Health Management
Wuhan, Hubei, China
Top-3 Diagnostic Accuracy
The primary diagnostic outcome was defined as the proportion of participants who included the correct diagnosis in their top three differential diagnoses after using the assigned tool (LLM or search engine). Accuracy was assessed for each of the 48 clinical vignettes and aggregated across all participants in each group.
Time frame: Immediately after intervention (within the same survey session)
Triage Accuracy (4-class exact match)
Triage accuracy was defined as the proportion of participants who selected the correct triage level (emergent care, within one day, within one week, or self-care) that matched the reference standard. There were 12 vignettes per triage category.
Time frame: Immediately after intervention (within the same survey session)
Top-1 Diagnostic Accuracy
The proportion of participants who selected the correct diagnosis as their top (first) diagnosis after using the assigned tool. This measures laypeople's ability to identify the correct diagnosis as their single most likely final judgment.
Time frame: Immediately after intervention (within the same survey session)
Triage Accuracy (2-class binary match)
Time frame: Immediately after intervention (within the same survey session)
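The accuracy metrics above (top-3, top-1, and exact-match triage) can be sketched in a few lines of Python. This is a minimal illustration, not the study's actual analysis code; the record structure and field names are assumptions for the example.

```python
# Hypothetical per-participant records; field names are illustrative only.
# "dx" = participant's ranked differential (up to 3), "triage" = chosen care level,
# "ref_dx"/"ref_triage" = the vignette's reference standard.
records = [
    {"dx": ["migraine", "tension headache"], "triage": "within one week",
     "ref_dx": "migraine", "ref_triage": "within one week"},
    {"dx": ["influenza", "common cold", "COVID-19"], "triage": "self-care",
     "ref_dx": "common cold", "ref_triage": "self-care"},
    {"dx": ["appendicitis"], "triage": "within one day",
     "ref_dx": "gastroenteritis", "ref_triage": "self-care"},
]

def top_k_accuracy(records, k):
    """Proportion of participants whose top-k diagnosis list contains the reference diagnosis."""
    hits = sum(r["ref_dx"] in r["dx"][:k] for r in records)
    return hits / len(records)

def triage_accuracy(records):
    """Proportion of participants whose triage choice exactly matches the reference standard."""
    hits = sum(r["triage"] == r["ref_triage"] for r in records)
    return hits / len(records)

print(top_k_accuracy(records, 3))  # top-3 diagnostic accuracy
print(top_k_accuracy(records, 1))  # top-1 diagnostic accuracy
print(triage_accuracy(records))    # 4-class exact-match triage accuracy
```

In the study, each proportion would be computed per vignette and then aggregated within each randomized group before comparing arms.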