This multicenter retrospective study aims to evaluate the diagnostic and therapeutic performance of three large language models (ChatGPT, Gemini, and DeepSeek) using 800 archived inpatient medical records from urology departments across four tertiary hospitals. The study will focus on the accuracy and applicability of these models in disease recognition, preliminary diagnosis, and treatment recommendation generation, in order to explore their potential value and limitations in supporting clinical decision-making in real-world settings.
Study Type
OBSERVATIONAL
Enrollment
800
De-identified inpatient medical records were retrospectively collected from the urology departments of four tertiary hospitals (200 cases per site, 800 in total). Each case included standardized clinical information such as demographics, chief complaint, history of present illness, past medical history, physical examination, laboratory and imaging findings, discharge diagnosis and treatment plan. To simulate the role of an AI system in a "first-visit physician" scenario, all diagnostic conclusions, differential diagnoses and treatment plans were removed from each record before it was input into the models. Three large language models (ChatGPT, Gemini and DeepSeek) were prompted with a standardized instruction: "Based on the above clinical information, provide your preliminary diagnosis, differential diagnoses and treatment recommendations." Each model generated outputs including (i) primary and secondary diagnoses, (ii) differential diagnosis lists with reasoning and (iii) preliminary treatment suggestions.
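A minimal sketch of this prompting workflow is given below, assuming a generic wrapper around each vendor's API. The registry entry does not describe the tooling used, so the function names (build_prompt, query_model, run_all_models) and the field names of the case record are hypothetical illustrations, not the study's actual implementation.

```python
# Illustrative sketch only; query_model is a hypothetical placeholder for a
# vendor-specific API call (the study does not specify how ChatGPT, Gemini
# or DeepSeek were accessed).

STANDARD_INSTRUCTION = (
    "Based on the above clinical information, provide your preliminary "
    "diagnosis, differential diagnoses and treatment recommendations."
)

def build_prompt(case: dict) -> str:
    """Assemble de-identified clinical fields into a single prompt.

    Diagnostic conclusions, differential diagnoses and treatment plans are
    assumed to have already been removed from `case`.
    """
    fields = [
        "demographics", "chief_complaint", "history_of_present_illness",
        "past_medical_history", "physical_examination",
        "laboratory_findings", "imaging_findings",
    ]
    clinical_text = "\n".join(f"{f}: {case.get(f, '')}" for f in fields)
    return f"{clinical_text}\n\n{STANDARD_INSTRUCTION}"

def query_model(model_name: str, prompt: str) -> str:
    """Hypothetical placeholder for a vendor-specific API call."""
    raise NotImplementedError

def run_all_models(case: dict, models=("ChatGPT", "Gemini", "DeepSeek")) -> dict:
    """Send the same standardized prompt to each model and collect outputs."""
    prompt = build_prompt(case)
    return {m: query_model(m, prompt) for m in models}
```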
The First Affiliated Hospital of Fujian Medical University
Fuzhou, China
RECRUITING
Diagnostic Accuracy: Assessed by Top-1 accuracy
Top-1: Proportion of cases where the model's first diagnosis matches the true primary diagnosis.
Time frame: Through study completion, an average of 3 months
Diagnostic Accuracy: Assessed by Top-3 accuracy
Top-3: Proportion of cases where the true diagnosis appears in the model's top three diagnoses.
Time frame: Through study completion, an average of 3 months
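The Top-1 and Top-3 accuracy definitions above reduce to the same top-k calculation. The sketch below illustrates it; the matching rule is an assumption, since the registry entry does not state how a model's diagnosis string is judged to match the reference primary diagnosis.

```python
# Minimal sketch of the Top-1 / Top-3 accuracy calculations.
# `matches` is a hypothetical helper; a case-insensitive exact string match
# is assumed here in place of whatever adjudication rule the study uses.

def matches(predicted: str, reference: str) -> bool:
    """Placeholder match rule between a model diagnosis and the reference."""
    return predicted.strip().lower() == reference.strip().lower()

def top_k_accuracy(cases: list[dict], k: int) -> float:
    """Proportion of cases whose true primary diagnosis appears among the
    model's top-k ranked diagnoses."""
    hits = sum(
        any(matches(d, case["primary_diagnosis"]) for d in case["model_diagnoses"][:k])
        for case in cases
    )
    return hits / len(cases)

# Usage with a hypothetical case structure:
# top1 = top_k_accuracy(cases, k=1)
# top3 = top_k_accuracy(cases, k=3)
```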
Diagnostic Completeness
Proportion of the model's diagnoses that overlap with all diagnoses (primary and secondary) in the case.
Time frame: Through study completion, an average of 3 months
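A sketch of the completeness measure follows. It takes the registry wording literally (the share of the model's diagnoses that overlap with the case's primary and secondary diagnoses); a coverage-based reading, in which the denominator is instead the number of diagnoses documented in the case, would be a small variation on the same code. The matching rule is again an assumption.

```python
# Sketch of the diagnostic completeness measure as worded in the registry
# entry; case-insensitive exact string matching is assumed.

def diagnostic_completeness(model_dx: list[str], case_dx: list[str]) -> float:
    """Fraction of the model's diagnoses that also appear among the case's
    primary and secondary diagnoses."""
    if not model_dx:
        return 0.0
    case_set = {d.strip().lower() for d in case_dx}
    overlap = sum(1 for d in model_dx if d.strip().lower() in case_set)
    return overlap / len(model_dx)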
Differential Diagnosis Quality
Evaluated by experts using a 5-point Likert scale, considering factors such as common disease coverage, logical clarity, and specificity.
Time frame: Through study completion, an average of 3 months
Treatment Plan Quality
Assesses whether the model's treatment suggestions align with clinical guidelines, scored by experts on completeness, appropriateness, and safety.
Time frame: Through study completion, an average of 3 months
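Both expert-rated outcomes (differential diagnosis quality and treatment plan quality) use a 5-point Likert scale. The sketch below shows one way the per-case ratings might be aggregated; the number of raters and the aggregation rule are not specified in the registry entry, so the simple mean used here is an assumption.

```python
# Sketch of aggregating expert Likert ratings for a single case/model output.
# A plain mean across raters is assumed; the study may use a different rule.

def mean_likert(scores: list[int]) -> float:
    """Average of expert ratings on the 1-5 Likert scale for one case."""
    assert all(1 <= s <= 5 for s in scores), "Likert scores must be between 1 and 5"
    return sum(scores) / len(scores)

# Usage with hypothetical ratings from three experts:
# differential_quality = mean_likert([4, 5, 4])
# treatment_quality = mean_likert([3, 4, 4])
```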
Analysis Time
Time taken by the AI model to provide diagnoses and treatment suggestions (in seconds), reflecting real-time capability.
Time frame: Through study completion, an average of 3 months
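The analysis-time outcome can be measured by timing each query from prompt submission to complete response. The sketch below assumes simple wall-clock timing around a model call; the query function is passed in as a callable because the actual API access used in the study is not described.

```python
# Sketch of measuring analysis time (seconds per model response).
# Wall-clock timing around a hypothetical query callable is assumed.

import time
from typing import Callable

def timed_query(query_fn: Callable[[str], str], prompt: str) -> tuple[str, float]:
    """Return the model output together with elapsed wall-clock seconds."""
    start = time.perf_counter()
    output = query_fn(prompt)
    elapsed = time.perf_counter() - start
    return output, elapsed
```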