This retrospective diagnostic accuracy study evaluates the ability of two large language models (LLMs), GPT-4o (gpt-4o-2024-11-20; OpenAI) and Claude 4.6 Sonnet (claude-sonnet-4-6; Anthropic), to generate correct diagnoses from anonymized Turkish-language emergency department (ED) anamnesis notes, and compares their performance with the diagnosis entered by the treating emergency physician. A consensus gold standard is established by three independent board-certified emergency medicine specialists who review each note blinded and vote on the primary diagnosis using ICD-10 three-character codes; the majority vote (at least 2 of 3 specialists agreeing) constitutes the reference standard. Both LLMs are evaluated using a standardized zero-shot direct prompting strategy (temperature=0, stateless API sessions). The primary outcomes are diagnostic accuracy (proportion of ICD-10 chapter-level matches) and Cohen's kappa for each LLM against the gold standard. Secondary outcomes include top-3 accuracy, treating physician accuracy, inter-model agreement, and subgroup analyses by ESI triage level and ICD-10 chapter. Inter-rater reliability among the three specialists is quantified using Fleiss' kappa. Analyses are performed in Jamovi. This study represents the first evaluation of LLM diagnostic accuracy using Turkish-language clinical notes and the first to benchmark LLM performance against an independent three-specialist majority-vote gold standard rather than against the treating physician's own diagnosis.
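For orientation, the Fleiss' kappa used for inter-rater reliability among the three specialists can be sketched in a few lines of Python. This is an illustrative sketch only: the study's analyses are performed in Jamovi, and the function name `fleiss_kappa` and the sample ratings below are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def fleiss_kappa(ratings: Sequence[Sequence[Hashable]]) -> float:
    """Fleiss' kappa for cases that were each rated by the same number of raters.

    `ratings` holds one inner sequence per case, containing one label per rater.
    """
    n_cases = len(ratings)
    if n_cases == 0:
        raise ValueError("at least one case is required")
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(case) != n_raters for case in ratings):
        raise ValueError("every case needs the same number (>= 2) of ratings")

    category_totals: Counter = Counter()
    per_case_agreement = 0.0
    for case in ratings:
        counts = Counter(case)
        category_totals.update(counts)
        # Share of agreeing rater pairs for this case.
        agreeing = sum(c * c for c in counts.values()) - n_raters
        per_case_agreement += agreeing / (n_raters * (n_raters - 1))

    p_bar = per_case_agreement / n_cases  # mean observed agreement
    total = n_cases * n_raters
    p_e = sum((t / total) ** 2 for t in category_totals.values())  # chance agreement
    if p_e == 1.0:
        return float("nan")  # undefined when only one category is ever used
    return (p_bar - p_e) / (1 - p_e)


# Hypothetical ratings: four cases, three specialists each.
example = [
    ["I21", "I21", "I21"],
    ["J18", "J18", "J44"],
    ["R07", "I21", "R07"],
    ["J18", "J18", "J18"],
]
print(round(fleiss_kappa(example), 3))  # 0.51
```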
STUDY DESIGN: Retrospective diagnostic accuracy study, STARD-AI 2025 reporting, single center, cohort design.
AI INDEX TESTS: (1) GPT-4o (model version gpt-4o-2024-11-20; OpenAI API). (2) Claude 4.6 Sonnet (model version claude-sonnet-4-6; Anthropic API). Both accessed via Python (Google Colab). Temperature=0 for reproducibility. Zero-shot, stateless sessions - no cross-case context. No task-specific fine-tuning or additional training applied; models used as-is via API.
MODEL INTERPRETABILITY: Model interpretability analyses (such as SHAP, Grad-CAM, or layer-attribute visualizations) are not applicable to this study. Because GPT-4o and Claude 4.6 Sonnet are accessed as black-box models through proprietary, closed-source commercial APIs, internal model weights, gradients, and attention architectures are structurally inaccessible for post-hoc interpretability computations.
REFERENCE STANDARD: Three board-certified emergency medicine specialists independently evaluate each anonymized note, blinded to the original physician diagnosis and to each other. Primary diagnosis assigned by at least 2/3 specialists (majority vote) constitutes the gold standard. A 5-case calibration session precedes the main evaluation.
DATA PRIVACY: All anamnesis notes are fully de-identified (name, ID number, date of birth, physician name removed) prior to processing. De-identified notes are stored in a password-protected encrypted database. Only de-identified text is transmitted to LLM APIs - no personal health data. Compliant with Turkish Personal Data Protection Law (KVKK No. 6698).
PATIENT AND PUBLIC INVOLVEMENT: Not applicable. This retrospective study uses fully anonymized existing records; no patient or public involvement in design or conduct.
DATA SHARING: Anonymized dataset will be shared via Zenodo upon article acceptance. Statistical analysis code (Jamovi project files and Python prompt scripts) will be available on GitHub.
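The majority-vote rule described under REFERENCE STANDARD can be sketched as follows. This is a minimal illustration, not the study's code: the function name `majority_vote` is hypothetical, and returning `None` when all three specialists disagree is an assumption, since the record does not state how such cases are handled.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(codes: Sequence[str], min_agree: int = 2) -> Optional[str]:
    """Return the ICD-10 code chosen by at least `min_agree` raters, else None.

    Assumption: cases without a majority yield None; the record does not
    specify a tie-breaking procedure.
    """
    if not codes:
        return None
    normalized = [code.strip().upper() for code in codes]
    code, count = Counter(normalized).most_common(1)[0]
    return code if count >= min_agree else None


print(majority_vote(["I21", "I21", "R07"]))  # I21
print(majority_vote(["I21", "J18", "R07"]))  # None
```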
Study Type
OBSERVATIONAL
Enrollment
600
Marmara University Pendik Training and Research Hospital
Istanbul, Istanbul, Turkey (Türkiye)
Diagnostic Accuracy of GPT-4o for ICD-10 Chapter-Level Diagnosis
Proportion of cases in which GPT-4o primary (rank 1) diagnosis matches the 3-specialist majority-vote gold standard at the ICD-10 chapter level (22 categories). Range: 0 to 1.00.
Time frame: At the time of single-session algorithmic evaluation (each case evaluated once following data extraction in June 2026).
Diagnostic Accuracy of Claude 4.6 Sonnet for ICD-10 Chapter-Level Diagnosis
Proportion of cases in which Claude 4.6 Sonnet primary (rank 1) diagnosis matches the 3-specialist majority-vote gold standard at the ICD-10 chapter level (22 categories). Range: 0 to 1.00.
Time frame: At the time of single-session algorithmic evaluation (each case evaluated once following data extraction in June 2026).
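As an illustration of how a chapter-level match could be scored, the sketch below maps three-character ICD-10 codes to chapter numbers and computes the proportion of matching cases. The chapter ranges are written from the WHO ICD-10 chapter structure and should be verified against the edition actually used; the function names and sample codes are hypothetical, and treating unmappable codes as non-matches is an assumption.

```python
from typing import Optional, Sequence

# (first code, last code, chapter number). Assumed WHO ICD-10 ranges;
# verify against the ICD-10 edition used in the study.
CHAPTER_RANGES = [
    ("A00", "B99", 1), ("C00", "D48", 2), ("D50", "D89", 3), ("E00", "E90", 4),
    ("F00", "F99", 5), ("G00", "G99", 6), ("H00", "H59", 7), ("H60", "H95", 8),
    ("I00", "I99", 9), ("J00", "J99", 10), ("K00", "K93", 11), ("L00", "L99", 12),
    ("M00", "M99", 13), ("N00", "N99", 14), ("O00", "O99", 15), ("P00", "P96", 16),
    ("Q00", "Q99", 17), ("R00", "R99", 18), ("S00", "T98", 19), ("V01", "Y98", 20),
    ("Z00", "Z99", 21), ("U00", "U85", 22),
]


def icd10_chapter(code: str) -> Optional[int]:
    """Map an ICD-10 code to its chapter number, or None if it cannot be mapped."""
    key = code.strip().upper()[:3]
    if len(key) != 3 or not key[0].isalpha() or not key[1:].isdigit():
        return None
    for first, last, chapter in CHAPTER_RANGES:
        if first <= key <= last:  # same letter-digit-digit shape, so string order works
            return chapter
    return None


def chapter_accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Proportion of cases whose predicted and gold codes share a chapter."""
    if not gold or len(predicted) != len(gold):
        raise ValueError("predicted and gold must be non-empty and equally long")
    hits = 0
    for pred_code, gold_code in zip(predicted, gold):
        pred_chapter = icd10_chapter(pred_code)
        # Assumption: an unmappable prediction counts as a miss.
        if pred_chapter is not None and pred_chapter == icd10_chapter(gold_code):
            hits += 1
    return hits / len(gold)


# Hypothetical rank-1 predictions and gold-standard codes.
predicted = ["I21", "J18", "S52", "R07"]
gold = ["I25", "J44", "T14", "K35"]
print(chapter_accuracy(predicted, gold))  # 0.75
```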
Cohen's Kappa Between GPT-4o Primary Diagnosis and Gold Standard
Kappa coefficient measuring agreement between GPT-4o rank-1 ICD-10 chapter and the 3-specialist gold standard. Interpreted per Landis & Koch (1977): <=0.20 slight; 0.21-0.40 fair; 0.41-0.60 moderate; 0.61-0.80 substantial; >0.80 almost perfect. Range: -1.00 to 1.00.
Time frame: At the time of algorithmic evaluation (June-July 2026)
Cohen's Kappa Between Claude 4.6 Sonnet Primary Diagnosis and Gold Standard
Kappa coefficient measuring agreement between Claude 4.6 Sonnet rank-1 ICD-10 chapter and the 3-specialist gold standard. Interpreted per Landis & Koch (1977): <=0.20 slight; 0.21-0.40 fair; 0.41-0.60 moderate; 0.61-0.80 substantial; >0.80 almost perfect. Range: -1.00 to 1.00.
Time frame: At the time of algorithmic evaluation (June-July 2026)
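The kappa outcomes above can be illustrated with a short Python sketch that computes Cohen's kappa for two label sequences and applies the interpretation bands exactly as listed in this record. The study itself runs its analyses in Jamovi; the function names and the sample chapter numbers are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Unweighted Cohen's kappa for two equally long label sequences."""
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("inputs must be non-empty and equally long")
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement from the two marginal distributions.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return float("nan")  # undefined when both raters use a single category
    return (observed - expected) / (1 - expected)


def landis_koch(kappa: float) -> str:
    """Interpretation bands as stated in the outcome description."""
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


# Hypothetical ICD-10 chapter numbers for ten cases.
model_chapters = [9, 9, 10, 10, 18, 18, 19, 19, 11, 11]
gold_chapters = [9, 9, 10, 18, 18, 18, 19, 19, 11, 10]
kappa = cohens_kappa(model_chapters, gold_chapters)
print(round(kappa, 3), landis_koch(kappa))  # 0.75 substantial
```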
Top-3 Diagnostic Accuracy of GPT-4o
Proportion of cases in which the 3-specialist gold standard diagnosis appears within GPT-4o's ranked list of three differential diagnoses. Range: 0 to 1.00.
Time frame: At the time of algorithmic evaluation (June-July 2026)
Top-3 Diagnostic Accuracy of Claude 4.6 Sonnet
Proportion of cases in which the 3-specialist gold standard diagnosis appears within Claude 4.6 Sonnet's ranked list of three differential diagnoses. Range: 0 to 1.00.
Time frame: At the time of algorithmic evaluation (June-July 2026)
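A top-3 accuracy calculation of the kind described above might look like the following sketch. The record does not state whether a top-3 match is judged at the three-character code level or at the chapter level; this example assumes a match on the normalized three-character code, and the function name and sample data are hypothetical.

```python
from typing import Sequence


def top_k_accuracy(
    ranked_lists: Sequence[Sequence[str]], gold: Sequence[str], k: int = 3
) -> float:
    """Proportion of cases whose gold code appears in the first k ranked codes.

    Assumption: codes are compared on their first three characters.
    """
    if not gold or len(ranked_lists) != len(gold):
        raise ValueError("ranked_lists and gold must be non-empty and equally long")
    hits = 0
    for ranked, gold_code in zip(ranked_lists, gold):
        target = gold_code.strip().upper()[:3]
        candidates = [code.strip().upper()[:3] for code in ranked[:k]]
        if target in candidates:
            hits += 1
    return hits / len(gold)


# Hypothetical ranked differentials and gold-standard codes for four cases.
ranked = [
    ["I21", "I20", "R07"],
    ["J18", "J44", "J45"],
    ["K35", "N20", "K80"],
    ["R10", "K29", "K21"],
]
gold_codes = ["I20", "J18", "K80", "N39"]
print(top_k_accuracy(ranked, gold_codes))       # 0.75
print(top_k_accuracy(ranked, gold_codes, k=1))  # 0.25
```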
Treating Physician Diagnostic Accuracy Against Gold Standard
Proportion of cases in which the ICD-10 code entered by the treating emergency physician at file closure matches the 3-specialist majority-vote gold standard at the chapter level. Range: 0 to 1.00.
Time frame: At the time of the original clinical encounter (retrospective data spanning August-December 2025)