Diagnostic Accuracy of GPT-4o and Claude 4.6 Sonnet in Turkish ED Anamnesis Notes

N/A | Not Yet Recruiting | NCT07632859

Marmara University Pendik Training and Research Hospital | 600 enrolled

Overview

This retrospective diagnostic accuracy study evaluates the ability of two large language models (LLMs) - GPT-4o (gpt-4o-2024-11-20; OpenAI) and Claude 4.6 Sonnet (claude-sonnet-4-6; Anthropic) - to generate correct diagnoses from anonymized Turkish-language emergency department (ED) anamnesis notes, and compares their performance with the diagnosis entered by the treating emergency physician. A consensus gold standard is established by three independent board-certified emergency medicine specialists who blindly review each note and vote on the primary diagnosis using ICD-10 three-character codes; the majority vote (at least 2 of 3 specialists agreeing) constitutes the reference standard. Both LLMs are evaluated using a standardized zero-shot direct prompting strategy (temperature=0, stateless API sessions). The primary outcome is diagnostic accuracy (proportion of ICD-10 chapter-level matches) and Cohen's kappa for each LLM against the gold standard. Secondary outcomes include top-3 accuracy, treating physician accuracy, inter-model agreement, and subgroup analyses by ESI triage level and ICD-10 chapter. Inter-rater reliability among the three specialists is quantified using Fleiss' kappa. Analyses are performed in Jamovi. This study represents the first evaluation of LLM diagnostic accuracy using Turkish-language clinical notes and the first to benchmark LLM performance against an independent three-specialist majority-vote gold standard rather than against the treating physician's own diagnosis.
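The inter-rater reliability statistic named above can be illustrated with a minimal pure-Python sketch of Fleiss' kappa. The registry record states that analyses are performed in Jamovi, so this is not the study's code; the function name `fleiss_kappa` and the input layout (one list of labels per case, one label per rater) are assumptions made for illustration.

```python
from collections import Counter

def fleiss_kappa(ratings):
    """Fleiss' kappa for a fixed number of raters per case.

    ratings: list of cases; each case is a list holding the label
    (for example an ICD-10 chapter) assigned by each rater.
    Hypothetical helper, not taken from the study protocol.
    """
    n_cases = len(ratings)
    n_raters = len(ratings[0])
    if any(len(case) != n_raters for case in ratings):
        raise ValueError("every case needs the same number of ratings")

    category_totals = Counter()
    agreement_sum = 0.0
    for case in ratings:
        counts = Counter(case)
        category_totals.update(counts)
        # Share of agreeing rater pairs for this case.
        agreement_sum += sum(c * (c - 1) for c in counts.values()) / (
            n_raters * (n_raters - 1)
        )

    p_bar = agreement_sum / n_cases  # mean observed agreement
    total = n_cases * n_raters
    p_e = sum((t / total) ** 2 for t in category_totals.values())  # chance agreement
    if p_e == 1.0:
        return float("nan")  # only one category was ever used; kappa is undefined
    return (p_bar - p_e) / (1 - p_e)
```

With three specialists, each inner list would hold three labels, for example `fleiss_kappa([["IX", "IX", "X"], ["XI", "XI", "XI"]])`.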

STUDY DESIGN: Retrospective diagnostic accuracy study, STARD-AI 2025 reporting, single center, cohort design.

AI INDEX TESTS: (1) GPT-4o (model version gpt-4o-2024-11-20; OpenAI API). (2) Claude 4.6 Sonnet (model version claude-sonnet-4-6; Anthropic API). Both accessed via Python (Google Colab). Temperature=0 for reproducibility. Zero-shot, stateless sessions - no cross-case context. No task-specific fine-tuning or additional training applied; models used as-is via API.

MODEL INTERPRETABILITY: Model interpretability analyses (such as SHAP, Grad-CAM, or layer-attribute visualizations) are not applicable to this study. Because GPT-4o and Claude 4.6 Sonnet are accessed as black-box models through proprietary, closed-source commercial APIs, internal model weights, gradients, and attention architectures are structurally inaccessible for post-hoc interpretability computations.

REFERENCE STANDARD: Three board-certified emergency medicine specialists independently evaluate each anonymized note, blinded to the original physician diagnosis and to each other. Primary diagnosis assigned by at least 2/3 specialists (majority vote) constitutes the gold standard. A 5-case calibration session precedes the main evaluation.

DATA PRIVACY: All anamnesis notes are fully de-identified (name, ID number, date of birth, physician name removed) prior to processing. De-identified notes are stored in a password-protected encrypted database. Only de-identified text is transmitted to LLM APIs - no personal health data. Compliant with Turkish Personal Data Protection Law (KVKK No. 6698).

PATIENT AND PUBLIC INVOLVEMENT: Not applicable. This retrospective study uses fully anonymized existing records; no patient or public involvement in design or conduct.

DATA SHARING: Anonymized dataset will be shared via Zenodo upon article acceptance. Statistical analysis code (Jamovi project files and Python prompt scripts) will be available on GitHub.
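The majority-vote rule in the reference standard (at least 2 of 3 specialists agreeing) can be sketched as follows. The function name `majority_vote` is hypothetical, and returning `None` when all three specialists disagree is an assumption: the record does not say how such cases are resolved.

```python
from collections import Counter

def majority_vote(votes, min_agree=2):
    """Return the label chosen by at least `min_agree` raters, else None.

    votes: the ICD-10 three-character codes assigned by the specialists
    to one case. The None fallback for a three-way split is an
    assumption; the protocol does not describe that situation.
    """
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= min_agree else None
```

For example, `majority_vote(["I21", "I21", "I20"])` yields `"I21"`, while a three-way split such as `["I21", "J18", "K35"]` yields `None` and would need a separate adjudication rule.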

Eligibility

Sex: ALL | Min age: 18 Years

INCLUSION CRITERIA:

* Adult patients (aged 18 years and older) presenting to the emergency department.
* Complete electronic health record available in the hospital information system (HBYS) containing a detailed anamnesis note with chief complaint, symptom duration, associated symptoms, and relevant medical history.
* A definitive primary diagnosis recorded by the treating emergency physician using ICD-10 codes at the time of patient file closure.

EXCLUSION CRITERIA:

* Emergency department anamnesis notes containing fewer than 50 words or completely lacking substantive clinical content.
* Pediatric cases (age under 18 years).
* Patients critically ill and triaged to high-acuity resuscitation areas (Emergency Severity Index [ESI] level 1).
* Clinical notes containing residual identifying information that cannot be fully de-identified, preventing compliance with data privacy regulations.
* Non-independent clinical notes consisting solely of a brief cross-reference to a prior hospital visit without a new history entry.

Outcomes

Primary Outcomes

Diagnostic Accuracy of GPT-4o for ICD-10 Chapter-Level Diagnosis

Proportion of cases in which the GPT-4o primary (rank 1) diagnosis matches the 3-specialist majority-vote gold standard at the ICD-10 chapter level (22 categories). Range: 0 to 1.00.

Time frame: At the time of single-session algorithmic evaluation (each case evaluated once following data extraction in June 2026).

Diagnostic Accuracy of Claude 4.6 Sonnet for ICD-10 Chapter-Level Diagnosis

Proportion of cases in which the Claude 4.6 Sonnet primary (rank 1) diagnosis matches the 3-specialist majority-vote gold standard at the ICD-10 chapter level (22 categories). Range: 0 to 1.00.

Time frame: At the time of single-session algorithmic evaluation (each case evaluated once following data extraction in June 2026).
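Both primary outcomes rest on collapsing a three-character ICD-10 code to one of 22 chapters and counting matches. A minimal sketch of that step is shown below, assuming the WHO ICD-10 chapter block ranges; the names `ICD10_CHAPTERS`, `icd10_chapter`, and `chapter_accuracy` are hypothetical, and counting unmappable codes as non-matches is an assumption not stated in the record.

```python
# WHO ICD-10 chapter ranges as (first block, last block, chapter number).
ICD10_CHAPTERS = [
    ("A00", "B99", 1), ("C00", "D48", 2), ("D50", "D89", 3), ("E00", "E90", 4),
    ("F00", "F99", 5), ("G00", "G99", 6), ("H00", "H59", 7), ("H60", "H95", 8),
    ("I00", "I99", 9), ("J00", "J99", 10), ("K00", "K93", 11), ("L00", "L99", 12),
    ("M00", "M99", 13), ("N00", "N99", 14), ("O00", "O99", 15), ("P00", "P96", 16),
    ("Q00", "Q99", 17), ("R00", "R99", 18), ("S00", "T98", 19), ("V01", "Y98", 20),
    ("Z00", "Z99", 21), ("U00", "U85", 22),
]

def icd10_chapter(code):
    """Map an ICD-10 code (e.g. 'J18' or 'J18.9') to its chapter number, or None."""
    block = code.strip().upper().replace(".", "")[:3]
    if len(block) != 3 or not block[0].isalpha() or not block[1:].isdigit():
        return None
    for start, end, chapter in ICD10_CHAPTERS:
        # Letter + two digits compares correctly as a plain string.
        if start <= block <= end:
            return chapter
    return None

def chapter_accuracy(predicted_codes, gold_codes):
    """Proportion of cases whose predicted and gold codes share a chapter."""
    if len(predicted_codes) != len(gold_codes):
        raise ValueError("predicted and gold lists must have equal length")
    matches = 0
    for pred, gold in zip(predicted_codes, gold_codes):
        pred_ch, gold_ch = icd10_chapter(pred), icd10_chapter(gold)
        # Assumption: a code that cannot be mapped counts as a miss.
        if pred_ch is not None and pred_ch == gold_ch:
            matches += 1
    return matches / len(gold_codes)
```

Under this scheme a model answer of `I21` against a gold standard of `I20` counts as correct, because both fall in chapter IX (diseases of the circulatory system).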

Secondary Outcomes

Cohen's Kappa Between GPT-4o Primary Diagnosis and Gold Standard

Kappa coefficient measuring agreement between GPT-4o rank-1 ICD-10 chapter and the 3-specialist gold standard. Interpreted per Landis & Koch (1977): <=0.20 slight; 0.21-0.40 fair; 0.41-0.60 moderate; 0.61-0.80 substantial; >0.80 almost perfect. Range: -1.00 to 1.00.

Time frame: At the time of algorithmic evaluation (June-July 2026)

Cohen's Kappa Between Claude 4.6 Sonnet Primary Diagnosis and Gold Standard

Kappa coefficient measuring agreement between Claude 4.6 Sonnet rank-1 ICD-10 chapter and the 3-specialist gold standard. Interpreted per Landis & Koch (1977): <=0.20 slight; 0.21-0.40 fair; 0.41-0.60 moderate; 0.61-0.80 substantial; >0.80 almost perfect. Range: -1.00 to 1.00.

Time frame: At the time of algorithmic evaluation (June-July 2026)
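The two kappa outcomes above can be illustrated with a minimal sketch of unweighted Cohen's kappa plus the Landis & Koch bands exactly as listed in this record. The study itself reports using Jamovi, so the function names `cohen_kappa` and `landis_koch` are hypothetical and the code is illustrative only.

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length label sequences."""
    if len(rater_a) != len(rater_b):
        raise ValueError("both raters must label the same cases")
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n  # observed agreement
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement from the product of marginal label frequencies.
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return float("nan")  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)

def landis_koch(kappa):
    """Label a kappa value using the cut-offs quoted in the outcome text."""
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"
```

Here `rater_a` would hold the model's rank-1 chapter per case and `rater_b` the gold-standard chapter for the same cases, in the same order.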

Top-3 Diagnostic Accuracy of GPT-4o

Proportion of cases in which the 3-specialist gold standard diagnosis appears within GPT-4o's ranked list of three differential diagnoses. Range: 0 to 1.00.

Time frame: At the time of algorithmic evaluation (June-July 2026)

Top-3 Diagnostic Accuracy of Claude 4.6 Sonnet

Proportion of cases in which the 3-specialist gold standard diagnosis appears within Claude 4.6 Sonnet's ranked list of three differential diagnoses. Range: 0 to 1.00.

Time frame: At the time of algorithmic evaluation (June-July 2026)
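Top-3 accuracy for either model can be sketched as a membership check over each ranked list. The function name `top_k_accuracy` is hypothetical, and the sketch compares labels exactly as given; whether the study matches at the three-character code or the chapter level for this outcome is not stated in the record, so the caller is assumed to pass labels already reduced to the intended level.

```python
def top_k_accuracy(ranked_predictions, gold_labels, k=3):
    """Proportion of cases whose gold label appears in the first k predictions.

    ranked_predictions: one ranked list of labels per case (rank 1 first).
    gold_labels: the reference label per case, in the same order.
    """
    if len(ranked_predictions) != len(gold_labels):
        raise ValueError("need one ranked list per gold label")
    hits = sum(
        gold in preds[:k] for preds, gold in zip(ranked_predictions, gold_labels)
    )
    return hits / len(gold_labels)
```

Setting `k=1` reduces this to the rank-1 accuracy used for the primary outcomes.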

Treating Physician Diagnostic Accuracy Against Gold Standard

Proportion of cases in which the ICD-10 code entered by the treating emergency physician at file closure matches the 3-specialist majority-vote gold standard at the chapter level. Range: 0 to 1.00.

Time frame: At the time of the original clinical encounter (retrospective data spanning August-December 2025)
