Observational Study on AI Accuracy in Diagnosing and Treating Failed or Painful Hip Arthroplasty

Istituto Ortopedico Rizzoli

20 enrolled

Overview

Primary Goal: This study evaluates the diagnostic and therapeutic accuracy of GPT-4 (an advanced AI language model) against that of three orthopedic surgeons with different levels of experience in cases of failed or painful total hip arthroplasty.

Key Research Questions:
* Diagnostic Accuracy: Are GPT-4's diagnoses correct, partially correct, or incorrect compared with those of human orthopedic surgeons?
* Diagnostic Completeness: Are GPT-4's diagnostic suggestions complete, partially complete, or incomplete compared with those of orthopedic surgeons?
* Treatment Accuracy: Does GPT-4 recommend correct, partially correct, or incorrect treatments for failed hip arthroplasty?
* Treatment Completeness: Are GPT-4's treatment recommendations complete, partially complete, or incomplete compared with those of orthopedic surgeons?

Study Design:
* Participants: 20 anonymized patient cases (ages 18-80) with failed or painful hip arthroplasties, treated at IRCCS Istituto Ortopedico Rizzoli (Bologna, Italy) between 2004 and 2024. Cases were selected for clear diagnostic and treatment records (no ambiguous or incomplete data).
* Comparison Groups: GPT-4 (via the ChatGPT interface) and three orthopedic doctors with different experience levels (resident, specialist, senior surgeon).
* Method: Each case (clinical summary + X-ray image) is presented to GPT-4 and to the three doctors, each of whom must provide a diagnosis and treatment recommendations. Two independent evaluators (the principal investigator and the department head) blindly assess the responses for correctness and completeness on a 3-point scale (0 = wrong/incomplete, 1 = imprecise/partially complete, 2 = correct/complete). Statistical analysis compares GPT-4 with human performance.

Expected Outcomes:
* Determine whether AI can match or outperform doctors in diagnosing and treating hip arthroplasty failures.
* Assess whether GPT-4 could serve as a supplementary tool in orthopedic decision-making.

Ethical & Privacy Considerations:
* No real-time patient data is used; only anonymized past cases are included.
* No personal/sensitive data is shared with OpenAI (GPT-4 is used via a standard web interface).
* The study complies with GDPR, HIPAA, and ethical AI guidelines.

Timeline:
* Study duration: ~8 months (from ethics approval to final analysis).
* Results will be published regardless of outcome.

Why This Study Matters:
* First study evaluating GPT-4's role in complex orthopedic diagnostics.
* Could influence future AI-assisted clinical decision-making in joint replacement surgeries.

Study Type

OBSERVATIONAL

Enrollment

Interventions

GPT-4 Assessment (OTHER)

Diagnostic/prognostic evaluation of each single case provided by AI (GPT-4). GPT-4 provides diagnosis and treatment recommendations via standardized prompts.

Arthroplasty Fellow Assessment (OTHER)

Diagnostic/prognostic evaluation of each single case provided by a human expert.

Specializing Resident (4th year) Assessment (OTHER)

Diagnostic/prognostic evaluation of each single case provided by a human expert.

Junior Resident (3rd year) Assessment (OTHER)

Diagnostic/prognostic evaluation of each single case provided by a human expert.

Eligibility

Sex: ALL
Min age: 18 Years
Max age: 80 Years


Inclusion Criteria:
* Adults (≥18 and ≤80 years old).
* Documented painful or failed total hip arthroplasty requiring clinical/radiological evaluation (2004-2024).
* Complete pre-operative clinical history, imaging (X-ray/tomography), and surgical reports.
* Clear diagnosis of failure mode (e.g., aseptic loosening, infection, fracture, wear).
* Treatment and outcomes fully documented in the institutional database.
* "Exemplary" cases with minimal diagnostic ambiguity (per Engh/Musculoskeletal Infection Society criteria, etc.).

Exclusion Criteria:
* Total hip arthroplasty with no documented failure/pain (well-functioning implants).
* Incomplete clinical/radiological records (e.g., missing pre-operative imaging or surgical notes).
* Complex/multifactorial failures (e.g., concurrent infection + loosening + fracture).
* Non-interpretable radiographs/images (poor quality, missing views).
* Cases with conflicting diagnoses/treatments in the original records.

Outcomes

Primary Outcomes

Diagnostic correctness

Proportion of fully correct diagnoses (score=2) by each rater. Scale 0 (worst outcome) - 2 (best outcome). 0: incorrect, 1: imprecise, 2: correct

Time frame: Immediate (post-case evaluation)

Diagnostic completeness

Proportion of fully complete diagnoses (score=2). Scale 0 (worst outcome) - 2 (best outcome). 0: incomplete, 1: partially complete, 2: complete

Time frame: Immediate (post-case evaluation)

Treatment recommendation correctness

Proportion of fully correct treatments (score=2) by each rater. Scale 0 (worst outcome) - 2 (best outcome). 0: incorrect, 1: imprecise, 2: correct

Time frame: Immediate (post-case evaluation)

Treatment recommendation completeness

Proportion of fully complete treatments (score=2). Scale 0 (worst outcome) - 2 (best outcome). 0: incomplete, 1: partially complete, 2: complete

Time frame: Immediate (post-case evaluation)
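
The four primary outcomes share one calculation: for each respondent and each measure, the share of cases that received the top score (2). A minimal Python sketch of that calculation follows. The record layout, the measure labels, and the sample scores are hypothetical and are not study data; the sketch also assumes a single agreed score per case and respondent, since the record does not say how the two evaluators' scores are combined.

```python
from collections import defaultdict

# Hypothetical records: (respondent, case_id, measure, score), score in {0, 1, 2}.
# These values are illustrative only and are NOT results from the study.
SAMPLE_SCORES = [
    ("GPT-4", 1, "diagnostic_correctness", 2),
    ("GPT-4", 2, "diagnostic_correctness", 1),
    ("GPT-4", 3, "diagnostic_correctness", 2),
    ("GPT-4", 4, "diagnostic_correctness", 0),
    ("Arthroplasty Fellow", 1, "diagnostic_correctness", 2),
    ("Arthroplasty Fellow", 2, "diagnostic_correctness", 2),
    ("Arthroplasty Fellow", 3, "diagnostic_correctness", 2),
    ("Arthroplasty Fellow", 4, "diagnostic_correctness", 1),
]


def proportion_top_score(records, top_score=2):
    """Return {(respondent, measure): fraction of cases scored top_score}."""
    totals = defaultdict(int)
    top = defaultdict(int)
    for respondent, _case_id, measure, score in records:
        if score not in (0, 1, 2):
            raise ValueError(f"score must be 0, 1, or 2, got {score!r}")
        key = (respondent, measure)
        totals[key] += 1
        if score == top_score:
            top[key] += 1
    # Every key in totals has at least one record, so no division by zero.
    return {key: top[key] / totals[key] for key in totals}


if __name__ == "__main__":
    for (respondent, measure), share in proportion_top_score(SAMPLE_SCORES).items():
        print(f"{respondent} | {measure}: {share:.2f}")
```

The same function covers diagnostic completeness and both treatment measures by changing the measure label on the input records.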
