AI vs Human Exam Assessment and Development (AHEAD Trial)

University of British Columbia

258 enrolled

Overview

The Artificial Intelligence (AI) vs Human Exam Assessment and Development (AHEAD) Trial is a participant-blinded randomized controlled trial conducted among first-year medical students at the University of British Columbia. The study evaluates whether multiple-choice examination questions generated using large language models (LLMs) perform comparably to traditionally human-written questions in medical education. Participants were randomized to complete one of two versions of a formative mock final examination consisting of 112 case-based single-best-answer multiple-choice questions (MCQs) aligned with the same course learning objectives. One exam version contained AI-generated questions produced using a structured LLM workflow with independent AI verification, while the other contained questions authored by senior medical students using conventional methods. The study evaluates exam feasibility, psychometric reliability, validity, student acceptability, and educational impact. Outcomes include exam performance, item discrimination indices, distractor efficiency, student perceptions of exam quality and difficulty, and changes in perceived preparedness for the upcoming summative examination.

The AHEAD Trial (AI vs Human Exam Assessment and Development) is a single-center, participant-blinded randomized controlled trial conducted among first-year Doctor of Medicine (MD) students enrolled in the Foundations of Medical Practice I (MEDD 411) course at the University of British Columbia. Participants were randomized in a 1:1 ratio to complete either an AI-generated or a human-generated mock final examination.

Both exams consisted of 112 case-based single-best-answer multiple-choice questions (MCQs) aligned with the same MEDD 411 curricular objectives. AI-generated questions were produced using a structured workflow involving ChatGPT for question generation and Google Gemini for independent verification. Human-generated questions were authored by senior medical students without AI assistance and underwent independent peer review. Both exams followed identical formatting guidelines and assessed the same learning objectives.

All participants completed identical pre-exam and post-exam surveys assessing demographic characteristics, familiarity with artificial intelligence in education, and perceptions of the examination experience. The study evaluates the utility of AI-generated assessments using van der Vleuten's Assessment Utility Framework, including feasibility, reliability, validity, acceptability, and educational impact. The trial aims to determine whether large language models can accelerate the development of formative medical examinations while maintaining comparable psychometric quality and educational value relative to traditional human-authored questions.

Outcomes

Primary Outcomes

Student performance on the mock examination

Comparison of mean examination scores between students randomized to the AI-generated versus human-generated mock examinations.

Time frame: Immediately after completion of the mock examination

Secondary Outcomes

Item discrimination index

Item-level discrimination index comparing AI-generated and human-generated multiple-choice questions, representing the difference in the proportion of correct responses between high-performing and low-performing students.

Time frame: Immediately after the completion of the mock examination
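
As an illustration of the item discrimination index described above, the following is a minimal sketch and not the trial's analysis code. The record does not state how high-performing and low-performing groups are defined; the upper and lower 27% of total scores used here is a common convention and is an assumption, as are the function and parameter names.

```python
from typing import Sequence


def discrimination_index(
    item_correct: Sequence[int],
    total_scores: Sequence[float],
    group_fraction: float = 0.27,  # assumption: upper/lower 27% groups
) -> float:
    """Proportion correct in the high group minus proportion correct in the low group."""
    if len(item_correct) != len(total_scores):
        raise ValueError("item_correct and total_scores must have the same length")
    n = len(total_scores)
    if n == 0:
        raise ValueError("at least one examinee is required")

    # Size of each comparison group (at least one examinee per group).
    k = max(1, round(n * group_fraction))

    # Rank examinees by total exam score, lowest first.
    order = sorted(range(n), key=lambda i: total_scores[i])
    low_group = order[:k]
    high_group = order[-k:]

    p_high = sum(item_correct[i] for i in high_group) / k
    p_low = sum(item_correct[i] for i in low_group) / k
    return p_high - p_low


# Hypothetical data: 10 examinees, item answered correctly (1) or not (0).
scores = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
correct = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
print(discrimination_index(correct, scores))
```

Values near 1 indicate that high scorers answered the item correctly far more often than low scorers; values near 0 indicate that the item does not separate the two groups.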

Distractor efficiency

Proportion of distractors selected by at least 5% of participants, comparing AI-generated and human-generated questions.

Time frame: Immediately after the completion of the mock examination
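
The 5% threshold above can be sketched for a single item as follows. This is an illustrative example under stated assumptions, not the trial's analysis code: the option labels, the number of options per item, and the function name are hypothetical, since the record does not specify them.

```python
from collections import Counter
from typing import Sequence


def functional_distractor_proportion(
    responses: Sequence[str],
    options: Sequence[str],
    key: str,
    threshold: float = 0.05,  # from the outcome definition: selected by at least 5%
) -> float:
    """Share of an item's distractors chosen by at least `threshold` of examinees."""
    if not responses:
        raise ValueError("at least one response is required")
    distractors = [opt for opt in options if opt != key]
    if not distractors:
        raise ValueError("the item has no distractors")

    counts = Counter(responses)
    n = len(responses)
    functional = [d for d in distractors if counts[d] / n >= threshold]
    return len(functional) / len(distractors)


# Hypothetical item with four options (an assumption) and correct answer "A".
answers = ["A"] * 70 + ["B"] * 20 + ["C"] * 6 + ["D"] * 4
print(functional_distractor_proportion(answers, ["A", "B", "C", "D"], key="A"))
```

In this hypothetical item, options B and C each attract at least 5% of examinees while D does not, so two of the three distractors count as functional.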

Student-rated examination quality and acceptability

Student ratings of exam difficulty, clarity, relevance to course material, adequacy of time, multiple-choice question quality, understanding of clinical concepts, identification of knowledge gaps, retention for future clinical practice, and preparedness for the upcoming summative exam, measured immediately after exam completion using 10-point Likert scales (1 = lowest rating, 10 = highest rating). For most domains, higher scores indicate greater endorsement of the construct being measured; for the difficulty item, higher scores indicate greater perceived difficulty.

Time frame: Immediately after completion of the mock examination

Efficiency ratio of MCQ development time per matched learning objective

This outcome measures the development efficiency of artificial intelligence (AI)-generated versus human-generated multiple-choice questions (MCQs). The efficiency ratio was calculated as human-generated MCQ development time divided by AI-generated MCQ development time for matched learning objectives.

Time frame: Baseline (prior to participant testing)
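
The ratio defined above can be sketched per learning objective as follows. This is a minimal illustration, not the trial's method: the record does not state the time unit or how ratios are aggregated across objectives, so the minutes, objective labels, and function name here are hypothetical.

```python
from typing import Dict


def efficiency_ratios(
    human_minutes: Dict[str, float],
    ai_minutes: Dict[str, float],
) -> Dict[str, float]:
    """Human development time divided by AI development time, per matched objective."""
    # Only objectives present in both sets are "matched".
    matched = sorted(human_minutes.keys() & ai_minutes.keys())
    ratios = {}
    for objective in matched:
        if ai_minutes[objective] <= 0:
            raise ValueError(f"AI development time must be positive for {objective}")
        ratios[objective] = human_minutes[objective] / ai_minutes[objective]
    return ratios


# Hypothetical development times in minutes for illustration only.
human = {"LO1": 30.0, "LO2": 45.0}
ai = {"LO1": 10.0, "LO2": 15.0, "LO3": 5.0}
print(efficiency_ratios(human, ai))
```

A ratio above 1 means the human-authored question took longer to develop than its AI-generated counterpart for the same objective.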

Change in perceived preparedness for the summative examination

Change from pre-exam to post-exam in self-rated preparedness for the upcoming summative examination, measured on a 10-point Likert scale (1 = not at all prepared; 10 = extremely prepared), with higher scores indicating greater perceived preparedness.

Time frame: Before and immediately after completion of the mock examination
