This prospective observational study aims to evaluate the effectiveness and educational value of artificial intelligence (AI)-generated multiple true/false questions compared to those developed by experienced academicians in anesthesiology training. A total of 27 anesthesiology residents will be included in the study. A question set of 200 multiple true/false items will be created, with half developed by academicians and the other half generated using an artificial intelligence model (ChatGPT-based system). The questions will be based on standardized educational materials from the anesthesiology training curriculum. Participants will complete the test in a single session. Each correct answer will be scored as one point, and total scores will be calculated. In addition to test performance, item difficulty, discrimination indices, and test reliability will be analyzed. Furthermore, participants' perceptions regarding question quality will be evaluated. The study aims to determine whether AI-generated questions can provide a reliable and effective alternative to traditional question development methods in medical education and contribute to more objective and standardized assessment processes.
This single-center, prospective observational cohort study is designed to evaluate the effectiveness, reliability, and educational value of artificial intelligence (AI)-generated multiple true/false (MTF) questions compared to those developed by experienced academicians in anesthesiology training. The study will be conducted at the Department of Anesthesiology and Reanimation, Kütahya Health Sciences University.

A total of 27 anesthesiology residents will be included and categorized into two groups based on their level of training: junior residents (≤2.5 years of training) and senior residents (>2.5 years of training).

A total of 200 MTF questions will be developed based on standardized anesthesiology educational materials. Half of the questions (n=100) will be prepared by experienced academicians, while the remaining half (n=100) will be generated using an artificial intelligence model (ChatGPT-based system). All questions will be structured according to predefined criteria, including difficulty level (easy, moderate, difficult), clinical relevance, and educational appropriateness.

Participants will complete the question sets in a single session under standardized conditions. Each correct answer will be scored as 1 point, and incorrect answers will be scored as 0. Total test scores will be calculated for each participant.

Item analysis will be performed to evaluate the psychometric properties of the questions: the item difficulty index, item discrimination index, and overall test reliability will be calculated. Additionally, perceived question quality will be assessed using participant feedback.

Statistical analysis will be conducted using SPSS software. The distribution of variables will be assessed, and appropriate parametric or non-parametric tests will be used accordingly. Comparisons will be performed between groups (junior vs. senior residents) and between question sources (AI-generated vs. academician-developed).
A p-value of <0.05 will be considered statistically significant. The study does not involve any clinical intervention, drug administration, or invasive procedure. Participation is voluntary, and written informed consent will be obtained from all participants. All data will be collected anonymously and used solely for research purposes. The results of this study are expected to provide insight into the potential role of artificial intelligence in medical education, particularly in the development of assessment tools, and may contribute to more objective, standardized, and efficient evaluation methods in anesthesiology training.
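The planned item analysis rests on standard psychometric formulas: the difficulty index is the proportion of participants answering an item correctly, the discrimination index can be computed by contrasting the upper and lower 27% of scorers, and reliability of a dichotomously scored test is commonly estimated with Kuder-Richardson 20 (KR-20). The sketch below illustrates these computations on an invented toy scoring matrix; the protocol does not specify which discrimination method or reliability coefficient SPSS will be configured to use, so the upper/lower 27% method and KR-20 shown here are assumptions, not the study's confirmed procedure.

```python
def item_difficulty(responses):
    """Difficulty index: proportion of participants answering correctly (0..1)."""
    return sum(responses) / len(responses)

def item_discrimination(matrix, item, frac=0.27):
    """Upper/lower 27% discrimination index for one item (an assumed method).

    matrix: one response list per participant (1 = correct, 0 = incorrect).
    """
    order = sorted(range(len(matrix)), key=lambda p: sum(matrix[p]))
    k = max(1, round(frac * len(matrix)))
    lower, upper = order[:k], order[-k:]
    p_upper = sum(matrix[p][item] for p in upper) / k
    p_lower = sum(matrix[p][item] for p in lower) / k
    return p_upper - p_lower

def kr20(matrix):
    """KR-20 reliability estimate for dichotomously scored items."""
    n_items = len(matrix[0])
    totals = [sum(row) for row in matrix]
    mean = sum(totals) / len(totals)
    var = sum((t - mean) ** 2 for t in totals) / len(totals)  # population variance
    pq = sum(
        p * (1 - p)
        for p in (item_difficulty([row[i] for row in matrix]) for i in range(n_items))
    )
    return (n_items / (n_items - 1)) * (1 - pq / var) if var else 0.0

# Toy scoring matrix (NOT study data): 4 participants x 3 items, 1 = correct.
scores = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 0, 0],
]
print(item_difficulty([row[0] for row in scores]))  # 0.75
print(item_discrimination(scores, item=2))          # 1.0
print(round(kr20(scores), 3))                       # 0.75
```

In the actual study the matrix would have 27 rows (residents) and 200 columns (items), with difficulty and discrimination computed separately for the AI-generated and academician-authored halves.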
Study Type
OBSERVATIONAL
Enrollment
27
Kütahya Health Sciences University
Kütahya, Turkey (Türkiye)
Item Difficulty Index of AI-generated and expert-authored questions
For each question, the item difficulty index will be calculated as the proportion of participants who answer the item correctly. Item difficulty indices will be compared between AI-generated and expert-authored questions.
Time frame: Assessed once after completion of each participant's single 60-minute examination session; final item analysis performed after all participants complete the examination, up to 1 month.
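As a minimal illustration of this outcome measure, per-source difficulty indices can be averaged and contrasted. The values below are invented placeholders, not study data, and the actual between-source test will be chosen after the distributional checks described in the analysis plan.

```python
# Hypothetical difficulty indices (proportion correct per item); NOT study data.
ai_items = [0.80, 0.65, 0.55, 0.90]      # AI-generated questions
expert_items = [0.70, 0.60, 0.50, 0.75]  # academician-authored questions

def mean_difficulty(indices):
    """Average difficulty index of a question set (higher = easier items)."""
    return sum(indices) / len(indices)

difference = mean_difficulty(ai_items) - mean_difficulty(expert_items)
print(round(difference, 4))
```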