Large Language Models in Medical Education: Comparing ChatGPT- to Human-Generated Exam Questions
Article Status
Published
Authors/contributors
- Laupichler, Matthias Carl (Author)
- Rother, Johanna Flora (Author)
- Grunwald Kadow, Ilona C. (Author)
- Ahmadi, Seifollah (Author)
- Raupach, Tobias (Author)
Title
Large Language Models in Medical Education: Comparing ChatGPT- to Human-Generated Exam Questions
Abstract
Problem
Creating medical exam questions is time consuming, but well-written questions can be used for test-enhanced learning, which has been shown to have a positive effect on student learning. The automated generation of high-quality questions using large language models (LLMs), such as ChatGPT, would therefore be desirable. However, there are no current studies that compare students’ performance on LLM-generated questions to questions developed by humans.
Approach
The authors compared student performance on questions generated by ChatGPT (LLM questions) with questions created by medical educators (human questions). Two sets of 25 multiple-choice questions (MCQs) were created, each with 5 answer options, 1 of which was correct. The first set of questions was written by an experienced medical educator, and the second set was created by ChatGPT 3.5 after the authors identified learning objectives and extracted some specifications from the human questions. Students answered all questions in random order in a formative paper-and-pencil test that was offered leading up to the final summative neurophysiology exam (summer 2023). For each question, students also indicated whether they thought it had been written by a human or ChatGPT.
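The abstract does not specify the prompt wording or interface beyond stating that ChatGPT 3.5 generated the questions from identified learning objectives and specifications extracted from the human questions. As a point of reference, the sketch below shows how a comparable 5-option MCQ could be requested programmatically via the OpenAI chat API; the model name, prompt text, and the generate_mcq helper are illustrative assumptions, not the authors' actual procedure (the study may well have used the ChatGPT web interface).

```python
# Minimal sketch, assuming the OpenAI Python SDK (v1.x) and an API key in
# the environment; this is NOT the study's prompt or workflow.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_mcq(learning_objective: str) -> str:
    """Request one 5-option MCQ (1 correct answer) on a learning objective.

    The prompt wording is a hypothetical reconstruction for illustration.
    """
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {
                "role": "user",
                "content": (
                    "Write one multiple-choice exam question for medical "
                    "students on the following neurophysiology learning "
                    f"objective: {learning_objective}. Provide exactly five "
                    "answer options labeled A-E, with exactly one correct "
                    "option, and state which option is correct."
                ),
            }
        ],
    )
    return response.choices[0].message.content

print(generate_mcq("action potential propagation in myelinated axons"))
```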
Outcomes
The final data set consisted of 161 participants and 46 MCQs (25 human and 21 LLM questions). There was no statistically significant difference in item difficulty between the 2 question sets, but discriminatory power was statistically significantly higher in human than LLM questions (mean = .36, standard deviation [SD] = .09 vs mean = .24, SD = .14; P = .001). On average, students identified 57% of question sources (human or LLM) correctly.
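The abstract reports item difficulty and discriminatory power but does not define the exact formulas used. The sketch below computes these measures under common classical-test-theory definitions (difficulty as the proportion of correct responses, discrimination as the corrected item-total correlation); the definitions, the item_statistics helper, and the simulated data are assumptions for illustration, not the study's analysis.

```python
import numpy as np

def item_statistics(responses: np.ndarray):
    """Item difficulty and discrimination for a 0/1 response matrix.

    responses: shape (n_students, n_items), 1 = correct, 0 = incorrect.
    Returns (difficulty, discrimination), one value per item.
    """
    # Difficulty: proportion of students answering each item correctly.
    difficulty = responses.mean(axis=0)

    # Discrimination: corrected item-total correlation, i.e. the Pearson
    # correlation between an item's score and the total score on the
    # remaining items (so the item is not correlated with itself).
    total = responses.sum(axis=1)
    n_items = responses.shape[1]
    discrimination = np.empty(n_items)
    for j in range(n_items):
        rest = total - responses[:, j]
        discrimination[j] = np.corrcoef(responses[:, j], rest)[0, 1]
    return difficulty, discrimination

# Illustrative use with simulated data (161 students, 46 items; not the
# study's data).
rng = np.random.default_rng(0)
simulated = (rng.random((161, 46)) < 0.7).astype(int)
diff, disc = item_statistics(simulated)
print(diff.round(2), disc.round(2))
```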
Next Steps
Future research should replicate the study procedure in other contexts (e.g., other medical subjects, semesters, countries, and languages). In addition, the question of whether LLMs are suitable for generating different question types, such as key feature questions, should be investigated.
Publication
Academic Medicine
Volume
99
Issue
5
Pages
508-512
Date
5/2024
Journal Abbr
Acad. Med.
Language
en
ISSN
1040-2446
Short Title
Large Language Models in Medical Education
Accessed
08/10/2025, 23:11
Library Catalogue
DOI.org (Crossref)
Extra
Citation Key: laupichler2024
Citation
Laupichler, M. C., Rother, J. F., Grunwald Kadow, I. C., Ahmadi, S., & Raupach, T. (2024). Large Language Models in Medical Education: Comparing ChatGPT- to Human-Generated Exam Questions. Academic Medicine, 99(5), 508–512. https://doi.org/10.1097/ACM.0000000000005626