Benchmarking Generative AI for Scoring Medical Student Interviews in Objective Structured Clinical Examinations (OSCEs)

Article Status
Published
Authors/contributors
Geathers, J., Hicke, Y., Chan, C., Rajashekar, N., Sewell, J., Cornes, S., Kizilcec, R. F., & Shung, D.
Title
Benchmarking Generative AI for Scoring Medical Student Interviews in Objective Structured Clinical Examinations (OSCEs)
Abstract
Objective Structured Clinical Examinations (OSCEs) are widely used to assess medical students' communication skills, but scoring interview-based assessments is time-consuming and potentially subject to human bias. This study explored the potential of large language models (LLMs) to automate OSCE evaluations using the Master Interview Rating Scale (MIRS). We compared the performance of four state-of-the-art LLMs (GPT-4o, Claude 3.5, Llama 3.1, and Gemini 1.5 Pro) in evaluating OSCE transcripts across all 28 items of the MIRS under the conditions of zero-shot, chain-of-thought (CoT), few-shot, and multi-step prompting. The models were benchmarked against a dataset of 10 OSCE cases with 174 expert consensus scores available. Model performance was measured using three accuracy metrics (exact, off-by-one, thresholded). Averaging across all MIRS items and OSCE cases, LLMs performed with low exact accuracy (0.27 to 0.44) and moderate to high off-by-one accuracy (0.67 to 0.87) and thresholded accuracy (0.75 to 0.88). A zero temperature parameter ensured high intra-rater reliability (α = 0.98 for GPT-4o). CoT, few-shot, and multi-step techniques proved valuable when tailored to specific assessment items. Performance was consistent across MIRS items, independent of encounter phases and communication domains. We demonstrated the feasibility of AI-assisted OSCE evaluation and provided benchmarking of multiple LLMs across multiple prompt techniques. Our work provides a baseline performance assessment for LLMs that lays a foundation for future research into automated assessment of clinical communication skills.
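For context, the three accuracy metrics named in the abstract (exact, off-by-one, thresholded) can be illustrated with a minimal sketch. The code below is not from the paper; the 1-to-5 MIRS-style rating scale, the pass/fail cutoff of 4, and the example scores are assumptions made purely for illustration.

```python
# Illustrative sketch (assumed, not from the paper): computing exact,
# off-by-one, and thresholded accuracy for one assessment item.

def exact_accuracy(pred, gold):
    # Fraction of cases where the model score matches the expert score exactly.
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)

def off_by_one_accuracy(pred, gold):
    # Fraction of cases where the model score is within 1 point of the expert score.
    return sum(abs(p - g) <= 1 for p, g in zip(pred, gold)) / len(gold)

def thresholded_accuracy(pred, gold, cutoff=4):
    # Agreement on which side of an assumed pass/fail cutoff each score falls.
    return sum((p >= cutoff) == (g >= cutoff) for p, g in zip(pred, gold)) / len(gold)

llm_scores = [3, 4, 5, 2, 4]     # hypothetical model scores for one MIRS item
expert_scores = [4, 4, 5, 3, 5]  # hypothetical expert consensus scores

print(exact_accuracy(llm_scores, expert_scores))        # 0.4
print(off_by_one_accuracy(llm_scores, expert_scores))   # 1.0
print(thresholded_accuracy(llm_scores, expert_scores))  # 0.8
```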
Repository
arXiv
Archive ID
arXiv:2501.13957
Date
2025-05-15
Accessed
10/10/2025, 15:52
Library Catalogue
Extra
arXiv:2501.13957 [cs]. Citation Key: geathers2025. <AI Smry>: The feasibility of AI-assisted OSCE evaluation is demonstrated and benchmarking of multiple LLMs across multiple prompt techniques is provided, providing a baseline performance assessment for LLMs that lays a foundation for future research into automated assessment of clinical communication skills.
Citation
Geathers, J., Hicke, Y., Chan, C., Rajashekar, N., Sewell, J., Cornes, S., Kizilcec, R. F., & Shung, D. (2025). Benchmarking Generative AI for Scoring Medical Student Interviews in Objective Structured Clinical Examinations (OSCEs) (arXiv:2501.13957). arXiv. https://doi.org/10.48550/arXiv.2501.13957