Automated Scoring of Constructed Response Items in Math Assessment Using Large Language Models

Article Status
Published
Authors/contributors
Morris, W., Holmes, L., Choi, J. S., & Crossley, S.
Title
Automated Scoring of Constructed Response Items in Math Assessment Using Large Language Models
Abstract
Recent developments in artificial intelligence have improved the automated assessment of extended response items in mathematics, potentially allowing these items to be scored cheaply and at scale. This study details the grand prize-winning approach to developing large language models (LLMs) to automatically score the ten items in the National Assessment of Educational Progress (NAEP) Math Scoring Challenge. The approach uses extensive preprocessing to balance the class labels for each item: over-represented classes were identified and filtered using a classifier trained on document-term matrices, and under-represented classes were augmented using a generative pre-trained large language model (Grammarly’s CoEdIT-XL; Raheja et al., 2023). We also use input modification schemes, hand-crafted for each item type, that incorporate information from parts of the multi-step math problem students had to solve. Finally, we fine-tune several pre-trained large language models on the modified input for each individual item in the NAEP automated math scoring challenge, with DeBERTa (He et al., 2021a) showing the best performance. This approach achieved human-like agreement, a quadratic weighted kappa (QWK) within 0.05 of human–human agreement, on nine of the ten items in a held-out test set.
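To make the class-balancing step concrete, the sketch below shows one way to filter an over-represented score class using a classifier trained on a document-term matrix, in the spirit of the preprocessing the abstract describes. The DataFrame columns, the scikit-learn classifier choice, and the confidence threshold are illustrative assumptions, not details from the paper.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Assumed input: a DataFrame `df` with columns "response" (student text)
# and "score" (human-assigned label). Column names are hypothetical.
vectorizer = CountVectorizer(min_df=2)           # build a document-term matrix
X = vectorizer.fit_transform(df["response"])
y = df["score"]

clf = LogisticRegression(max_iter=1000).fit(X, y)
confidence = clf.predict_proba(X).max(axis=1)    # DTM classifier's confidence per response

# Drop the most redundant (confidently classified) examples from the
# majority class, pulling the label distribution closer to balanced.
# The 0.95 threshold is an assumption for illustration.
majority = y.value_counts().idxmax()
keep = (y != majority) | (confidence < 0.95)
balanced = df[keep]
```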
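For the augmentation of under-represented classes, Grammarly’s CoEdIT-XL is available on the Hugging Face Hub as a T5-style sequence-to-sequence model, so one plausible minimal sketch looks like the following. The instruction prefix and sampling parameters are assumptions; the paper’s exact prompts are not reproduced here.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("grammarly/coedit-xl")
model = AutoModelForSeq2SeqLM.from_pretrained("grammarly/coedit-xl")

def augment(response: str, n: int = 3) -> list[str]:
    """Generate n paraphrases of a minority-class response.

    The instruction wording and sampling settings below are illustrative.
    """
    inputs = tokenizer(f"Paraphrase this: {response}", return_tensors="pt")
    outputs = model.generate(
        **inputs,
        do_sample=True,
        top_p=0.95,
        max_new_tokens=128,
        num_return_sequences=n,
    )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
```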
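Finally, a hedged sketch of per-item fine-tuning with a DeBERTa checkpoint, evaluated with QWK via scikit-learn’s quadratically weighted Cohen’s kappa. The base checkpoint, hyperparameters, number of score levels, and the `train_ds`/`eval_ds` dataset objects are all assumptions; the paper fine-tunes several pre-trained LLMs per item and reports DeBERTa as strongest.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

num_score_levels = 3  # assumed number of score categories for one item

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-base", num_labels=num_score_levels)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    # Quadratic weighted kappa: the human-machine agreement metric
    # referenced in the abstract.
    return {"qwk": cohen_kappa_score(labels, preds, weights="quadratic")}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=3),
    train_dataset=train_ds,   # assumed: tokenized responses + labels
    eval_dataset=eval_ds,     # assumed: held-out split for the same item
    compute_metrics=compute_metrics,
)
trainer.train()
```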
Publication
International Journal of Artificial Intelligence in Education
Volume
35
Issue
2
Pages
559–586
Date
2024-07-18
Journal Abbr
Int. J. Artif. Intell. Educ.
Language
en
ISSN
1560-4292
Accessed
31/07/2024, 15:46
Library Catalogue
DOI.org (Crossref)
Extra
Citation Key: morris2024 <Title>: Automated Scoring of Constructed Response Items in Math Assessment Using Large Language Models
Citation
Morris, W., Holmes, L., Choi, J. S., & Crossley, S. (2024). Automated Scoring of Constructed Response Items in Math Assessment Using Large Language Models. International Journal of Artificial Intelligence in Education, 35(2), 559–586. https://doi.org/10.1007/s40593-024-00418-w