Rating Short L2 Essays on the CEFR Scale with GPT-4
Article Status
Published
Authors/contributors
- Yancey, Kevin P. (Author)
- Laflair, Geoffrey (Author)
- Verardi, Anthony (Author)
- Burstein, Jill (Author)
Title
Rating Short L2 Essays on the CEFR Scale with GPT-4
Abstract
Essay scoring is a critical task used to evaluate second-language (L2) writing proficiency on high-stakes language assessments. While automated scoring approaches are mature and have been around for decades, human scoring is still considered the gold standard, despite its high costs and well-known issues such as human rater fatigue and bias. The recent introduction of large language models (LLMs) brings new opportunities for automated scoring. In this paper, we evaluate how well GPT-3.5 and GPT-4 can rate short essay responses written by L2 English learners on a high-stakes language assessment, computing inter-rater agreement with human ratings. Results show that when calibration examples are provided, GPT-4 can perform almost as well as modern Automatic Writing Evaluation (AWE) methods, but agreement with human ratings can vary depending on the test-taker's first language (L1).
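The abstract's central evaluation step is computing inter-rater agreement between LLM and human ratings. The record does not specify which agreement statistic the paper uses; as a purely illustrative sketch, the snippet below computes quadratically weighted Cohen's kappa (a standard agreement measure for ordinal scales such as the CEFR) with scikit-learn, on hypothetical ratings.

```python
# Hypothetical sketch: quadratically weighted Cohen's kappa, a common
# inter-rater agreement statistic for ordinal essay scores. The paper's
# abstract does not name its exact metric; this is illustrative only.
from sklearn.metrics import cohen_kappa_score

# Toy CEFR levels mapped to ordinal integers
# (A1=0, A2=1, B1=2, B2=3, C1=4, C2=5); all values are made up.
human_ratings = [2, 3, 3, 1, 4, 2, 3]   # hypothetical human scores
gpt4_ratings = [2, 3, 2, 1, 4, 3, 3]    # hypothetical GPT-4 scores

# Quadratic weighting penalizes larger disagreements more heavily,
# which suits ordered scales like the CEFR.
qwk = cohen_kappa_score(human_ratings, gpt4_ratings, weights="quadratic")
print(f"Quadratically weighted kappa: {qwk:.3f}")
```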
Date
2023
Proceedings Title
Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023)
Conference Name
18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023)
Place
Toronto, Canada
Publisher
Association for Computational Linguistics
Pages
576–584
Accessed
14/09/2023, 14:45
Library Catalogue
ACLWeb
Citation
Yancey, K. P., Laflair, G., Verardi, A., & Burstein, J. (2023). Rating Short L2 Essays on the CEFR Scale with GPT-4. Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023), 576–584. https://doi.org/10.18653/v1/2023.bea-1.49