Can Large Language Models Automatically Score Proficiency of Written Essays?

Article Status

Published

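Authors/contributors

Mansour, W.; Albatarni, S.; Eltanbouly, S.; Elsayed, T.

Title

Can Large Language Models Automatically Score Proficiency of Written Essays?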

Abstract

Although several methods have been proposed to address the problem of automated essay scoring (AES) over the last 50 years, there is still much to be desired in terms of effectiveness. Large Language Models (LLMs) are transformer-based models that demonstrate extraordinary capabilities on various tasks. In this paper, we test the ability of LLMs, given their powerful linguistic knowledge, to analyze and effectively score written essays. We experimented with two popular LLMs, namely ChatGPT and Llama. We aim to check whether these models can perform this task and, if so, how their performance compares with state-of-the-art (SOTA) models at two levels: holistically and per individual writing trait. We used prompt-engineering tactics to design four different prompts intended to bring out the models' maximum potential on this task. Our experiments on the ASAP dataset revealed several interesting observations. First, choosing the right prompt depends highly on the model and the nature of the task. Second, the two LLMs exhibited comparable average performance in AES, with a slight advantage for ChatGPT. Finally, despite the performance gap between the two LLMs and SOTA models in terms of predictions, the LLMs provide feedback to enhance the quality of the essays, which can potentially help both teachers and students.

Repository

arXiv

Archive ID

arXiv:2403.06149

Date

2024-04-15

Citation Key

mansour2024

URL

http://arxiv.org/abs/2403.06149

Accessed

31/07/2024, 15:44

Library Catalogue

arXiv.org

Extra

arXiv:2403.06149 [cs]
<Title>: Can large language models automatically assess the language proficiency of written essays?
Read_Status: New
Read_Status_Date: 2026-01-26T11:32:28.375Z

Citation

Mansour, W., Albatarni, S., Eltanbouly, S., & Elsayed, T. (2024). Can Large Language Models Automatically Score Proficiency of Written Essays? (arXiv:2403.06149). arXiv. http://arxiv.org/abs/2403.06149

Link to this record

https://aievidencehub.org/lib/QX6GPVBE