9 resources
- Jill Burstein | Journal Article
- Jill Burstein, Martin Chodorow | Dec 16th, 1999 | Conference Paper
- Jill Burstein, Martin Chodorow | Dec 16th, 1999 | Conference Paper
The e-rater system™ is an operational automated essay scoring system developed at Educational Testing Service (ETS). The average agreement between human readers, and between independent human readers and e-rater, is approximately 92%. There is much interest in the larger writing community in examining the system's performance on nonnative speaker essays. This paper focuses on results of a study that show e-rater's performance on Test of Written English (TWE) essay responses written by...
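For context on the agreement figure quoted in this abstract: rater agreement in the essay scoring literature is usually reported as exact agreement (identical scores) or adjacent agreement (scores within one point). Below is a minimal Python sketch with invented scores on the common 1-to-6 holistic scale; the abstract does not specify which variant the ~92% refers to, so both are shown.

```python
# Illustrative sketch of how rater agreement figures like the ~92% quoted
# in the abstract above are typically computed. Scores are invented; the
# exact/adjacent distinction follows common practice in this literature.

def exact_agreement(a, b):
    """Fraction of essays on which the two raters give identical scores."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def adjacent_agreement(a, b, tolerance=1):
    """Fraction of essays on which the two scores differ by at most `tolerance`."""
    return sum(abs(x - y) <= tolerance for x, y in zip(a, b)) / len(a)

# Hypothetical scores for ten essays on a 1-6 holistic scale.
human  = [4, 3, 5, 2, 6, 4, 3, 5, 4, 2]
erater = [4, 3, 4, 2, 5, 4, 3, 5, 4, 3]

print(f"exact agreement:    {exact_agreement(human, erater):.0%}")     # 70%
print(f"adjacent agreement: {adjacent_agreement(human, erater):.0%}")  # 100%
```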
- Jill Burstein, Kevin Yancey, Klinton Bic... | Dec 16th, 2023 | Document
- Kevin P. Yancey, Geoffrey Laflair, Antho... | Dec 16th, 2023 | Conference Paper
Essay scoring is a critical task used to evaluate second-language (L2) writing proficiency on high-stakes language assessments. While automated scoring approaches are mature and have been around for decades, human scoring is still considered the gold standard, despite its high costs and well-known issues such as human rater fatigue and bias. The recent introduction of large language models (LLMs) brings new opportunities for automated scoring. In this paper, we evaluate how well GPT-3.5 and...
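As a rough illustration of the zero-shot LLM scoring setup this paper evaluates: the sketch below prompts a chat model for a single holistic score. The model name, prompt wording, and 1-to-6 scale are assumptions for the sketch, not the paper's actual protocol.

```python
# A minimal zero-shot scoring sketch in the spirit of the evaluation above.
# Prompt wording, rubric, and model choice are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def score_essay_zero_shot(essay: str) -> str:
    """Ask the model for a single holistic score with no scored examples."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,  # keep scoring output as stable as possible
        messages=[
            {"role": "system",
             "content": ("You are an experienced rater of second-language "
                         "(L2) English essays. Reply with only an integer "
                         "holistic score from 1 (lowest) to 6 (highest).")},
            {"role": "user", "content": essay},
        ],
    )
    return response.choices[0].message.content.strip()
```

In an evaluation like the one described, scores produced this way would then be compared against human ratings, for example with the agreement statistics sketched earlier.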
- Jill Burstein, Geoffrey T. LaFlair, Anto... | Mar 23rd, 2022 | Report
The Duolingo English Test is a groundbreaking, digital-first, computer-adaptive English language proficiency test intended to support stakeholder admissions decisions at English-medium institutions. The test measures four key constructs for university English language proficiency: Speaking, Writing, Reading, and Listening (SWRL), and is aligned with the Common European Framework of Reference for Languages (CEFR) proficiency levels and descriptors. As a digital-first assessment, the test...
- Nitin Madnani, Anastassia Loukina, Alina... | Dec 16th, 2017 | Conference Paper
- Jill Burstein, Geoffrey T. LaFlair, Kevi... | Aug 28th, 2024 | Preprint
Artificial intelligence (AI) creates opportunities for assessments, such as efficiencies for item generation and scoring of spoken and written responses. At the same time, it poses risks (such as bias in AI-generated item content). Responsible AI (RAI) practices aim to mitigate risks associated with AI. This chapter addresses the critical role of RAI practices in achieving test quality (appropriateness of test score inferences), and test equity (fairness to all test takers). To illustrate,...
- Imran Chamieh, Torsten Zesch, Klaus Gieb... | Jun 16th, 2024 | Conference Paper
In this work, we investigate the potential of Large Language Models (LLMs) for automated short answer scoring. We test zero-shot and few-shot settings, and compare with fine-tuned models and a supervised upper bound, across three diverse datasets. Our results show that LLMs perform poorly in zero-shot and few-shot settings: they have difficulty with tasks that require complex reasoning or domain-specific knowledge. While the models show promise on general knowledge tasks...
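A small sketch of the zero-shot versus few-shot contrast this abstract describes, applied to short answer scoring. The prompt template, rubric range, and example items below are invented for illustration; the paper's datasets and templates may differ.

```python
# Sketch of the zero-shot vs. few-shot contrast studied above, applied to
# short answer scoring. The template, rubric range, and items are invented.
from typing import Sequence

def build_prompt(question: str, answer: str,
                 examples: Sequence[tuple[str, int]] = ()) -> str:
    """Zero-shot when `examples` is empty; few-shot when scored
    demonstrations are prepended so the model can infer the rubric."""
    lines = [f"Question: {question}",
             "Score each answer from 0 (wrong) to 2 (fully correct).", ""]
    for ex_answer, ex_score in examples:  # few-shot demonstrations, if any
        lines += [f"Answer: {ex_answer}", f"Score: {ex_score}", ""]
    lines += [f"Answer: {answer}", "Score:"]
    return "\n".join(lines)

# Zero-shot: only the question, rubric range, and target answer.
print(build_prompt("Why does ice float on water?",
                   "Ice is less dense than liquid water."))

# Few-shot: two scored demonstrations precede the target answer.
print(build_prompt("Why does ice float on water?",
                   "Because it is cold.",
                   examples=[("Ice is less dense than water.", 2),
                             ("Water gets hard.", 0)]))
```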