Results – Evidence Library – Artificial Intelligence in Measurement and Education

Responsible AI Standards

Jill Burstein | journalArticle

Automated Essay Scoring for Nonnative English Speakers

Jill Burstein, Martin Chodorow | May 4th, 1999 | conferencePaper

The e-rater system™ is an operational automated essay scoring system developed at Educational Testing Service (ETS). The average agreement between human readers, and between independent human readers and e-rater, is approximately 92%. There is much interest in the larger writing community in examining the system's performance on nonnative speaker essays. This paper focuses on results of a study that show e-rater's performance on Test of Written English (TWE) essay responses written by...

Responsible AI Standards - Duolingo English Test

Jill Burstein, Kevin Yancey, Klinton Bic... | May 4th, 2023 | document

Rating Short L2 Essays on the CEFR Scale with GPT-4

Kevin P. Yancey, Geoffrey Laflair, Antho... | May 4th, 2023 | conferencePaper

Essay scoring is a critical task used to evaluate second-language (L2) writing proficiency on high-stakes language assessments. While automated scoring approaches are mature and have been around for decades, human scoring is still considered the gold standard, despite its high costs and well-known issues such as human rater fatigue and bias. The recent introduction of large language models (LLMs) brings new opportunities for automated scoring. In this paper, we evaluate how well GPT-3.5 and...

A Theoretical Assessment Ecosystem for a Digital-First Assessment—The Duolingo English Test

Jill Burstein, Geoffrey T. LaFlair, Anto... | Mar 23rd, 2022 | report

The Duolingo English Test is a groundbreaking, digital-first, computer-adaptive English language proficiency test intended to support stakeholder admissions decisions at English-medium institutions. The test measures four key constructs for university English language proficiency: Speaking, Writing, Reading, and Listening (SWRL), and is aligned with the Common European Framework of Reference for Languages (CEFR) proficiency levels and descriptors. As a digital-first assessment, the test...

Building Better Open-Source Tools to Support Fairness in Automated Scoring

Nitin Madnani, Anastassia Loukina, Alina... | May 4th, 2017 | conferencePaper

Responsible AI for Test Equity and Quality: The Duolingo English Test as a Case Study

Jill Burstein, Geoffrey T. LaFlair, Kevi... | Aug 28th, 2024 | preprint

Artificial intelligence (AI) creates opportunities for assessments, such as efficiencies for item generation and scoring of spoken and written responses. At the same time, it poses risks (such as bias in AI-generated item content). Responsible AI (RAI) practices aim to mitigate risks associated with AI. This chapter addresses the critical role of RAI practices in achieving test quality (appropriateness of test score inferences), and test equity (fairness to all test takers). To illustrate,...

LLMs in Short Answer Scoring: Limitations and Promise of Zero-Shot and Few-Shot Approaches

Imran Chamieh, Torsten Zesch, Klaus Gieb... | Jun 4th, 2024 | conferencePaper

In this work, we investigate the potential of Large Language Models (LLMs) for automated short answer scoring. We test zero-shot and few-shot settings, and compare with fine-tuned models and a supervised upper-bound, across three diverse datasets. Our results show that LLMs perform poorly in zero-shot and few-shot settings: they have difficulty with tasks that require complex reasoning or domain-specific knowledge. While the models show promise on general knowledge tasks....
