706 resources
- Cynthia Lee, Kelvin C.K. Wong, William K... | Feb 4th, 2009 | journalArticle
- Klaus Zechner, Derrick Higgins, Xiaoming... | Oct 1st, 2007 | journalArticle
- Anat Ben-Simon, Randy Elliot Bennett | May 4th, 2007 | journalArticle
  This study evaluated a “substantively driven” method for scoring NAEP writing assessments automatically. The study used variations of an existing commercial program, e-rater®, to compare the performance of three approaches to automated essay scoring: a brute-empirical approach in which variables are selected and weighted solely according to statistical criteria, a hybrid approach in which a fixed set of variables more closely tied to the characteristics of good writing was used but the...
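  The “brute-empirical” approach the abstract contrasts amounts to choosing feature weights purely by statistical fit against human scores. A minimal sketch of that idea using ordinary least squares; the feature matrix, feature meanings, and score values below are invented for illustration:

  ```python
  import numpy as np

  # Hypothetical features, one row per essay: length in words,
  # error rate, vocabulary sophistication (all made up here).
  X = np.array([[550, 0.02, 0.61],
                [320, 0.07, 0.40],
                [710, 0.01, 0.72],
                [410, 0.05, 0.55]])
  human_scores = np.array([5.0, 3.0, 6.0, 4.0])

  # Brute-empirical weighting: weights chosen solely by statistical
  # fit to the human scores (least squares, with an intercept).
  X1 = np.column_stack([X, np.ones(len(X))])
  weights, *_ = np.linalg.lstsq(X1, human_scores, rcond=None)
  print((X1 @ weights).round(2))
  ```

  A hybrid approach, by contrast, would fix the feature set (and possibly constrain the weights) on substantive grounds before any fitting.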
- Joseph P. Turian, Luke Shen, I. Dan Mela... | Jan 1st, 2006 | conferencePaper
  Evaluation of MT evaluation measures is limited by inconsistent human judgment data. Nonetheless, machine translation can be evaluated using the well-known measures precision, recall, and their average, the F-measure. The unigram-based F-measure has significantly higher correlation with human judgments than recently proposed alternatives. More importantly, this standard measure has an intuitive graphical interpretation, which can facilitate insight into how MT systems might be improved. The...
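  As a concrete reference point, the unigram F-measure the abstract describes is the harmonic mean of unigram precision and recall computed over the multiset overlap of candidate and reference tokens. A minimal sketch, assuming whitespace tokenization and a single reference (real implementations handle tokenization and multiple references more carefully):

  ```python
  from collections import Counter

  def unigram_f_measure(candidate: str, reference: str) -> float:
      """Harmonic mean of unigram precision and recall
      (clipped multiset overlap between the two token bags)."""
      cand = Counter(candidate.lower().split())
      ref = Counter(reference.lower().split())
      overlap = sum((cand & ref).values())  # matched unigrams
      if overlap == 0:
          return 0.0
      precision = overlap / sum(cand.values())
      recall = overlap / sum(ref.values())
      return 2 * precision * recall / (precision + recall)

  print(unigram_f_measure("the cat sat on the mat",
                          "the cat was sitting on the mat"))
  ```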
- Alon Lavie, Kenji Sagae, Shyamsundar Jay... | May 4th, 2004 | bookSection
- Chin-Yew Lin, Franz Josef Och | May 4th, 2004 | conferencePaper
- Paul Deane, Kathleen M. Sheehan | Apr 4th, 2003 | conferencePaper
- George Doddington | May 4th, 2002 | conferencePaper
- Lawrence M. Rudner, T. Liang | May 4th, 2002 | conferencePaper
- Kishore Papineni, Salim Roukos, Todd War... | May 4th, 2001 | conferencePaper
- Jill Burstein, Martin Chodorow | May 4th, 1999 | conferencePaper
  The e-rater™ system is an operational automated essay scoring system developed at Educational Testing Service (ETS). The average agreement between human readers, and between independent human readers and e-rater, is approximately 92%. There is much interest in the larger writing community in examining the system's performance on nonnative speaker essays. This paper focuses on the results of a study that show e-rater's performance on Test of Written English (TWE) essay responses written by...
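  For context on the 92% figure: agreement rates reported for essay scoring systems like e-rater are typically "exact-plus-adjacent" agreement, i.e. the fraction of essays on which two scores differ by at most one point. A sketch under that assumption; the scores below are made up:

  ```python
  def adjacent_agreement(scores_a, scores_b, tolerance=1):
      """Fraction of essays where the two raters' scores differ by
      at most `tolerance` points (exact-plus-adjacent agreement)."""
      pairs = list(zip(scores_a, scores_b))
      hits = sum(abs(a - b) <= tolerance for a, b in pairs)
      return hits / len(pairs)

  # Hypothetical human vs. machine scores on a 6-point scale.
  print(adjacent_agreement([4, 3, 5, 2, 4], [4, 4, 5, 3, 2]))  # 0.8
  ```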
- Satanjeev Banerjee, Alon Lavie | journalArticle
  We describe METEOR, an automatic metric for machine translation evaluation that is based on a generalized concept of unigram matching between the machine-produced translation and human-produced reference translations. Unigrams can be matched based on their surface forms, stemmed forms, and meanings; furthermore, METEOR can be easily extended to include more advanced matching strategies. Once all generalized unigram matches between the two strings have been found, METEOR computes a score...
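  A simplified sketch of the staged matching the abstract describes: unigrams are aligned first on surface form, then on stemmed form, with each unigram matching at most once overall, and the match count feeds the recall-weighted harmonic mean from the original METEOR paper (10PR/(R+9P)). The WordNet synonym stage and the chunk-based fragmentation penalty are omitted here, and the crude prefix "stemmer" is a stand-in for a real one such as Porter's:

  ```python
  def meteor_like_score(candidate: str, reference: str) -> float:
      """Staged unigram matching (exact, then stemmed), combined
      into the recall-weighted harmonic mean 10PR / (R + 9P)."""
      cand = candidate.lower().split()
      ref = reference.lower().split()
      stem = lambda w: w[:4]            # stand-in for a real stemmer
      unmatched_c, unmatched_r = list(cand), list(ref)
      matches = 0
      for key in (lambda w: w, stem):   # stage 1: exact, stage 2: stem
          for c in list(unmatched_c):
              for r in unmatched_r:
                  if key(c) == key(r):
                      unmatched_c.remove(c)
                      unmatched_r.remove(r)
                      matches += 1
                      break
      if matches == 0:
          return 0.0
      precision = matches / len(cand)
      recall = matches / len(ref)
      return 10 * precision * recall / (recall + 9 * precision)

  print(meteor_like_score("the cat sat on the mat",
                          "the cat was sitting on the mat"))
  ```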
- Jill Burstein | journalArticle
- Lei Chen, Klaus Zechner, Su-Youn Yoon | report