702 resources
- Cynthia Lee, Kelvin C.K. Wong, William K... | journal article | Feb 27th, 2009
- Klaus Zechner, Derrick Higgins, Xiaoming... | journal article | Oct 1st, 2007
- Anat Ben-Simon, Randy Elliot Bennett | journal article | Oct 27th, 2007
  This study evaluated a “substantively driven” method for scoring NAEP writing assessments automatically. The study used variations of an existing commercial program, e-rater®, to compare the performance of three approaches to automated essay scoring: a brute-empirical approach in which variables are selected and weighted solely according to statistical criteria, a hybrid approach in which a fixed set of variables more closely tied to the characteristics of good writing was used but the...
- Joseph P. Turian, Luke Shen, I. Dan Mela... | conference paper | Jan 1st, 2006
  Evaluation of MT evaluation measures is limited by inconsistent human judgment data. Nonetheless, machine translation can be evaluated using the well-known measures precision, recall, and their average, the F-measure. The unigram-based F-measure has significantly higher correlation with human judgments than recently proposed alternatives. More importantly, this standard measure has an intuitive graphical interpretation, which can facilitate insight into how MT systems might be improved. The...
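  As a point of reference, here is a minimal sketch of the unigram-based F-measure the abstract describes, not the authors' implementation: clipped unigram matches against a single reference give precision and recall, combined as the harmonic mean conventionally called the F-measure. The function name and whitespace tokenization are illustrative assumptions.

  ```python
  from collections import Counter

  def unigram_f_measure(candidate: str, reference: str) -> float:
      """Unigram F-measure of a candidate translation against one reference.

      Sketch only: each candidate unigram is matched at most as many
      times as it occurs in the reference (clipping), and precision
      and recall are combined as the standard harmonic mean.
      """
      cand = Counter(candidate.lower().split())
      ref = Counter(reference.lower().split())
      matches = sum((cand & ref).values())  # clipped unigram matches
      if matches == 0:
          return 0.0
      precision = matches / sum(cand.values())
      recall = matches / sum(ref.values())
      return 2 * precision * recall / (precision + recall)
  ```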
- Alon Lavie, Kenji Sagae, Shyamsundar Jay... | book section | Oct 27th, 2004
- Chin-Yew Lin, Franz Josef Och | conference paper | Oct 27th, 2004
- Paul Deane, Kathleen M. Sheehan | conference paper | Apr 27th, 2003
- George Doddington | conference paper | Oct 27th, 2002
- Lawrence M. Rudner, T. Liang | conference paper | Oct 27th, 2002
- Kishore Papineni, Salim Roukos, Todd War... | conference paper | Oct 27th, 2001
- Jill Burstein, Martin Chodorow | conference paper | Oct 27th, 1999
- Jill Burstein, Martin Chodorow | conference paper | Oct 27th, 1999
  The e-rater system™ is an operational automated essay scoring system, developed at Educational Testing Service (ETS). The average agreement between human readers, and between independent human readers and e-rater, is approximately 92%. There is much interest in the larger writing community in examining the system's performance on nonnative speaker essays. This paper focuses on results of a study that show e-rater's performance on Test of Written English (TWE) essay responses written by...
- (no author or date listed) | journal article
- Satanjeev Banerjee, Alon Lavie | journal article
  We describe METEOR, an automatic metric for machine translation evaluation that is based on a generalized concept of unigram matching between the machine produced translation and human-produced reference translations. Unigrams can be matched based on their surface forms, stemmed forms, and meanings; furthermore, METEOR can be easily extended to include more advanced matching strategies. Once all generalized unigram matches between the two strings have been found, METEOR computes a score...
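  To make the matching idea concrete, here is a minimal sketch of the surface-form stage only: the stemmed and meaning-based matching stages the abstract mentions, and METEOR's fragmentation penalty, are omitted. Whitespace tokenization is an assumption; the 9:1 recall weighting is the one reported in the original METEOR paper.

  ```python
  from collections import Counter

  def meteor_surface_score(candidate: str, reference: str) -> float:
      """Surface-form unigram stage of a METEOR-style score (sketch).

      Exact-form matches yield unigram precision and recall, which are
      combined as a recall-weighted harmonic mean. Full METEOR adds
      stem/synonym matching stages and a fragmentation penalty.
      """
      cand = candidate.lower().split()
      ref = reference.lower().split()
      matches = sum((Counter(cand) & Counter(ref)).values())
      if matches == 0:
          return 0.0
      precision = matches / len(cand)
      recall = matches / len(ref)
      # Recall-weighted harmonic mean (9:1 weighting toward recall).
      return 10 * precision * recall / (recall + 9 * precision)
  ```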
- Jill Burstein | journal article
- Lei Chen, Klaus Zechner, Su-Youn Yoon | report
- demzsky | document
- J. H. Fife | journal article
  This report provides an introduction to the m-rater™ engine, ETS’s automated scoring engine for computer-delivered constructed-response items when the response is a number, an equation (or mathematical expression), or a graph. This introduction is intended to acquaint the reader with the types of items that m-rater can score, the requirements for authoring these items onscreen, the methods m-rater uses to score these items, and the features these items must possess to be reliably scored....
- Harsh Kumar, David M. Rothschild, Daniel... | journal article
  The widespread availability of large language models (LLMs) has provoked both fear and excitement in the domain of education. On one hand, there is the concern that students will offload their coursework to LLMs, limiting what they themselves learn. On the other hand, there is the hope that LLMs might serve as scalable, personalized tutors. Here we conduct a large, pre-registered experiment involving 1200 participants to investigate how exposure to LLM-based explanations affects learning. In...
- Susan Lottridge, Amy Burkhardt, Christop... | journal article
  Every year, millions of middle-school students write argumentative essays that are evaluated against a scoring rubric. However, the scores they receive don’t necessarily offer clear guidance on how to improve their essay or what they’ve done well. With advancements in natural language processing technology, we now have the capability to provide more detailed feedback. At this juncture, we’ve developed an artificial intelligence-supported editing tool to assist students in revising their...