Results – Evidence Library – Artificial Intelligence in Measurement and Education

An Investigation into the Validity of Some Metrics for Automatically Evaluating Natural Language Generation Systems

Ehud Reiter, Anja Belz | Dec 24th, 2009 | journalArticle

There is growing interest in using automatically computed corpus-based evaluation metrics to evaluate Natural Language Generation (NLG) systems, because these are often considerably cheaper than the human-based evaluations which have traditionally been used in NLG. We review previous work on NLG evaluation and on validation of automatic metrics in NLP, and then present the results of two studies of how well some metrics which are popular in other areas of NLP (notably BLEU and ROUGE)...

Web-based essay critiquing system and EFL students' writing: a quantitative and qualitative investigation

Cynthia Lee, Kelvin C.K. Wong, William K... | Feb 24th, 2009 | journalArticle

SpeechRater™: A Construct-Driven Approach to Scoring Spontaneous Non-Native Speech

Klaus Zechner, Derrick Higgins, Xiaoming... | Oct 1st, 2007 | journalArticle

Toward More Substantively Meaningful Automated Essay Scoring

Anat Ben-Simon, Randy Elliot Bennett | Apr 24th, 2007 | journalArticle

This study evaluated a “substantively driven” method for scoring NAEP writing assessments automatically. The study used variations of an existing commercial program, e-rater®, to compare the performance of three approaches to automated essay scoring: a brute-empirical approach in which variables are selected and weighted solely according to statistical criteria, a hybrid approach in which a fixed set of variables more closely tied to the characteristics of good writing was used but the...

Evaluation of machine translation and its evaluation

Joseph P. Turian, Luke Shen, I. Dan Mela... | Jan 1st, 2006 | conferencePaper

Evaluation of MT evaluation measures is limited by inconsistent human judgment data. Nonetheless, machine translation can be evaluated using the well-known measures precision, recall, and their average, the F-measure. The unigram-based F-measure has significantly higher correlation with human judgments than recently proposed alternatives. More importantly, this standard measure has an intuitive graphical interpretation, which can facilitate insight into how MT systems might be improved. The...

The Significance of Recall in Automatic Metrics for MT Evaluation

Alon Lavie, Kenji Sagae, Shyamsundar Jay... | Apr 24th, 2004 | bookSection

Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics

Chin-Yew Lin, Franz Josef Och | Apr 24th, 2004 | conferencePaper

Automatic Item Generation via Frame Semantics: Natural Language Generation of Math Word Problems

Paul Deane, Kathleen M. Sheehan | Apr 24th, 2003 | conferencePaper

Automatic evaluation of machine translation quality using n-gram co-occurrence statistics

George Doddington | Apr 24th, 2002 | conferencePaper

Automated Essay Scoring Using Bayes' Theorem

Lawrence M. Rudner, T. Liang | Apr 24th, 2002 | conferencePaper

BLEU: a method for automatic evaluation of machine translation

Kishore Papineni, Salim Roukos, Todd War... | Apr 24th, 2001 | conferencePaper
