Full Library – Evidence Library – Artificial Intelligence in Measurement and Education

FEQA: A Question Answering Evaluation Framework for Faithfulness Assessment in Abstractive Summarization

Esin Durmus, He He, Mona Diab | Oct 18th, 2020 | conferencePaper

USR: An Unsupervised and Reference Free Evaluation Metric for Dialog Generation

Shikib Mehri, Maxine Eskenazi | Oct 18th, 2020 | conferencePaper

BLEURT: Learning Robust Metrics for Text Generation

Thibault Sellam, Dipanjan Das, Ankur Par... | Oct 18th, 2020 | conferencePaper

Algorithm inspection for chatbot performance evaluation

V Vijayaraghavan, Jack Brian Cooper, oth... | Oct 18th, 2020 | journalArticle

Usability testing of a healthcare chatbot: Can we use conventional methods to assess conversational user interfaces?

Samuel Holmes, Anne Moorhead, Raymond Bo... | Sep 10th, 2019 | conferencePaper

Sentence Mover’s Similarity: Automatic Evaluation for Multi-Sentence Texts

Elizabeth Clark, Asli Celikyilmaz, Noah ... | Oct 18th, 2019 | conferencePaper

Best practices for the human evaluation of automatically generated text

Chris Van Der Lee, Albert Gatt, Emiel Va... | Oct 18th, 2019 | conferencePaper

ROUGE 2.0: Updated and Improved Measures for Evaluation of Summarization Tasks

Kavita Ganesan | Mar 5th, 2018 | preprint

Evaluation of summarization tasks is extremely crucial to determining the quality of machine generated summaries. Over the last decade, ROUGE has become the standard automatic evaluation measure for evaluating summarization tasks. While ROUGE has been shown to be effective in capturing n-gram overlap between system and human composed summaries, there are several limitations with the existing ROUGE measures in terms of capturing synonymous concepts and coverage of topics. Thus, often times...

Towards an Automatic Turing Test: Learning to Evaluate Dialogue Responses

Ryan Lowe, Michael Noseworthy, Iulian Vl... | Oct 18th, 2017 | conferencePaper

CIDEr: Consensus-based Image Description Evaluation

Ramakrishna Vedantam, C. Lawrence Zitnic... | Jun 2nd, 2015 | preprint

Automatically describing an image with a sentence is a long-standing challenge in computer vision and natural language processing. Due to recent progress in object detection, attribute classification, action recognition, etc., there is renewed interest in this area. However, evaluating the quality of descriptions has proven to be challenging. We propose a novel paradigm for evaluating image descriptions that uses human consensus. This paradigm consists of three main parts: a new...

LeBLEU: N-gram-based Translation Evaluation Score for Morphologically Complex Languages

Sami Virpioja, Stig-Arne Grönroos | Oct 18th, 2015 | conferencePaper

‘Realness’ in Chatbots: Establishing Quantifiable Criteria

David Hutchison, Takeo Kanade, Josef Kit... | Oct 18th, 2013 | bookSection

Entity resolution: theory, practice & open challenges

Lise Getoor, Ashwin Machanavajjhala | Aug 18th, 2012 | journalArticle

This tutorial brings together perspectives on ER from a variety of fields, including databases, machine learning, natural language processing and information retrieval, to provide, in one setting, a survey of a large body of work. We discuss both the practical aspects and theoretical underpinnings of ER. We describe existing solutions, current challenges, and open research problems.

Data Mining for Education

R.S.J.d Baker, B. McGaw, P. Peterson | Oct 18th, 2010 | bookSection

An Investigation into the Validity of Some Metrics for Automatically Evaluating Natural Language Generation Systems

Ehud Reiter, Anja Belz | Dec 18th, 2009 | journalArticle

There is growing interest in using automatically computed corpus-based evaluation metrics to evaluate Natural Language Generation (NLG) systems, because these are often considerably cheaper than the human-based evaluations which have traditionally been used in NLG. We review previous work on NLG evaluation and on validation of automatic metrics in NLP, and then present the results of two studies of how well some metrics which are popular in other areas of NLP (notably BLEU and ROUGE)...

Evaluation of machine translation and its evaluation

Joseph P. Turian, Luke Shen, I. Dan Mela... | Jan 1st, 2006 | conferencePaper

Evaluation of MT evaluation measures is limited by inconsistent human judgment data. Nonetheless, machine translation can be evaluated using the well-known measures precision, recall, and their average, the F-measure. The unigram-based F-measure has significantly higher correlation with human judgments than recently proposed alternatives. More importantly, this standard measure has an intuitive graphical interpretation, which can facilitate insight into how MT systems might be improved. The...

The Significance of Recall in Automatic Metrics for MT Evaluation

David Hutchison, Takeo Kanade, Josef Kit... | Oct 18th, 2004 | bookSection

Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics

Chin-Yew Lin, Franz Josef Och | Oct 18th, 2004 | conferencePaper

Automatic evaluation of machine translation quality using n-gram co-occurrence statistics

George Doddington | Oct 18th, 2002 | conferencePaper

BLEU: a method for automatic evaluation of machine translation

Kishore Papineni, Salim Roukos, Todd War... | Oct 18th, 2001 | conferencePaper
