Results – Evidence Library – Artificial Intelligence in Measurement and Education

Bot-Adversarial Dialogue for Safe Conversational Agents

Jing Xu, Da Ju, Margaret Li

|

Jul 4th, 2021

|

conferencePaper

Jing Xu, Da Ju, Margaret Li

Jul 4th, 2021

BARTScore: Evaluating Generated Text as Text Generation

Weizhe Yuan, Graham Neubig, Pengfei Liu,...

|

Jul 4th, 2021

|

journalArticle

Weizhe Yuan, Graham Neubig, Pengfei Liu,...

Jul 4th, 2021

A wide variety of NLP applications, such as machine translation, summarization, and dialog, involve text generation. One major challenge for these applications is how to evaluate whether such generated texts are actually fluent, accurate, or effective. In this work, we conceptualize the evaluation of generated text as a text generation problem, modeled using pre-trained sequence-to-sequence models. The general idea is that models trained to convert the generated text to/from a reference...

BERTScore: Evaluating Text Generation with BERT

Tianyi Zhang, Varsha Kishore, Felix Wu

|

Feb 24th, 2020

|

preprint

Tianyi Zhang, Varsha Kishore, Felix Wu

Feb 24th, 2020

We propose BERTScore, an automatic evaluation metric for text generation. Analogously to common metrics, BERTScore computes a similarity score for each token in the candidate sentence with each token in the reference sentence. However, instead of exact matches, we compute token similarity using contextual embeddings. We evaluate using the outputs of 363 machine translation and image captioning systems. BERTScore correlates better with human judgments and provides stronger model selection...

FEQA: A Question Answering Evaluation Framework for Faithfulness Assessment in Abstractive Summarization

Esin Durmus, He He, Mona Diab

|

Jul 4th, 2020

|

conferencePaper

Esin Durmus, He He, Mona Diab

Jul 4th, 2020

USR: An Unsupervised and Reference Free Evaluation Metric for Dialog Generation

Shikib Mehri, Maxine Eskenazi

|

Jul 4th, 2020

|

conferencePaper

Shikib Mehri, Maxine Eskenazi

Jul 4th, 2020

BLEURT: Learning Robust Metrics for Text Generation

Thibault Sellam, Dipanjan Das, Ankur Par...

|

Jul 4th, 2020

|

conferencePaper

Thibault Sellam, Dipanjan Das, Ankur Par...

Jul 4th, 2020

Algorithm inspection for chatbot performance evaluation

V Vijayaraghavan, Jack Brian Cooper, oth...

|

Jul 4th, 2020

|

journalArticle

V Vijayaraghavan, Jack Brian Cooper, oth...

Jul 4th, 2020

Usability testing of a healthcare chatbot: Can we use conventional methods to assess conversational user interfaces?

Samuel Holmes, Anne Moorhead, Raymond Bo...

|

Sep 10th, 2019

|

conferencePaper

Samuel Holmes, Anne Moorhead, Raymond Bo...

Sep 10th, 2019

Sentence Mover’s Similarity: Automatic Evaluation for Multi-Sentence Texts

Elizabeth Clark, Asli Celikyilmaz, Noah ...

|

Jul 4th, 2019

|

conferencePaper

Elizabeth Clark, Asli Celikyilmaz, Noah ...

Jul 4th, 2019

Best practices for the human evaluation of automatically generated text

Chris Van Der Lee, Albert Gatt, Emiel Va...

|

Jul 4th, 2019

|

conferencePaper

Chris Van Der Lee, Albert Gatt, Emiel Va...

Jul 4th, 2019

ROUGE 2.0: Updated and Improved Measures for Evaluation of Summarization Tasks

Kavita Ganesan

|

Mar 5th, 2018

|

preprint

Kavita Ganesan

Mar 5th, 2018

Evaluation of summarization tasks is extremely crucial to determining the quality of machine generated summaries. Over the last decade, ROUGE has become the standard automatic evaluation measure for evaluating summarization tasks. While ROUGE has been shown to be effective in capturing n-gram overlap between system and human composed summaries, there are several limitations with the existing ROUGE measures in terms of capturing synonymous concepts and coverage of topics. Thus, often times...

Towards an Automatic Turing Test: Learning to Evaluate Dialogue Responses

Ryan Lowe, Michael Noseworthy, Iulian Vl...

|

Jul 4th, 2017

|

conferencePaper

Ryan Lowe, Michael Noseworthy, Iulian Vl...

Jul 4th, 2017

CIDEr: Consensus-based Image Description Evaluation

Ramakrishna Vedantam, C. Lawrence Zitnic...

|

Jun 2nd, 2015

|

preprint

Ramakrishna Vedantam, C. Lawrence Zitnic...

Jun 2nd, 2015

Automatically describing an image with a sentence is a long-standing challenge in computer vision and natural language processing. Due to recent progress in object detection, attribute classification, action recognition, etc., there is renewed interest in this area. However, evaluating the quality of descriptions has proven to be challenging. We propose a novel paradigm for evaluating image descriptions that uses human consensus. This paradigm consists of three main parts: a new...

LeBLEU: N-gram-based Translation Evaluation Score for Morphologically Complex Languages

Sami Virpioja, Stig-Arne Grönroos

|

Jul 4th, 2015

|

conferencePaper

Sami Virpioja, Stig-Arne Grönroos

Jul 4th, 2015

‘Realness’ in Chatbots: Establishing Quantifiable Criteria

David Hutchison, Takeo Kanade, Josef Kit...

|

Jul 4th, 2013

|

bookSection

David Hutchison, Takeo Kanade, Josef Kit...

Jul 4th, 2013

Entity resolution: theory, practice & open challenges

Lise Getoor, Ashwin Machanavajjhala

|

Aug 4th, 2012

|

journalArticle

Lise Getoor, Ashwin Machanavajjhala

Aug 4th, 2012

This tutorial brings together perspectives on ER from a variety of fields, including databases, machine learning, natural language processing and information retrieval, to provide, in one setting, a survey of a large body of work. We discuss both the practical aspects and theoretical underpinnings of ER. We describe existing solutions, current challenges, and open research problems.

An Investigation into the Validity of Some Metrics for Automatically Evaluating Natural Language Generation Systems

Ehud Reiter, Anja Belz

|

Dec 4th, 2009

|

journalArticle

Ehud Reiter, Anja Belz

Dec 4th, 2009

There is growing interest in using automatically computed corpus-based evaluation metrics to evaluate Natural Language Generation (NLG) systems, because these are often considerably cheaper than the human-based evaluations which have traditionally been used in NLG. We review previous work on NLG evaluation and on validation of automatic metrics in NLP, and then present the results of two studies of how well some metrics which are popular in other areas of NLP (notably BLEU and ROUGE)...

Evaluation of machine translation and its evaluation

Joseph P. Turian, Luke Shen, I. Dan Mela...

|

Jan 1st, 2006

|

conferencePaper

Joseph P. Turian, Luke Shen, I. Dan Mela...

Jan 1st, 2006

Evaluation of MT evaluation measures is limited by inconsistent human judgment data. Nonetheless, machine translation can be evaluated using the well-known measures precision, recall, and their average, the F-measure. The unigram-based F-measure has significantly higher correlation with human judgments than recently proposed alternatives. More importantly, this standard measure has an intuitive graphical interpretation, which can facilitate insight into how MT systems might be improved. The...

The Significance of Recall in Automatic Metrics for MT Evaluation

David Hutchison, Takeo Kanade, Josef Kit...

|

Jul 4th, 2004

|

bookSection

David Hutchison, Takeo Kanade, Josef Kit...

Jul 4th, 2004

Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics

Chin-Yew Lin, Franz Josef Och

|

Jul 4th, 2004

|

conferencePaper

Chin-Yew Lin, Franz Josef Och

Jul 4th, 2004

Search

Technical methods

Publication year