62 resources
-
Jing Xu, Da Ju, Margaret Li|Jul 1st, 2021|conferencePaper
-
Weizhe Yuan, Graham Neubig, Pengfei Liu,...|Jul 1st, 2021|journalArticle
A wide variety of NLP applications, such as machine translation, summarization, and dialog, involve text generation. One major challenge for these applications is how to evaluate whether such generated texts are actually fluent, accurate, or effective. In this work, we conceptualize the evaluation of generated text as a text generation problem, modeled using pre-trained sequence-to-sequence models. The general idea is that models trained to convert the generated text to/from a reference...
-
Tianyi Zhang, Varsha Kishore, Felix Wu|Feb 24th, 2020|preprint
We propose BERTScore, an automatic evaluation metric for text generation. Analogously to common metrics, BERTScore computes a similarity score for each token in the candidate sentence with each token in the reference sentence. However, instead of exact matches, we compute token similarity using contextual embeddings. We evaluate using the outputs of 363 machine translation and image captioning systems. BERTScore correlates better with human judgments and provides stronger model selection...
-
Esin Durmus, He He, Mona Diab|Jul 1st, 2020|conferencePaper
-
Shikib Mehri, Maxine Eskenazi|Jul 1st, 2020|conferencePaper
-
Thibault Sellam, Dipanjan Das, Ankur Par...|Jul 1st, 2020|conferencePaper
-
V Vijayaraghavan, Jack Brian Cooper, oth...|Jul 1st, 2020|journalArticle
-
Samuel Holmes, Anne Moorhead, Raymond Bo...|Sep 10th, 2019|conferencePaper
-
Elizabeth Clark, Asli Celikyilmaz, Noah ...|Jul 1st, 2019|conferencePaper
-
Chris Van Der Lee, Albert Gatt, Emiel Va...|Jul 1st, 2019|conferencePaper
-
Kavita Ganesan|Mar 5th, 2018|preprint
Evaluating summarization tasks is crucial to determining the quality of machine-generated summaries. Over the last decade, ROUGE has become the standard automatic evaluation measure for summarization tasks. While ROUGE has been shown to be effective in capturing n-gram overlap between system and human-composed summaries, the existing ROUGE measures have several limitations in capturing synonymous concepts and coverage of topics. Thus, often times...
-
Ryan Lowe, Michael Noseworthy, Iulian Vl...|Jul 1st, 2017|conferencePaper
-
Ramakrishna Vedantam, C. Lawrence Zitnic...|Jun 2nd, 2015|preprint
Automatically describing an image with a sentence is a long-standing challenge in computer vision and natural language processing. Due to recent progress in object detection, attribute classification, action recognition, etc., there is renewed interest in this area. However, evaluating the quality of descriptions has proven to be challenging. We propose a novel paradigm for evaluating image descriptions that uses human consensus. This paradigm consists of three main parts: a new...
-
Sami Virpioja, Stig-Arne Grönroos|Jul 1st, 2015|conferencePaper
-
David Hutchison, Takeo Kanade, Josef Kit...|Jul 1st, 2013|bookSection
-
Lise Getoor, Ashwin Machanavajjhala|Aug 1st, 2012|journalArticle
This tutorial brings together perspectives on entity resolution (ER) from a variety of fields, including databases, machine learning, natural language processing, and information retrieval, to provide, in one setting, a survey of a large body of work. We discuss both the practical aspects and theoretical underpinnings of ER. We describe existing solutions, current challenges, and open research problems.
-
Ehud Reiter, Anja Belz|Dec 1st, 2009|journalArticle
There is growing interest in using automatically computed corpus-based evaluation metrics to evaluate Natural Language Generation (NLG) systems, because these are often considerably cheaper than the human-based evaluations which have traditionally been used in NLG. We review previous work on NLG evaluation and on validation of automatic metrics in NLP, and then present the results of two studies of how well some metrics which are popular in other areas of NLP (notably BLEU and ROUGE)...
-
Joseph P. Turian, Luke Shen, I. Dan Mela...|Jan 1st, 2006|conferencePaper
Evaluation of MT evaluation measures is limited by inconsistent human judgment data. Nonetheless, machine translation can be evaluated using the well-known measures precision, recall, and their average, the F-measure. The unigram-based F-measure has significantly higher correlation with human judgments than recently proposed alternatives. More importantly, this standard measure has an intuitive graphical interpretation, which can facilitate insight into how MT systems might be improved. The...
-
David Hutchison, Takeo Kanade, Josef Kit...|Jul 1st, 2004|bookSection
-
Chin-Yew Lin, Franz Josef Och|Jul 1st, 2004|conferencePaper