An Investigation into the Validity of Some Metrics for Automatically Evaluating Natural Language Generation Systems

Reiter, Ehud; Belz, Anja

doi:10.1162/coli.2009.35.4.35405

Article Status

Published

Authors/contributors

Reiter, Ehud (Author)
Belz, Anja (Author)

Title

An Investigation into the Validity of Some Metrics for Automatically Evaluating Natural Language Generation Systems

Abstract

There is growing interest in using automatically computed corpus-based evaluation metrics to evaluate Natural Language Generation (NLG) systems, because these are often considerably cheaper than the human-based evaluations which have traditionally been used in NLG. We review previous work on NLG evaluation and on validation of automatic metrics in NLP, and then present the results of two studies of how well some metrics which are popular in other areas of NLP (notably BLEU and ROUGE) correlate with human judgments in the domain of computer-generated weather forecasts. Our results suggest that, at least in this domain, metrics may provide a useful measure of language quality, although the evidence for this is not as strong as we would ideally like to see; however, they do not provide a useful measure of content quality. We also discuss a number of caveats which must be kept in mind when interpreting this and other validation studies.

Publication

Computational Linguistics

Volume

35

Issue

4

Pages

529-558

Date

2009-12

Journal Abbr

Comput. Linguist.

Language

en

DOI

10.1162/coli.2009.35.4.35405

ISSN

0891-2017

URL

https://direct.mit.edu/coli/article/35/4/529-558/2024

Accessed

2023-10-27, 17:31

Library Catalogue

DOI.org (Crossref)

Extra

AI summary: The results of two studies of how well some metrics which are popular in other areas of NLP correlate with human judgments in the domain of computer-generated weather forecasts suggest that, at least in this domain, metrics may provide a useful measure of language quality, although the evidence for this is not as strong as one would ideally like to see.

Citation

Reiter, E., & Belz, A. (2009). An Investigation into the Validity of Some Metrics for Automatically Evaluating Natural Language Generation Systems. Computational Linguistics, 35(4), 529–558. https://doi.org/10.1162/coli.2009.35.4.35405

Technical methods

model evaluation subgroup

Link to this record

https://aievidencehub.org/lib/MLFGQWMF