METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments
Article Status
Published
Authors/contributors
- Banerjee, Satanjeev (Author)
- Lavie, Alon (Author)
Title
METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments
Abstract
We describe METEOR, an automatic metric for machine translation evaluation that is based on a generalized concept of unigram matching between the machine-produced translation and human-produced reference translations. Unigrams can be matched based on their surface forms, stemmed forms, and meanings; furthermore, METEOR can be easily extended to include more advanced matching strategies. Once all generalized unigram matches between the two strings have been found, METEOR computes a score for this matching using a combination of unigram precision, unigram recall, and a measure of fragmentation that is designed to directly capture how well-ordered the matched words in the machine translation are in relation to the reference. We evaluate METEOR by measuring the correlation between the metric scores and human judgments of translation quality. We compute the Pearson R correlation value between its scores and human quality assessments of the LDC TIDES 2003 Arabic-to-English and Chinese-to-English datasets. We perform segment-by-segment correlation, and show that METEOR gets an R correlation value of 0.347 on the Arabic data and 0.331 on the Chinese data. This is shown to be an improvement on using simply unigram precision, unigram recall, and their harmonic F1 combination. We also perform experiments to show the relative contributions of the various mapping modules.
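The scoring described in the abstract can be illustrated with a short sketch. The Python function below is a minimal, exact-surface-match-only illustration, not the authors' implementation: it omits the stemming and synonym matching modules, and its greedy left-to-right alignment stands in for the paper's alignment search. The constants follow the formulas published in the paper: Fmean = 10PR / (R + 9P), Penalty = 0.5 * (chunks / matches)^3, and Score = Fmean * (1 - Penalty).

def meteor_score(candidate: str, reference: str) -> float:
    """Minimal METEOR sketch using exact surface matching only."""
    cand = candidate.split()
    ref = reference.split()
    # Greedy one-to-one alignment on surface forms, left to right.
    # (The paper searches for the alignment with the fewest crossings.)
    ref_used = [False] * len(ref)
    alignment = []  # (candidate index, reference index) pairs
    for ci, w in enumerate(cand):
        for ri, r in enumerate(ref):
            if not ref_used[ri] and w == r:
                ref_used[ri] = True
                alignment.append((ci, ri))
                break
    m = len(alignment)
    if m == 0:
        return 0.0
    precision = m / len(cand)
    recall = m / len(ref)
    # Harmonic mean weighted 9:1 toward recall (the paper's Fmean).
    fmean = 10 * precision * recall / (recall + 9 * precision)
    # Count chunks: maximal runs of matches that are contiguous and
    # identically ordered in both strings.
    chunks = 1
    for (c1, r1), (c2, r2) in zip(alignment, alignment[1:]):
        if c2 != c1 + 1 or r2 != r1 + 1:
            chunks += 1
    penalty = 0.5 * (chunks / m) ** 3
    return fmean * (1 - penalty)

For example, meteor_score("the cat sat on the mat", "the cat is on the mat") matches five unigrams in two chunks, giving P = R = 5/6, Fmean of about 0.833, a penalty of 0.5 * (2/5)^3 = 0.032, and a final score of about 0.81.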
Citation
Banerjee, S., & Lavie, A. (2005). METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization (pp. 65–72). Association for Computational Linguistics.