4 resources

  • Elizabeth Clark, Asli Celikyilmaz, Noah ... | Dec 15th, 2019 | Conference paper
  • Elizabeth Clark, Tal August, Sofia Serra... | Dec 15th, 2021 | Preprint

    Human evaluations are typically considered the gold standard in natural language generation, but as models' fluency improves, how well can evaluators detect and judge machine-generated text? We run a study assessing non-experts' ability to distinguish between human- and machine-authored text (GPT2 and GPT3) in three domains (stories, news articles, and recipes). We find that, without training, evaluators distinguished between GPT3- and human-authored text at random chance level. We explore...

  • Valentin Hofmann, David Heineman, Ian Ma... | Sep 14th, 2025 | Preprint

    Language model (LM) benchmarking faces several challenges: comprehensive evaluations are costly, benchmarks often fail to measure the intended capabilities, and evaluation quality can degrade due to labeling errors and benchmark saturation. Although various strategies have been proposed to mitigate these issues, they tend to address individual aspects in isolation, neglecting broader questions about overall evaluation quality. Here, we introduce FLUID BENCHMARKING, a new evaluation approach...

Last update from database: 15/12/2025, 14:15 (UTC)