4 resources

  • Elizabeth Clark, Asli Celikyilmaz, Noah ... | Dec 15th, 2019 | Conference paper
  • Elizabeth Clark, Tal August, Sofia Serra... | Dec 15th, 2021 | Preprint

    Human evaluations are typically considered the gold standard in natural language generation, but as models' fluency improves, how well can evaluators detect and judge machine-generated text? We run a study assessing non-experts' ability to distinguish between human- and machine-authored text (GPT2 and GPT3) in three domains (stories, news articles, and recipes). We find that, without training, evaluators distinguished between GPT3- and human-authored text at random chance level. We explore...

  • Valentin Hofmann, David Heineman, Ian Ma... | Sep 14th, 2025 | Preprint

    Language model (LM) benchmarking faces several challenges: comprehensive evaluations are costly, benchmarks often fail to measure the intended capabilities, and evaluation quality can degrade due to labeling errors and benchmark saturation. Although various strategies have been proposed to mitigate these issues, they tend to address individual aspects in isolation, neglecting broader questions about overall evaluation quality. Here, we introduce FLUID BENCHMARKING, a new evaluation approach...

Last update from database: 15/12/2025, 14:15 (UTC)