5 resources

  • Shiki Sato, Yosuke Kishinami, Hiroaki Su... | Nov 14th, 2022 | conferencePaper

    Automation of dialogue system evaluation is a driving force for the efficient development of dialogue systems. This paper introduces the bipartite-play method, a dialogue collection method for automating dialogue system evaluation. It addresses the limitations of existing dialogue collection methods: (i) inability to compare with systems that are not publicly available, and (ii) vulnerability to cheating by intentionally selecting systems to be compared. Experimental results show that the...

  • Anita Schick, Jasper Feine, Stefan Moran... | Oct 31st, 2022 | journalArticle

    Mental disorders in adolescence and young adulthood are major public health concerns. Digital tools such as text-based conversational agents (ie, chatbots) are a promising technology for facilitating mental health assessment. However, the human-like interaction style of chatbots may induce potential biases, such as socially desirable responding (SDR), and may require further effort to complete assessments.

  • Ming Zhong, Yang Liu, Da Yin | Oct 13th, 2022 | preprint

    Multi-dimensional evaluation is the dominant paradigm for human evaluation in Natural Language Generation (NLG), i.e., evaluating the generated text from multiple explainable dimensions, such as coherence and fluency. However, automatic evaluation in NLG is still dominated by similarity-based metrics, and we lack a reliable framework for a more comprehensive evaluation of advanced models. In this paper, we propose a unified multi-dimensional evaluator UniEval for NLG. We re-frame NLG...

  • Cyril Chhun, Pierre Colombo, Chloé Clave... | Sep 15th, 2022 | preprint

    Research on Automatic Story Generation (ASG) relies heavily on human and automatic evaluation. However, there is no consensus on which human evaluation criteria to use, and no analysis of how well automatic criteria correlate with them. In this paper, we propose to re-evaluate ASG evaluation. We introduce a set of 6 orthogonal and comprehensive human criteria, carefully motivated by the social sciences literature. We also present HANNA, an annotated dataset of 1,056 stories produced by 10...

  • Pierre Jean A. Colombo, Chloé Clavel, Pa... | Jun 28th, 2022 | journalArticle

    Assessing the quality of natural language generation (NLG) systems through human annotation is very expensive. Additionally, human annotation campaigns are time-consuming and include non-reusable human labour. In practice, researchers rely on automatic metrics as a proxy of quality. In the last decade, many string-based metrics (e.g., BLEU or ROUGE) have been introduced. However, such metrics usually rely on exact matches and thus do not robustly handle synonyms. In this paper, we introduce...
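
    The exact-match limitation noted in this abstract is easy to see with a toy unigram-precision score, a hypothetical and much simplified stand-in for BLEU-1 (not the metric proposed in the paper): a synonym-substituted paraphrase scores far lower than a near-verbatim candidate even though both convey the reference's meaning. A minimal Python sketch, assuming whitespace tokenisation and no smoothing or brevity penalty:

    # Toy illustration of why exact-match n-gram metrics penalise synonyms.
    # `unigram_precision` is a hypothetical, simplified stand-in for BLEU-1
    # (no brevity penalty, no smoothing) -- illustrative only.
    from collections import Counter

    def unigram_precision(candidate: str, reference: str) -> float:
        cand_counts = Counter(candidate.lower().split())
        ref_counts = Counter(reference.lower().split())
        # Clip each candidate token count by its count in the reference.
        matches = sum(min(c, ref_counts[tok]) for tok, c in cand_counts.items())
        total = sum(cand_counts.values())
        return matches / total if total else 0.0

    reference = "the movie was fantastic"
    verbatim = "the movie was fantastic indeed"   # near-exact copy
    paraphrase = "the film was great"             # film/movie, great/fantastic are synonyms

    print(unigram_precision(verbatim, reference))    # 0.8 -> rewarded
    print(unigram_precision(paraphrase, reference))  # 0.5 -> penalised despite equivalent meaning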

Last update from database: 02/03/2025, 19:15 (UTC)