Full Library – Evidence Library – Artificial Intelligence in Measurement and Education

Bipartite-play Dialogue Collection for Practical Automatic Evaluation of Dialogue Systems

Shiki Sato, Yosuke Kishinami, Hiroaki Su...

|

Nov 5th, 2022

|

conferencePaper

Shiki Sato, Yosuke Kishinami, Hiroaki Su...

Nov 5th, 2022

Automation of dialogue system evaluation is a driving force for the efficient development of dialogue systems. This paper introduces the bipartite-play method, a dialogue collection method for automating dialogue system evaluation. It addresses the limitations of existing dialogue collection methods: (i) inability to compare with systems that are not publicly available, and (ii) vulnerability to cheating by intentionally selecting systems to be compared. Experimental results show that the...

Validity of Chatbot Use for Mental Health Assessment: Experimental Study

Anita Schick, Jasper Feine, Stefan Moran...

|

Oct 31st, 2022

|

journalArticle

Anita Schick, Jasper Feine, Stefan Moran...

Oct 31st, 2022

Mental disorders in adolescence and young adulthood are major public health concerns. Digital tools such as text-based conversational agents (ie, chatbots) are a promising technology for facilitating mental health assessment. However, the human-like interaction style of chatbots may induce potential biases, such as socially desirable responding (SDR), and may require further effort to complete assessments.

Towards a Unified Multi-Dimensional Evaluator for Text Generation

Ming Zhong, Yang Liu, Da Yin

|

Oct 13th, 2022

|

preprint

Ming Zhong, Yang Liu, Da Yin

Oct 13th, 2022

Multi-dimensional evaluation is the dominant paradigm for human evaluation in Natural Language Generation (NLG), i.e., evaluating the generated text from multiple explainable dimensions, such as coherence and fluency. However, automatic evaluation in NLG is still dominated by similarity-based metrics, and we lack a reliable framework for a more comprehensive evaluation of advanced models. In this paper, we propose a unified multi-dimensional evaluator UniEval for NLG. We re-frame NLG...

Of Human Criteria and Automatic Metrics: A Benchmark of the Evaluation of Story Generation

Cyril Chhun, Pierre Colombo, Chloé Clave...

|

Sep 15th, 2022

|

preprint

Cyril Chhun, Pierre Colombo, Chloé Clave...

Sep 15th, 2022

Research on Automatic Story Generation (ASG) relies heavily on human and automatic evaluation. However, there is no consensus on which human evaluation criteria to use, and no analysis of how well automatic criteria correlate with them. In this paper, we propose to re-evaluate ASG evaluation. We introduce a set of 6 orthogonal and comprehensive human criteria, carefully motivated by the social sciences literature. We also present HANNA, an annotated dataset of 1,056 stories produced by 10...

InfoLM: A New Metric to Evaluate Summarization & Data2Text Generation

Pierre Jean A. Colombo, Chloé Clavel, Pa...

|

Jun 28th, 2022

|

journalArticle

Pierre Jean A. Colombo, Chloé Clavel, Pa...

Jun 28th, 2022

Assessing the quality of natural language generation (NLG) systems through human annotation is very expensive. Additionally, human annotation campaigns are time-consuming and include non-reusable human labour. In practice, researchers rely on automatic metrics as a proxy of quality. In the last decade, many string-based metrics (e.g., BLEU or ROUGE) have been introduced. However, such metrics usually rely on exact matches and thus, do not robustly handle synonyms. In this paper, we introduce...

Outsider Oversight: Designing a Third Party Audit Ecosystem for AI Governance

Inioluwa Deborah Raji, Peggy Xu, Colleen...

|

Jun 9th, 2022

|

preprint

Inioluwa Deborah Raji, Peggy Xu, Colleen...

Jun 9th, 2022

Much attention has focused on algorithmic audits and impact assessments to hold developers and users of algorithmic systems accountable. But existing algorithmic accountability policy approaches have neglected the lessons from non-algorithmic domains: notably, the importance of interventions that allow for the effective participation of third parties. Our paper synthesizes lessons from other fields on how to craft effective systems of external oversight for algorithmic deployments. First, we...

Retrieval-Augmented Reinforcement Learning

Anirudh Goyal, Abram L. Friesen, Andrea ...

|

May 24th, 2022

|

preprint

Anirudh Goyal, Abram L. Friesen, Andrea ...

May 24th, 2022

Most deep reinforcement learning (RL) algorithms distill experience into parametric behavior policies or value functions via gradient updates. While effective, this approach has several disadvantages: (1) it is computationally expensive, (2) it can take many updates to integrate experiences into the parametric model, (3) experiences that are not fully integrated do not appropriately influence the agent's behavior, and (4) behavior is limited by the capacity of the model. In this paper we...

The U.S. can improve its AI governance strategy by addressing online biases

Nicol Turner Lee, Samantha Lai

|

May 17th, 2022

|

webpage

Nicol Turner Lee, Samantha Lai

May 17th, 2022

Stakeholders in artificial intelligence must trace back to the roots of the problems, which lie in the lack of diversity in design teams and data that continues to carry on trauma and discrimination of the past, Nicol Turner Lee and Samantha Lai write.

Now Is the Time to Build a National Data Ecosystem for Materials Science and Chemistry Research Data

Eva M. Campo, Sadasivan Shankar, Alexand...

|

Apr 13th, 2022

|

journalArticle

Eva M. Campo, Sadasivan Shankar, Alexand...

Apr 13th, 2022

Training language models to follow instructions with human feedback

Long Ouyang, Jeff Wu, Xu Jiang

|

Mar 4th, 2022

|

preprint

Long Ouyang, Jeff Wu, Xu Jiang

Mar 4th, 2022

Making language models bigger does not inherently make them better at following a user's intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users. In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback. Starting with a set of labeler-written prompts and prompts submitted through...

Evaluation Metrics for Health Chatbots: A Delphi Study

Kerstin Denecke, Alaa Abd-Alrazaq, Mowaf...

|

Oct 31st, 2021

|

journalArticle

Kerstin Denecke, Alaa Abd-Alrazaq, Mowaf...

Oct 31st, 2021

Abstract Background In recent years, an increasing number of health chatbots has been published in app stores and described in research literature. Given the sensitive data they are processing and the care settings for which they are developed, evaluation is essential to avoid harm to users. However, evaluations of those systems are reported inconsistently and without using a standardized set of evaluation metrics. Missing standards in health chatbot evaluation prevent...

FairER: Entity Resolution With Fairness Constraints

Vasilis Efthymiou, Kostas Stefanidis, Ev...

|

Oct 26th, 2021

|

conferencePaper

Vasilis Efthymiou, Kostas Stefanidis, Ev...

Oct 26th, 2021

CIDEr-R: Robust Consensus-based Image Description Evaluation

Gabriel Oliveira dos Santos, Esther Luna...

|

Sep 28th, 2021

|

preprint

Gabriel Oliveira dos Santos, Esther Luna...

Sep 28th, 2021

This paper shows that CIDEr-D, a traditional evaluation metric for image description, does not work properly on datasets where the number of words in the sentence is significantly greater than those in the MS COCO Captions dataset. We also show that CIDEr-D has performance hampered by the lack of multiple reference sentences and high variance of sentence length. To bypass this problem, we introduce CIDEr-R, which improves CIDEr-D, making it more flexible in dealing with datasets with high...

All That's 'Human' Is Not Gold: Evaluating Human Evaluation of Generated Text

Elizabeth Clark, Tal August, Sofia Serra...

|

Jul 7th, 2021

|

preprint

Elizabeth Clark, Tal August, Sofia Serra...

Jul 7th, 2021

Human evaluations are typically considered the gold standard in natural language generation, but as models' fluency improves, how well can evaluators detect and judge machine-generated text? We run a study assessing non-experts' ability to distinguish between human- and machine-authored text (GPT2 and GPT3) in three domains (stories, news articles, and recipes). We find that, without training, evaluators distinguished between GPT3- and human-authored text at random chance level. We explore...

Designing Effective Interview Chatbots: Automatic Chatbot Profiling and Design Suggestion Generation for Chatbot Debugging

Xu Han, Michelle Zhou, Matthew J. Turner...

|

May 6th, 2021

|

conferencePaper

Xu Han, Michelle Zhou, Matthew J. Turner...

May 6th, 2021

Evaluating Dialogue Systems

Michael McTear, Michael McTear

|

Jul 5th, 2021

|

bookSection

Michael McTear, Michael McTear

Jul 5th, 2021

BLEU, METEOR, BERTScore: Evaluation of Metrics Performance in Assessing Critical Translation Errors in Sentiment-oriented Text

University of Wolverhampton, UK, Hadeel ...

|

Jul 5th, 2021

|

conferencePaper

University of Wolverhampton, UK, Hadeel ...

Jul 5th, 2021

Bot-Adversarial Dialogue for Safe Conversational Agents

Jing Xu, Da Ju, Margaret Li

|

Jul 5th, 2021

|

conferencePaper

Jing Xu, Da Ju, Margaret Li

Jul 5th, 2021

BARTScore: Evaluating Generated Text as Text Generation

Weizhe Yuan, Graham Neubig, Pengfei Liu,...

|

Jul 5th, 2021

|

journalArticle

Weizhe Yuan, Graham Neubig, Pengfei Liu,...

Jul 5th, 2021

A wide variety of NLP applications, such as machine translation, summarization, and dialog, involve text generation. One major challenge for these applications is how to evaluate whether such generated texts are actually fluent, accurate, or effective. In this work, we conceptualize the evaluation of generated text as a text generation problem, modeled using pre-trained sequence-to-sequence models. The general idea is that models trained to convert the generated text to/from a reference...

BERTScore: Evaluating Text Generation with BERT

Tianyi Zhang, Varsha Kishore, Felix Wu

|

Feb 24th, 2020

|

preprint

Tianyi Zhang, Varsha Kishore, Felix Wu

Feb 24th, 2020

We propose BERTScore, an automatic evaluation metric for text generation. Analogously to common metrics, BERTScore computes a similarity score for each token in the candidate sentence with each token in the reference sentence. However, instead of exact matches, we compute token similarity using contextual embeddings. We evaluate using the outputs of 363 machine translation and image captioning systems. BERTScore correlates better with human judgments and provides stronger model selection...

Search

Empirical studies

Empirical studies

Technical methods

Publication year