Results – Evidence Library – Artificial Intelligence in Measurement and Education

Evaluation Metrics for Health Chatbots: A Delphi Study

Kerstin Denecke, Alaa Abd-Alrazaq, Mowaf...

|

Oct 31st, 2021

|

journalArticle

Kerstin Denecke, Alaa Abd-Alrazaq, Mowaf...

Oct 31st, 2021

Abstract Background In recent years, an increasing number of health chatbots has been published in app stores and described in research literature. Given the sensitive data they are processing and the care settings for which they are developed, evaluation is essential to avoid harm to users. However, evaluations of those systems are reported inconsistently and without using a standardized set of evaluation metrics. Missing standards in health chatbot evaluation prevent...

FairER: Entity Resolution With Fairness Constraints

Vasilis Efthymiou, Kostas Stefanidis, Ev...

|

Oct 26th, 2021

|

conferencePaper

Vasilis Efthymiou, Kostas Stefanidis, Ev...

Oct 26th, 2021

CIDEr-R: Robust Consensus-based Image Description Evaluation

Gabriel Oliveira dos Santos, Esther Luna...

|

Sep 28th, 2021

|

preprint

Gabriel Oliveira dos Santos, Esther Luna...

Sep 28th, 2021

This paper shows that CIDEr-D, a traditional evaluation metric for image description, does not work properly on datasets where the number of words in the sentence is significantly greater than those in the MS COCO Captions dataset. We also show that CIDEr-D has performance hampered by the lack of multiple reference sentences and high variance of sentence length. To bypass this problem, we introduce CIDEr-R, which improves CIDEr-D, making it more flexible in dealing with datasets with high...

All That's 'Human' Is Not Gold: Evaluating Human Evaluation of Generated Text

Elizabeth Clark, Tal August, Sofia Serra...

|

Jul 7th, 2021

|

preprint

Elizabeth Clark, Tal August, Sofia Serra...

Jul 7th, 2021

Human evaluations are typically considered the gold standard in natural language generation, but as models' fluency improves, how well can evaluators detect and judge machine-generated text? We run a study assessing non-experts' ability to distinguish between human- and machine-authored text (GPT2 and GPT3) in three domains (stories, news articles, and recipes). We find that, without training, evaluators distinguished between GPT3- and human-authored text at random chance level. We explore...

Designing Effective Interview Chatbots: Automatic Chatbot Profiling and Design Suggestion Generation for Chatbot Debugging

Xu Han, Michelle Zhou, Matthew J. Turner...

|

May 6th, 2021

|

conferencePaper

Xu Han, Michelle Zhou, Matthew J. Turner...

May 6th, 2021

Evaluating Dialogue Systems

Michael McTear, Michael McTear

|

Jul 1st, 2021

|

bookSection

Michael McTear, Michael McTear

Jul 1st, 2021

BLEU, METEOR, BERTScore: Evaluation of Metrics Performance in Assessing Critical Translation Errors in Sentiment-oriented Text

University of Wolverhampton, UK, Hadeel ...

|

Jul 1st, 2021

|

conferencePaper

University of Wolverhampton, UK, Hadeel ...

Jul 1st, 2021

Bot-Adversarial Dialogue for Safe Conversational Agents

Jing Xu, Da Ju, Margaret Li

|

Jul 1st, 2021

|

conferencePaper

Jing Xu, Da Ju, Margaret Li

Jul 1st, 2021

BARTScore: Evaluating Generated Text as Text Generation

Weizhe Yuan, Graham Neubig, Pengfei Liu,...

|

Jul 1st, 2021

|

journalArticle

Weizhe Yuan, Graham Neubig, Pengfei Liu,...

Jul 1st, 2021

A wide variety of NLP applications, such as machine translation, summarization, and dialog, involve text generation. One major challenge for these applications is how to evaluate whether such generated texts are actually fluent, accurate, or effective. In this work, we conceptualize the evaluation of generated text as a text generation problem, modeled using pre-trained sequence-to-sequence models. The general idea is that models trained to convert the generated text to/from a reference...

Search

Technical methods

Publication year