218 resources
- Inioluwa Deborah Raji, Peggy Xu, Colleen... | Jun 9th, 2022 | preprint
Much attention has focused on algorithmic audits and impact assessments to hold developers and users of algorithmic systems accountable. But existing algorithmic accountability policy approaches have neglected the lessons from non-algorithmic domains: notably, the importance of interventions that allow for the effective participation of third parties. Our paper synthesizes lessons from other fields on how to craft effective systems of external oversight for algorithmic deployments. First, we...
- Anirudh Goyal, Abram L. Friesen, Andrea ... | May 24th, 2022 | preprint
Most deep reinforcement learning (RL) algorithms distill experience into parametric behavior policies or value functions via gradient updates. While effective, this approach has several disadvantages: (1) it is computationally expensive, (2) it can take many updates to integrate experiences into the parametric model, (3) experiences that are not fully integrated do not appropriately influence the agent's behavior, and (4) behavior is limited by the capacity of the model. In this paper we...
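The non-parametric alternative this abstract gestures at can be illustrated with a toy nearest-neighbour memory over stored experiences. This is a generic sketch of retrieval-based control, not the paper's actual method: the `ExperienceStore` class, its Euclidean-distance retrieval, and the action-averaging rule are all illustrative assumptions.

```python
import numpy as np

class ExperienceStore:
    """Toy non-parametric memory: store (state, action) pairs and act by
    averaging the actions taken in the k nearest stored states."""

    def __init__(self, k=3):
        self.k = k
        self.states, self.actions = [], []

    def add(self, state, action):
        self.states.append(np.asarray(state, dtype=float))
        self.actions.append(float(action))

    def act(self, state):
        # Retrieve the k closest past states by Euclidean distance and
        # return the mean of their actions. There are no gradient updates,
        # so a newly added experience influences behaviour immediately --
        # the contrast with parametric distillation that the abstract draws.
        dists = np.linalg.norm(
            np.stack(self.states) - np.asarray(state, dtype=float), axis=1
        )
        nearest = np.argsort(dists)[: self.k]
        return float(np.mean(np.asarray(self.actions)[nearest]))
```

The trade-off the abstract lists runs the other way here: lookup cost grows with the number of stored experiences, but integration of new experience is instant and capacity is not bounded by model parameters.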
- Nicol Turner Lee, Samantha Lai | May 17th, 2022 | webpage
Stakeholders in artificial intelligence must trace the problems back to their roots, which lie in the lack of diversity in design teams and in data that carries forward the trauma and discrimination of the past, Nicol Turner Lee and Samantha Lai write.
- Long Ouyang, Jeff Wu, Xu Jiang | Mar 4th, 2022 | preprint
Making language models bigger does not inherently make them better at following a user's intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users. In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback. Starting with a set of labeler-written prompts and prompts submitted through...
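A central ingredient of the fine-tuning-with-human-feedback pipeline this abstract describes is a reward model trained on pairwise human preferences. The sketch below shows only that pairwise (Bradley-Terry style) loss in numpy; it is a minimal illustration, not the paper's implementation, and the function name and scalar-reward interface are assumptions.

```python
import numpy as np

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise preference loss for training a reward model:
    -log sigmoid(r_chosen - r_rejected). It is minimised when the
    reward assigned to the human-preferred response exceeds the
    reward assigned to the rejected one."""
    margin = reward_chosen - reward_rejected
    return float(-np.log(1.0 / (1.0 + np.exp(-margin))))
```

In the full pipeline, a reward model trained with this kind of objective then supplies the training signal for reinforcement-learning fine-tuning of the language model itself.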
- Kerstin Denecke, Alaa Abd-Alrazaq, Mowaf... | Oct 31st, 2021 | journal article
Background: In recent years, an increasing number of health chatbots have been published in app stores and described in the research literature. Given the sensitive data they process and the care settings for which they are developed, evaluation is essential to avoid harm to users. However, evaluations of these systems are reported inconsistently and without a standardized set of evaluation metrics. Missing standards in health chatbot evaluation prevent...
- Vasilis Efthymiou, Kostas Stefanidis, Ev... | Oct 26th, 2021 | conference paper
- Gabriel Oliveira dos Santos, Esther Luna... | Sep 28th, 2021 | preprint
This paper shows that CIDEr-D, a traditional evaluation metric for image description, does not work properly on datasets where sentences are significantly longer than those in the MS COCO Captions dataset. We also show that CIDEr-D's performance is hampered by the lack of multiple reference sentences and by high variance in sentence length. To bypass these problems, we introduce CIDEr-R, which improves on CIDEr-D, making it more flexible in dealing with datasets with high...
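The consensus idea underlying the CIDEr family can be sketched as cosine similarity between n-gram count vectors of a candidate and a reference caption. This is a deliberately simplified stand-in: real CIDEr-D adds corpus-level TF-IDF weighting, multiple references per image, and a length penalty, none of which appear here.

```python
from collections import Counter
import math

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_cosine(candidate, reference, n=2):
    """Cosine similarity between the n-gram count vectors of two token
    lists -- the core consensus measure behind CIDEr, minus its TF-IDF
    weighting and length penalty."""
    c, r = ngrams(candidate, n), ngrams(reference, n)
    dot = sum(c[g] * r[g] for g in c)
    norm = (math.sqrt(sum(v * v for v in c.values()))
            * math.sqrt(sum(v * v for v in r.values())))
    return dot / norm if norm else 0.0
```

The sensitivity to sentence length that the abstract criticises is visible even in this toy form: longer candidates dilute the overlap in the count vectors, shrinking the cosine score.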
- Elizabeth Clark, Tal August, Sofia Serra... | Jul 7th, 2021 | preprint
Human evaluations are typically considered the gold standard in natural language generation, but as models' fluency improves, how well can evaluators detect and judge machine-generated text? We run a study assessing non-experts' ability to distinguish between human- and machine-authored text (GPT2 and GPT3) in three domains (stories, news articles, and recipes). We find that, without training, evaluators distinguished between GPT3- and human-authored text at random chance level. We explore...
- Xu Han, Michelle Zhou, Matthew J. Turner... | May 6th, 2021 | conference paper
- Michael McTear | Mar 14th, 2021 | book section
- University of Wolverhampton, UK, Hadeel ... | Mar 14th, 2021 | conference paper
- Jing Xu, Da Ju, Margaret Li | Mar 14th, 2021 | conference paper
- Weizhe Yuan, Graham Neubig, Pengfei Liu,... | Mar 14th, 2021 | journal article
A wide variety of NLP applications, such as machine translation, summarization, and dialog, involve text generation. One major challenge for these applications is how to evaluate whether such generated texts are actually fluent, accurate, or effective. In this work, we conceptualize the evaluation of generated text as a text generation problem, modeled using pre-trained sequence-to-sequence models. The general idea is that models trained to convert the generated text to/from a reference...
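The "evaluation as text generation" idea can be illustrated by scoring a reference by its likelihood under a model conditioned on the generated text. In the real formulation a pretrained sequence-to-sequence model provides that likelihood; the smoothed unigram model below is a toy stand-in, and the function name and smoothing parameters are assumptions.

```python
import math
from collections import Counter

def generation_score(candidate, reference, vocab_size=10_000, alpha=0.1):
    """Toy 'evaluation as generation' score: average log-probability of
    the reference tokens under an additively smoothed unigram model
    built from the candidate. A pretrained seq2seq model plays this
    role in the real formulation."""
    counts = Counter(candidate)
    total = sum(counts.values())

    def log_p(token):
        # Additive (Laplace-style) smoothing so unseen tokens get
        # nonzero probability.
        return math.log((counts[token] + alpha) / (total + alpha * vocab_size))

    return sum(log_p(t) for t in reference) / len(reference)
```

Higher (less negative) scores mean the candidate makes the reference more probable, which is the direction of correlation with quality that this line of work relies on.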
- Tianyi Zhang, Varsha Kishore, Felix Wu | Feb 24th, 2020 | preprint
We propose BERTScore, an automatic evaluation metric for text generation. Analogously to common metrics, BERTScore computes a similarity score for each token in the candidate sentence with each token in the reference sentence. However, instead of exact matches, we compute token similarity using contextual embeddings. We evaluate using the outputs of 363 machine translation and image captioning systems. BERTScore correlates better with human judgments and provides stronger model selection...
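The greedy-matching computation the abstract describes can be sketched directly over token embedding matrices. The embeddings below are arbitrary arrays standing in for the contextual embeddings a pretrained encoder would produce, and the importance weighting and baseline rescaling of the full metric are omitted.

```python
import numpy as np

def bertscore_f1(cand_emb, ref_emb):
    """Greedy-matching F1 over token embeddings, as in BERTScore:
    each candidate token is matched to its most similar reference
    token by cosine similarity (precision), and vice versa (recall)."""
    c = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    r = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = c @ r.T                       # pairwise cosine similarities
    precision = sim.max(axis=1).mean()  # best reference match per candidate token
    recall = sim.max(axis=0).mean()     # best candidate match per reference token
    return float(2 * precision * recall / (precision + recall))
```

Because matching is done in embedding space rather than on surface forms, paraphrases with no exact n-gram overlap can still score highly, which is the metric's advantage over exact-match measures.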
- Esin Durmus, He He, Mona Diab | Mar 14th, 2020 | conference paper
- Shikib Mehri, Maxine Eskenazi | Mar 14th, 2020 | conference paper
- Thibault Sellam, Dipanjan Das, Ankur Par... | Mar 14th, 2020 | conference paper
- V Vijayaraghavan, Jack Brian Cooper, oth... | Mar 14th, 2020 | journal article