32 resources
-
S Christie, Baptiste Moreau-Pernet, Yu T... | Jul 24th, 2024 | conference paper
Large language models (LLMs) are increasingly being deployed in user-facing applications in educational settings. Deployed applications often augment LLMs with fine-tuning, custom system prompts, and moderation layers to achieve particular goals. However, the behaviors of LLM-powered systems are difficult to guarantee, and most existing evaluations focus instead on the performance of unmodified 'foundation' models. Tools for evaluating such deployed systems are currently sparse, inflexible,...
-
Chi-Min Chan, Weize Chen, Yusheng Su | Aug 14th, 2023 | preprint
Text evaluation has historically posed significant challenges, often demanding substantial labor and time cost. With the emergence of large language models (LLMs), researchers have explored LLMs' potential as alternatives for human evaluation. While these single-agent-based approaches show promise, experimental results suggest that further advancements are needed to bridge the gap between their current effectiveness and human-level evaluation quality. Recognizing that best practices of human...
-
Isabel O. Gallegos, Ryan A. Rossi, Joe B... | Dec 27th, 2023 | preprint
Rapid advancements of large language models (LLMs) have enabled the processing, understanding, and generation of human-like text, with increasing integration into systems that touch our social sphere. Despite this success, these models can learn, perpetuate, and amplify harmful social biases. In this paper, we present a comprehensive survey of bias evaluation and mitigation techniques for LLMs. We first consolidate, formalize, and expand notions of social bias and fairness in natural...
-
Lei Huang, Weijiang Yu, Weitao Ma | Nov 9th, 2023 | preprint
The emergence of large language models (LLMs) has marked a significant breakthrough in natural language processing (NLP), leading to remarkable advancements in text understanding and generation. Nevertheless, alongside these strides, LLMs exhibit a critical tendency to produce hallucinations, resulting in content that is inconsistent with real-world facts or user inputs. This phenomenon poses substantial challenges to their practical deployment and raises concerns over the reliability of...
-
Steven Moore, John Stamper, Richard Tong... | Jul 7th, 2023 | conference paper
-
Matyáš Boháček, Steven Moore, John Stamp... | Jul 7th, 2023 | conference paper
-
Yejin Bang, Samuel Cahyawijaya, Nayeon L... | Feb 28th, 2023 | preprint
This paper proposes a framework for quantitatively evaluating interactive LLMs such as ChatGPT using publicly available data sets. We carry out an extensive technical evaluation of ChatGPT using 23 data sets covering 8 different common NLP application tasks. We evaluate the multitask, multilingual and multi-modal aspects of ChatGPT based on these data sets and a newly designed multimodal dataset. We find that ChatGPT outperforms LLMs with zero-shot learning on most tasks and even outperforms...
-
Andrew M. Olney, Steven Moore, John Stam... | Jul 7th, 2023 | conference paper
-
Md Rayhan Kabir, Fuhua Lin, Steven Moore... | Jul 7th, 2023 | conference paper
-
Shashank Sonkar, Richard G. Baraniuk, St... | Jul 7th, 2023 | conference paper
-
Qianou Christina Ma, Sherry Tongshuang W... | Jul 7th, 2023 | conference paper
-
Shouvik Ahmed Antu, Haiyan Chen, Cindy K... | Jul 7th, 2023 | conference paper
-
Benjamin D. Nye, Dillon Mee, Mark G. Cor... | Jul 7th, 2023 | conference paper
-
Gautam Yadav, Ying-Jui Tseng, Xiaolin Ni... | Jul 7th, 2023 | conference paper
-
Alex Goslen, Yeo Jin Kim, Jonathan Rowe,... | Jul 7th, 2023 | conference paper
-
Daniel Leiker, Sara Finnigan, Ashley Ric... | Jul 7th, 2023 | conference paper