274 resources

  • Sabina Elkins, Ekaterina Kochmar, Jackie... | Apr 4th, 2023 | preprint

    Controllable text generation (CTG) by large language models has a huge potential to transform education for teachers and students alike. Specifically, high quality and diverse question generation can dramatically reduce the load on teachers and improve the quality of their educational content. Recent work in this domain has made progress with generation, but fails to show that real teachers judge the generated questions as sufficiently useful for the classroom setting; or if instead the...

  • Jinlan Fu, See-Kiong Ng, Zhengbao Jiang,... | Apr 4th, 2023 | journalArticle

    Generative Artificial Intelligence (AI) has enabled the development of sophisticated models that are capable of producing high-caliber text, images, and other outputs through the utilization of large pre-trained models. Nevertheless, assessing the quality of the generation is an even more arduous task than the generation itself, and this issue has not been given adequate consideration recently. This paper proposes a novel evaluation framework, GPTScore, which utilizes the emergent abilities...

  • Isabel O. Gallegos, Ryan A. Rossi, Joe B... | Apr 4th, 2023 | preprint

    Rapid advancements of large language models (LLMs) have enabled the processing, understanding, and generation of human-like text, with increasing integration into systems that touch our social sphere. Despite this success, these models can learn, perpetuate, and amplify harmful social biases. In this paper, we present a comprehensive survey of bias evaluation and mitigation techniques for LLMs. We first consolidate, formalize, and expand notions of social bias and fairness in natural...

  • Harsh Kumar, David M. Rothschild, Daniel... | Apr 4th, 2023 | preprint

    The widespread availability of large language models (LLMs) has provoked both fear and excitement in the domain of education. On one hand, there is the concern that students will offload their coursework to LLMs, limiting what they themselves learn. On the other hand, there is the hope that LLMs might serve as scalable, personalized tutors. Here we conduct a large, pre-registered experiment involving 1200 participants to investigate how exposure to LLM-based explanations affects learning. In the...

  • Gyeong-Geon Lee, Ehsan Latif, Xuansheng ... | Apr 4th, 2023 | journalArticle

    This study investigates the application of large language models (LLMs), specifically GPT-3.5 and GPT-4, with Chain-of-Thought (CoT) in the automatic scoring of student-written responses to science assessments. We focused on overcoming the challenges of accessibility, technical complexity, and lack of explainability that have previously limited the use of artificial intelligence-based automatic scoring tools among researchers and educators. With a testing dataset comprising six assessment...

  • Fengchun Miao, Wayne Holmes | Apr 4th, 2023 | book
  • Ramon Pires, Hugo Abonizio, Thales Sales... | Apr 4th, 2023 | preprint

    As the capabilities of language models continue to advance, it is conceivable that the "one-size-fits-all" model will remain the main paradigm. For instance, given the vast number of languages worldwide, many of which are low-resource, the prevalent practice is to pretrain a single model on multiple languages. In this paper, we add to the growing body of evidence that challenges this practice, demonstrating that monolingual pretraining on the target language significantly improves models...

  • Shashank Sonkar, Naiming Liu, Debshila M... | Apr 4th, 2023 | conferencePaper
  • Valdemar Švábenský, Ryan S. Baker, André... | Apr 4th, 2023 | conferencePaper
  • Jiaan Wang, Yunlong Liang, Fandong Meng,... | Apr 4th, 2023 | journalArticle

    Recently, the emergence of ChatGPT has attracted wide attention from the computational linguistics community. Many prior studies have shown that ChatGPT achieves remarkable performance on various NLP tasks in terms of automatic evaluation metrics. However, the ability of ChatGPT to serve as an evaluation metric is still underexplored. Considering assessing the quality of natural language generation (NLG) models is an arduous task and NLG metrics notoriously show their poor correlation with...

  • Kevin P. Yancey, Geoffrey Laflair, Antho... | Apr 4th, 2023 | conferencePaper

    Essay scoring is a critical task used to evaluate second-language (L2) writing proficiency on high-stakes language assessments. While automated scoring approaches are mature and have been around for decades, human scoring is still considered the gold standard, despite its high costs and well-known issues such as human rater fatigue and bias. The recent introduction of large language models (LLMs) brings new opportunities for automated scoring. In this paper, we evaluate how well GPT-3.5 and...

  • Shuyan Zhou, Uri Alon, Sumit Agarwal | Apr 4th, 2023 | conferencePaper
  • Zihao Zhou, Maizhen Ning, Qiufeng Wang | Apr 4th, 2023 | conferencePaper
  • EdArXiv | Dec 19th, 2022 | report

    Predictive analytics methods in education are seeing widespread use and are producing increasingly accurate predictions of students’ outcomes. With the increased use of predictive analytics comes increasing concern about fairness for specific subgroups of the population. One approach that has been proposed to increase fairness is using demographic variables directly in models, as predictors. In this paper we explore issues of fairness in the use of demographic variables as predictors of...

  • Alexandra Sasha Luccioni, Sylvain Viguie... | Nov 3rd, 2022 | preprint

    Progress in machine learning (ML) comes with a cost to the environment, given that training ML models requires significant computational resources, energy and materials. In the present article, we aim to quantify the carbon footprint of BLOOM, a 176-billion parameter language model, across its life cycle. We estimate that BLOOM's final training emitted approximately 24.7 tonnes of CO2eq if we consider only the dynamic power consumption, and 50.5 tonnes if we account for all processes...

  • Shiki Sato, Yosuke Kishinami, Hiroaki Su... | Nov 4th, 2022 | conferencePaper

    Automation of dialogue system evaluation is a driving force for the efficient development of dialogue systems. This paper introduces the bipartite-play method, a dialogue collection method for automating dialogue system evaluation. It addresses the limitations of existing dialogue collection methods: (i) inability to compare with systems that are not publicly available, and (ii) vulnerability to cheating by intentionally selecting systems to be compared. Experimental results show that the...

  • Anita Schick, Jasper Feine, Stefan Moran... | Oct 31st, 2022 | journalArticle

    Mental disorders in adolescence and young adulthood are major public health concerns. Digital tools such as text-based conversational agents (ie, chatbots) are a promising technology for facilitating mental health assessment. However, the human-like interaction style of chatbots may induce potential biases, such as socially desirable responding (SDR), and may require further effort to complete assessments.

  • Ming Zhong, Yang Liu, Da Yin | Oct 13th, 2022 | preprint

    Multi-dimensional evaluation is the dominant paradigm for human evaluation in Natural Language Generation (NLG), i.e., evaluating the generated text from multiple explainable dimensions, such as coherence and fluency. However, automatic evaluation in NLG is still dominated by similarity-based metrics, and we lack a reliable framework for a more comprehensive evaluation of advanced models. In this paper, we propose a unified multi-dimensional evaluator UniEval for NLG. We re-frame NLG...

Last update from database: 04/04/2025, 02:15 (UTC)
Powered by Zotero and Kerko.