63 resources

  • Zach Tilton, John M. LaVelle, Tian Ford,... | Jun 14th, 2023 | journalArticle

    Advancements in Artificial Intelligence (AI) signal a paradigmatic shift with the potential to transform many aspects of society, including evaluation education, with implications for subsequent evaluation practice. This article explores the potential implications of AI for evaluator and evaluation education. Specifically, the article discusses key issues in evaluation education, including equitable language access to evaluation education, navigating program, social science, and...

  • Tim Dettmers, Artidoro Pagnoni, Ari Holt... | May 23rd, 2023 | preprint

    We present QLoRA, an efficient finetuning approach that reduces memory usage enough to finetune a 65B parameter model on a single 48GB GPU while preserving full 16-bit finetuning task performance. QLoRA backpropagates gradients through a frozen, 4-bit quantized pretrained language model into Low Rank Adapters (LoRA). Our best model family, which we name Guanaco, outperforms all previous openly released models on the Vicuna benchmark, reaching 99.3% of the performance level of ChatGPT while...
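    As a hedged illustration of the recipe this abstract describes (a frozen 4-bit base model with trainable Low Rank Adapters), the sketch below uses the Hugging Face transformers, peft, and bitsandbytes libraries; the model name and adapter hyperparameters are placeholders, not the paper's Guanaco configuration.

        # Sketch of a QLoRA-style setup: 4-bit quantized, frozen base model plus
        # trainable LoRA adapters. Model name and hyperparameters are illustrative.
        import torch
        from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
        from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

        model_id = "huggyllama/llama-7b"  # placeholder; the paper finetunes up to 65B

        bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,                      # store base weights in 4 bits
            bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
            bnb_4bit_use_double_quant=True,         # double quantization saves memory
            bnb_4bit_compute_dtype=torch.bfloat16,  # compute in 16-bit
        )

        tokenizer = AutoTokenizer.from_pretrained(model_id)
        model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)

        model = prepare_model_for_kbit_training(model)  # freeze quantized base weights
        lora_config = LoraConfig(
            r=16, lora_alpha=32, lora_dropout=0.05,
            target_modules=["q_proj", "v_proj"],  # attach adapters to attention projections
            task_type="CAUSAL_LM",
        )
        model = get_peft_model(model, lora_config)
        model.print_trainable_parameters()  # only the low-rank adapters are trainable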

  • Yang Liu, Dan Iter, Yichong Xu | May 23rd, 2023 | preprint

    The quality of texts generated by natural language generation (NLG) systems is hard to measure automatically. Conventional reference-based metrics, such as BLEU and ROUGE, have been shown to have relatively low correlation with human judgments, especially for tasks that require creativity and diversity. Recent studies suggest using large language models (LLMs) as reference-free metrics for NLG evaluation, which have the benefit of being applicable to new tasks that lack human references....
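    As a rough sketch of the reference-free idea this abstract describes (not the authors' exact prompts or scoring procedure), the snippet below asks a chat model to rate a generated summary on a single dimension; the model name and prompt wording are assumptions.

        # Sketch of reference-free NLG evaluation with an LLM as the judge.
        # Prompt wording and model name are illustrative, not the paper's setup.
        from openai import OpenAI

        client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

        def llm_rate(source: str, candidate: str, dimension: str = "coherence") -> str:
            prompt = (
                f"Rate the {dimension} of the summary below on a scale from 1 (worst) "
                f"to 5 (best). Reply with the number only.\n\n"
                f"Source document:\n{source}\n\n"
                f"Generated summary:\n{candidate}"
            )
            response = client.chat.completions.create(
                model="gpt-4o-mini",  # placeholder model name
                messages=[{"role": "user", "content": prompt}],
                temperature=0,
            )
            return response.choices[0].message.content.strip()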

  • Jason Wei, Xuezhi Wang, Dale Schuurmans,... | Jan 10th, 2023 | preprint

    We explore how generating a chain of thought -- a series of intermediate reasoning steps -- significantly improves the ability of large language models to perform complex reasoning. In particular, we show how such reasoning abilities emerge naturally in sufficiently large language models via a simple method called chain of thought prompting, where a few chain of thought demonstrations are provided as exemplars in prompting. Experiments on three large language models show that chain of...
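    A minimal sketch of the prompting pattern this abstract describes: a few-shot prompt whose exemplar includes the intermediate reasoning steps, nudging the model to reason before answering. The exemplar and test question are illustrative.

        # Few-shot chain-of-thought prompting: the exemplar shows worked reasoning.
        EXEMPLAR = (
            "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
            "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
            "A: Roger started with 5 balls. 2 cans of 3 balls each is 6 balls. "
            "5 + 6 = 11. The answer is 11.\n\n"
        )

        def cot_prompt(question: str) -> str:
            """Prepend a worked exemplar (with reasoning steps) to a new question."""
            return EXEMPLAR + f"Q: {question}\nA:"

        print(cot_prompt("A cafeteria had 23 apples. They used 20 and bought 6 more. "
                         "How many apples do they have now?"))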

  • Jinlan Fu, See-Kiong Ng, Zhengbao Jiang,... | Mar 14th, 2023 | journalArticle

    Generative Artificial Intelligence (AI) has enabled the development of sophisticated models that are capable of producing high-caliber text, images, and other outputs through the utilization of large pre-trained models. Nevertheless, assessing the quality of the generation is an even more arduous task than the generation itself, and this issue has not been given adequate consideration recently. This paper proposes a novel evaluation framework, GPTScore, which utilizes the emergent abilities...
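    A hedged sketch of the core scoring idea (the conditional probability a pretrained LM assigns to the candidate text); the backbone model and evaluation instruction are placeholders, and the official GPTScore implementation differs in its prompt templates and supported backbones.

        # Score a candidate text by the average token log-probability a causal LM
        # assigns to it, conditioned on an evaluation instruction.
        import torch
        from transformers import AutoModelForCausalLM, AutoTokenizer

        tokenizer = AutoTokenizer.from_pretrained("gpt2")            # placeholder backbone
        model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

        def avg_logprob(instruction: str, candidate: str) -> float:
            prefix_ids = tokenizer(instruction, return_tensors="pt").input_ids
            cand_ids = tokenizer(candidate, return_tensors="pt").input_ids
            input_ids = torch.cat([prefix_ids, cand_ids], dim=1)
            with torch.no_grad():
                logits = model(input_ids).logits
            log_probs = torch.log_softmax(logits[:, :-1], dim=-1)    # predictions for tokens 1..n
            targets = input_ids[:, 1:]
            token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
            return token_lp[:, prefix_ids.shape[1] - 1:].mean().item()  # candidate tokens only

        score = avg_logprob("Rewrite the news article as a fluent summary:\n",
                            "The council approved the new budget on Tuesday.")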

  • Gyeong-Geon Lee, Ehsan Latif, Xuansheng ... | Mar 14th, 2023 | journalArticle

    This study investigates the application of large language models (LLMs), specifically GPT-3.5 and GPT-4, with Chain-of-Thought (CoT) in the automatic scoring of student-written responses to science assessments. We focused on overcoming the challenges of accessibility, technical complexity, and lack of explainability that have previously limited the use of artificial intelligence-based automatic scoring tools among researchers and educators. With a testing dataset comprising six assessment...

  • Jiaan Wang, Yunlong Liang, Fandong Meng,... | Mar 14th, 2023 | journalArticle

    Recently, the emergence of ChatGPT has attracted wide attention from the computational linguistics community. Many prior studies have shown that ChatGPT achieves remarkable performance on various NLP tasks in terms of automatic evaluation metrics. However, the ability of ChatGPT to serve as an evaluation metric is still underexplored. Considering that assessing the quality of natural language generation (NLG) models is an arduous task and NLG metrics notoriously show poor correlation with...

  • Shuyan Zhou, Uri Alon, Sumit Agarwal | Mar 14th, 2023 | conferencePaper
  • Shiki Sato, Yosuke Kishinami, Hiroaki Su... | Nov 14th, 2022 | conferencePaper

    Automation of dialogue system evaluation is a driving force for the efficient development of dialogue systems. This paper introduces the bipartite-play method, a dialogue collection method for automating dialogue system evaluation. It addresses the limitations of existing dialogue collection methods: (i) inability to compare with systems that are not publicly available, and (ii) vulnerability to cheating by intentionally selecting systems to be compared. Experimental results show that the...

  • Anita Schick, Jasper Feine, Stefan Moran... | Oct 31st, 2022 | journalArticle

    Mental disorders in adolescence and young adulthood are major public health concerns. Digital tools such as text-based conversational agents (ie, chatbots) are a promising technology for facilitating mental health assessment. However, the human-like interaction style of chatbots may induce potential biases, such as socially desirable responding (SDR), and may require further effort to complete assessments.

  • Ming Zhong, Yang Liu, Da Yin | Oct 13th, 2022 | preprint

    Multi-dimensional evaluation is the dominant paradigm for human evaluation in Natural Language Generation (NLG), i.e., evaluating the generated text from multiple explainable dimensions, such as coherence and fluency. However, automatic evaluation in NLG is still dominated by similarity-based metrics, and we lack a reliable framework for a more comprehensive evaluation of advanced models. In this paper, we propose a unified multi-dimensional evaluator UniEval for NLG. We re-frame NLG...

  • Cyril Chhun, Pierre Colombo, Chloé Clave... | Sep 15th, 2022 | preprint

    Research on Automatic Story Generation (ASG) relies heavily on human and automatic evaluation. However, there is no consensus on which human evaluation criteria to use, and no analysis of how well automatic criteria correlate with them. In this paper, we propose to re-evaluate ASG evaluation. We introduce a set of 6 orthogonal and comprehensive human criteria, carefully motivated by the social sciences literature. We also present HANNA, an annotated dataset of 1,056 stories produced by 10...
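    A small, hedged illustration of the meta-evaluation step this abstract mentions (measuring how well an automatic criterion correlates with human judgments); the score lists are placeholders, not values from the HANNA dataset.

        # Correlate an automatic metric with human ratings of the same stories.
        from scipy.stats import kendalltau, spearmanr

        human_scores = [3.0, 4.5, 2.0, 5.0, 3.5]        # illustrative human ratings
        metric_scores = [0.41, 0.62, 0.30, 0.71, 0.45]  # illustrative metric outputs

        rho, rho_p = spearmanr(human_scores, metric_scores)
        tau, tau_p = kendalltau(human_scores, metric_scores)
        print(f"Spearman rho={rho:.2f} (p={rho_p:.3f}), Kendall tau={tau:.2f} (p={tau_p:.3f})")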

  • Pierre Jean A. Colombo, Chloé Clavel, Pa... | Jun 28th, 2022 | journalArticle

    Assessing the quality of natural language generation (NLG) systems through human annotation is very expensive. Additionally, human annotation campaigns are time-consuming and involve non-reusable human labour. In practice, researchers rely on automatic metrics as a proxy for quality. In the last decade, many string-based metrics (e.g., BLEU or ROUGE) have been introduced. However, such metrics usually rely on exact matches and thus do not robustly handle synonyms. In this paper, we introduce...
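    To make the exact-match limitation concrete, the hedged snippet below scores a paraphrase that swaps in a synonym, using the nltk and rouge-score packages; the sentences are illustrative and the absolute scores depend on library versions.

        # String-based metrics penalize a synonym even though meaning is preserved.
        from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu
        from rouge_score import rouge_scorer

        reference = "the quick brown fox jumps over the lazy dog".split()
        paraphrase = "the fast brown fox jumps over the lazy dog".split()  # quick -> fast

        smooth = SmoothingFunction().method1
        print("BLEU:", sentence_bleu([reference], paraphrase, smoothing_function=smooth))

        scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
        scores = scorer.score(" ".join(reference), " ".join(paraphrase))
        print("ROUGE-1 F1:", scores["rouge1"].fmeasure)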

  • Kerstin Denecke, Alaa Abd-Alrazaq, Mowaf... | Oct 31st, 2021 | journalArticle

    Background: In recent years, an increasing number of health chatbots have been published in app stores and described in the research literature. Given the sensitive data they process and the care settings for which they are developed, evaluation is essential to avoid harm to users. However, evaluations of those systems are reported inconsistently and without using a standardized set of evaluation metrics. Missing standards in health chatbot evaluation prevent...

  • Vasilis Efthymiou, Kostas Stefanidis, Ev... | Oct 26th, 2021 | conferencePaper
  • Gabriel Oliveira dos Santos, Esther Luna... | Sep 28th, 2021 | preprint

    This paper shows that CIDEr-D, a traditional evaluation metric for image description, does not work properly on datasets where sentences contain significantly more words than those in the MS COCO Captions dataset. We also show that CIDEr-D's performance is hampered by the lack of multiple reference sentences and by high variance in sentence length. To bypass this problem, we introduce CIDEr-R, which improves CIDEr-D, making it more flexible in dealing with datasets with high...
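    As a rough approximation of the idea behind CIDEr-style metrics (cosine similarity between TF-IDF-weighted n-gram vectors of a candidate caption and its references), the sketch below uses scikit-learn; it is not the official CIDEr-D or CIDEr-R implementation, which scores each n-gram order separately and applies length-based penalties. The captions are illustrative.

        # Approximate consensus scoring: TF-IDF n-gram cosine similarity to references.
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.metrics.pairwise import cosine_similarity

        references = [
            "a man riding a wave on a surfboard",
            "a surfer rides a large wave in the ocean",
        ]
        candidate = "a man is surfing on a big wave"

        vectorizer = TfidfVectorizer(ngram_range=(1, 4))
        tfidf = vectorizer.fit_transform(references + [candidate])
        ref_vecs = tfidf[: len(references)]
        cand_vec = tfidf[len(references) :]
        print("similarity:", cosine_similarity(cand_vec, ref_vecs).mean())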

  • Elizabeth Clark, Tal August, Sofia Serra... | Jul 7th, 2021 | preprint

    Human evaluations are typically considered the gold standard in natural language generation, but as models' fluency improves, how well can evaluators detect and judge machine-generated text? We run a study assessing non-experts' ability to distinguish between human- and machine-authored text (GPT2 and GPT3) in three domains (stories, news articles, and recipes). We find that, without training, evaluators distinguished between GPT3- and human-authored text at random chance level. We explore...

  • Xu Han, Michelle Zhou, Matthew J. Turner... | May 6th, 2021 | conferencePaper
  • Michael McTear | Mar 14th, 2021 | bookSection
  • University of Wolverhampton, UK, Hadeel ... | Mar 14th, 2021 | conferencePaper
Last update from database: 02/03/2025, 19:15 (UTC)