5 resources
-
Weizhe Yuan, Graham Neubig, Pengfei Liu, ... | Apr 3rd, 2021 | journal article
A wide variety of NLP applications, such as machine translation, summarization, and dialog, involve text generation. One major challenge for these applications is how to evaluate whether such generated texts are actually fluent, accurate, or effective. In this work, we conceptualize the evaluation of generated text as a text generation problem, modeled using pre-trained sequence-to-sequence models. The general idea is that models trained to convert the generated text to/from a reference...
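The core idea in this abstract — scoring a candidate by how likely a seq2seq model is to generate the reference text from it (or vice versa) — can be sketched with stand-in probabilities. A real setup would use a pre-trained model such as BART to produce per-token probabilities; here they are faked for illustration, and `generation_score` is a hypothetical name, not the paper's API.

```python
import math

def generation_score(token_probs):
    """Average log-probability of the reference tokens, given the
    candidate as conditioning context. Higher is better."""
    return sum(math.log(p) for p in token_probs) / len(token_probs)

# Fake per-token probabilities P(ref_token | candidate, prefix) that a
# seq2seq model might assign to a fluent vs. a disfluent candidate.
fluent_probs = [0.9, 0.8, 0.85, 0.9]
disfluent_probs = [0.3, 0.1, 0.2, 0.25]

assert generation_score(fluent_probs) > generation_score(disfluent_probs)
```

The aggregate is a length-normalized log-likelihood, so candidates of different lengths remain comparable.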
-
Jinlan Fu, See-Kiong Ng, Zhengbao Jiang, ... | Apr 3rd, 2023 | journal article
Generative Artificial Intelligence (AI) has enabled the development of sophisticated models that are capable of producing high-caliber text, images, and other outputs through the utilization of large pre-trained models. Nevertheless, assessing the quality of the generation is an even more arduous task than the generation itself, and this issue has not been given adequate consideration recently. This paper proposes a novel evaluation framework, GPTScore, which utilizes the emergent abilities...
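The recipe described here — prompting an LLM with an instruction naming the quality aspect, then scoring the candidate by the log-probability the model assigns to it — can be sketched as below. The prompt template and the per-token log-probs are illustrative assumptions, not the framework's actual interface.

```python
def build_aspect_prompt(aspect, source, candidate):
    """Instruction that names the evaluation aspect; the candidate's
    tokens would then be scored under an LLM conditioned on it."""
    return (f"Generate a {aspect} summary for the text below.\n"
            f"Text: {source}\n"
            f"Summary: {candidate}")

def aspect_score(candidate_logprobs):
    """Average per-token log-probability of the candidate tokens."""
    return sum(candidate_logprobs) / len(candidate_logprobs)

prompt = build_aspect_prompt("fluent", "the source document",
                             "a candidate summary")
# A real model would return log-probs for the candidate tokens given
# the prompt; these values are stand-ins.
score = aspect_score([-0.5, -1.2, -0.8])
```

Varying only the aspect word in the instruction yields one score per evaluation dimension from the same model.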
-
Yiqing Xie, Alex Xie, Divyanshu Sheth | Mar 31st, 2024 | preprint
To facilitate evaluation of code generation systems across diverse scenarios, we present CodeBenchGen, a framework to create scalable execution-based benchmarks that only requires light guidance from humans. Specifically, we leverage a large language model (LLM) to convert an arbitrary piece of code into an evaluation example, including test cases for execution-based evaluation. We illustrate the usefulness of our framework by creating a dataset, Exec-CSN, which includes 1,931 examples...
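Execution-based evaluation, as used in benchmarks like the one described, means a candidate solution is judged by running test cases against it rather than by textual similarity. A minimal sketch (the function and test names are illustrative, not taken from Exec-CSN):

```python
def run_tests(candidate_src, test_cases):
    """Execute candidate code, then run each test case (an assert
    statement) against it. Returns (passed, total)."""
    namespace = {}
    exec(candidate_src, namespace)      # load the candidate's definitions
    passed = 0
    for test in test_cases:
        try:
            exec(test, namespace)       # each test is a bare assertion
            passed += 1
        except AssertionError:
            pass
    return passed, len(test_cases)

candidate = "def add(a, b):\n    return a + b\n"
tests = ["assert add(1, 2) == 3", "assert add(-1, 1) == 0"]
# run_tests(candidate, tests) -> (2, 2)
```

Real harnesses additionally sandbox execution and enforce timeouts, since generated code is untrusted.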
-
Ming Zhong, Yang Liu, Da Yin | Oct 13th, 2022 | preprint
Multi-dimensional evaluation is the dominant paradigm for human evaluation in Natural Language Generation (NLG), i.e., evaluating the generated text from multiple explainable dimensions, such as coherence and fluency. However, automatic evaluation in NLG is still dominated by similarity-based metrics, and we lack a reliable framework for a more comprehensive evaluation of advanced models. In this paper, we propose a unified multi-dimensional evaluator UniEval for NLG. We re-frame NLG...
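The multi-dimensional framing described here — one score per explainable dimension, obtainable by posing each dimension as a yes/no question to a model — can be sketched as follows. The question template, the stand-in "yes" probabilities, and the simple-mean aggregation are all illustrative assumptions, not the evaluator's actual design.

```python
def dimension_question(dimension, text):
    """Re-frame one evaluation dimension as a boolean question; a
    model's probability of answering 'yes' becomes the score."""
    return f"Is this a {dimension} piece of text? Answer yes or no.\n{text}"

def overall_score(dim_scores):
    """Aggregate per-dimension scores into one number (simple mean)."""
    return sum(dim_scores.values()) / len(dim_scores)

# Stand-in probabilities of 'yes' from a hypothetical model.
scores = {"coherence": 0.92, "fluency": 0.88, "consistency": 0.75}
question = dimension_question("coherent", "some generated text")
overall = overall_score(scores)
```

Keeping the dimensions separate is what makes the evaluation explainable: a low overall score can be traced to the dimension that caused it.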