5 resources
-
Weizhe Yuan, Graham Neubig, Pengfei Liu, ... | Apr 3rd, 2021 | journal article
A wide variety of NLP applications, such as machine translation, summarization, and dialog, involve text generation. One major challenge for these applications is how to evaluate whether such generated texts are actually fluent, accurate, or effective. In this work, we conceptualize the evaluation of generated text as a text generation problem, modeled using pre-trained sequence-to-sequence models. The general idea is that models trained to convert the generated text to/from a reference...
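The core idea in this abstract — scoring a candidate by how likely a seq2seq model is to generate the reference text from it (or vice versa) — can be sketched with stand-in probabilities. A real setup would use a pre-trained model such as BART to produce per-token probabilities; here they are faked for illustration, and `generation_score` is a hypothetical name, not the paper's API.

```python
import math

def generation_score(token_probs):
    """Average log-probability of the reference tokens, given the
    candidate as conditioning context. Higher is better."""
    return sum(math.log(p) for p in token_probs) / len(token_probs)

# Fake per-token probabilities P(ref_token | candidate, prefix) that a
# seq2seq model might assign to a fluent vs. a disfluent candidate.
fluent_probs = [0.9, 0.8, 0.85, 0.9]
disfluent_probs = [0.3, 0.1, 0.2, 0.25]

assert generation_score(fluent_probs) > generation_score(disfluent_probs)
```

The aggregate is a length-normalized log-likelihood, so candidates of different lengths remain comparable.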
-
Jinlan Fu, See-Kiong Ng, Zhengbao Jiang, ... | Apr 3rd, 2023 | journal article
Generative Artificial Intelligence (AI) has enabled the development of sophisticated models that are capable of producing high-caliber text, images, and other outputs through the utilization of large pre-trained models. Nevertheless, assessing the quality of the generation is an even more arduous task than the generation itself, and this issue has not been given adequate consideration recently. This paper proposes a novel evaluation framework, GPTScore, which utilizes the emergent abilities...
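The recipe described here — prompting an LLM with an instruction naming the quality aspect, then scoring the candidate by the log-probability the model assigns to it — can be sketched as below. The prompt template and the per-token log-probs are illustrative assumptions, not the framework's actual interface.

```python
def build_aspect_prompt(aspect, source, candidate):
    """Instruction that names the evaluation aspect; the candidate's
    tokens would then be scored under an LLM conditioned on it."""
    return (f"Generate a {aspect} summary for the text below.\n"
            f"Text: {source}\n"
            f"Summary: {candidate}")

def aspect_score(candidate_logprobs):
    """Average per-token log-probability of the candidate tokens."""
    return sum(candidate_logprobs) / len(candidate_logprobs)

prompt = build_aspect_prompt("fluent", "the source document",
                             "a candidate summary")
# A real model would return log-probs for the candidate tokens given
# the prompt; these values are stand-ins.
score = aspect_score([-0.5, -1.2, -0.8])
```

Varying only the aspect word in the instruction yields one score per evaluation dimension from the same model.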
-
Yiqing Xie, Alex Xie, Divyanshu Sheth | Mar 31st, 2024 | preprint
To facilitate evaluation of code generation systems across diverse scenarios, we present CodeBenchGen, a framework to create scalable execution-based benchmarks that only requires light guidance from humans. Specifically, we leverage a large language model (LLM) to convert an arbitrary piece of code into an evaluation example, including test cases for execution-based evaluation. We illustrate the usefulness of our framework by creating a dataset, Exec-CSN, which includes 1,931 examples...
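Execution-based evaluation, as used in benchmarks like the one described, means a candidate solution is judged by running test cases against it rather than by textual similarity. A minimal sketch (the function and test names are illustrative, not taken from Exec-CSN):

```python
def run_tests(candidate_src, test_cases):
    """Execute candidate code, then run each test case (an assert
    statement) against it. Returns (passed, total)."""
    namespace = {}
    exec(candidate_src, namespace)      # load the candidate's definitions
    passed = 0
    for test in test_cases:
        try:
            exec(test, namespace)       # each test is a bare assertion
            passed += 1
        except AssertionError:
            pass
    return passed, len(test_cases)

candidate = "def add(a, b):\n    return a + b\n"
tests = ["assert add(1, 2) == 3", "assert add(-1, 1) == 0"]
# run_tests(candidate, tests) -> (2, 2)
```

Real harnesses additionally sandbox execution and enforce timeouts, since generated code is untrusted.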
-
Ming Zhong, Yang Liu, Da Yin | Oct 13th, 2022 | preprint
Multi-dimensional evaluation is the dominant paradigm for human evaluation in Natural Language Generation (NLG), i.e., evaluating the generated text from multiple explainable dimensions, such as coherence and fluency. However, automatic evaluation in NLG is still dominated by similarity-based metrics, and we lack a reliable framework for a more comprehensive evaluation of advanced models. In this paper, we propose a unified multi-dimensional evaluator UniEval for NLG. We re-frame NLG...
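The multi-dimensional framing described here — one score per explainable dimension, obtainable by posing each dimension as a yes/no question to a model — can be sketched as follows. The question template, the stand-in "yes" probabilities, and the simple-mean aggregation are all illustrative assumptions, not the evaluator's actual design.

```python
def dimension_question(dimension, text):
    """Re-frame one evaluation dimension as a boolean question; a
    model's probability of answering 'yes' becomes the score."""
    return f"Is this a {dimension} piece of text? Answer yes or no.\n{text}"

def overall_score(dim_scores):
    """Aggregate per-dimension scores into one number (simple mean)."""
    return sum(dim_scores.values()) / len(dim_scores)

# Stand-in probabilities of 'yes' from a hypothetical model.
scores = {"coherence": 0.92, "fluency": 0.88, "consistency": 0.75}
question = dimension_question("coherent", "some generated text")
overall = overall_score(scores)
```

Keeping the dimensions separate is what makes the evaluation explainable: a low overall score can be traced to the dimension that caused it.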