Results – Evidence Library – Artificial Intelligence in Measurement and Education

BARTScore: Evaluating Generated Text as Text Generation

Weizhe Yuan, Graham Neubig, Pengfei Liu,...

|

Apr 4th, 2021

|

journalArticle

Weizhe Yuan, Graham Neubig, Pengfei Liu,...

Apr 4th, 2021

A wide variety of NLP applications, such as machine translation, summarization, and dialog, involve text generation. One major challenge for these applications is how to evaluate whether such generated texts are actually fluent, accurate, or effective. In this work, we conceptualize the evaluation of generated text as a text generation problem, modeled using pre-trained sequence-to-sequence models. The general idea is that models trained to convert the generated text to/from a reference...

Towards a Unified Multi-Dimensional Evaluator for Text Generation

Ming Zhong, Yang Liu, Da Yin

|

Oct 13th, 2022

|

preprint

Ming Zhong, Yang Liu, Da Yin

Oct 13th, 2022

Multi-dimensional evaluation is the dominant paradigm for human evaluation in Natural Language Generation (NLG), i.e., evaluating the generated text from multiple explainable dimensions, such as coherence and fluency. However, automatic evaluation in NLG is still dominated by similarity-based metrics, and we lack a reliable framework for a more comprehensive evaluation of advanced models. In this paper, we propose a unified multi-dimensional evaluator UniEval for NLG. We re-frame NLG...

GPTScore: Evaluate as You Desire

Jinlan Fu, See-Kiong Ng, Zhengbao Jiang,...

|

Apr 4th, 2023

|

journalArticle

Jinlan Fu, See-Kiong Ng, Zhengbao Jiang,...

Apr 4th, 2023

Generative Artificial Intelligence (AI) has enabled the development of sophisticated models that are capable of producing high-caliber text, images, and other outputs through the utilization of large pre-trained models. Nevertheless, assessing the quality of the generation is an even more arduous task than the generation itself, and this issue has not been given adequate consideration recently. This paper proposes a novel evaluation framework, GPTScore, which utilizes the emergent abilities...

Applying Large Language Models and Chain-of-Thought for Automatic Scoring

Gyeong-Geon Lee, Ehsan Latif, Xuansheng ...

|

Apr 4th, 2023

|

journalArticle

Gyeong-Geon Lee, Ehsan Latif, Xuansheng ...

Apr 4th, 2023

This study investigates the application of large language models (LLMs), specifically GPT-3.5 and GPT-4, with Chain-of-Though (CoT) in the automatic scoring of student-written responses to science assessments. We focused on overcoming the challenges of accessibility, technical complexity, and lack of explainability that have previously limited the use of artificial intelligence-based automatic scoring tools among researchers and educators. With a testing dataset comprising six assessment...

G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment

Yang Liu, Dan Iter, Yichong Xu

|

May 23rd, 2023

|

preprint

Yang Liu, Dan Iter, Yichong Xu

May 23rd, 2023

The quality of texts generated by natural language generation (NLG) systems is hard to measure automatically. Conventional reference-based metrics, such as BLEU and ROUGE, have been shown to have relatively low correlation with human judgments, especially for tasks that require creativity and diversity. Recent studies suggest using large language models (LLMs) as reference-free metrics for NLG evaluation, which have the benefit of being applicable to new tasks that lack human references....

ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate

Chi-Min Chan, Weize Chen, Yusheng Su

|

Aug 14th, 2023

|

preprint

Chi-Min Chan, Weize Chen, Yusheng Su

Aug 14th, 2023

Text evaluation has historically posed significant challenges, often demanding substantial labor and time cost. With the emergence of large language models (LLMs), researchers have explored LLMs' potential as alternatives for human evaluation. While these single-agent-based approaches show promise, experimental results suggest that further advancements are needed to bridge the gap between their current effectiveness and human-level evaluation quality. Recognizing that best practices of human...

A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions

Lei Huang, Weijiang Yu, Weitao Ma

|

Nov 9th, 2023

|

preprint

Lei Huang, Weijiang Yu, Weitao Ma

Nov 9th, 2023

The emergence of large language models (LLMs) has marked a significant breakthrough in natural language processing (NLP), leading to remarkable advancements in text understanding and generation. Nevertheless, alongside these strides, LLMs exhibit a critical tendency to produce hallucinations, resulting in content that is inconsistent with real-world facts or user inputs. This phenomenon poses substantial challenges to their practical deployment and raises concerns over the reliability of...

The Llama 3 Herd of Models

Abhimanyu Dubey, Abhinav Jauhri, Abhinav...

|

Aug 15th, 2024

|

preprint

Abhimanyu Dubey, Abhinav Jauhri, Abhinav...

Aug 15th, 2024

Modern artificial intelligence (AI) systems are powered by foundation models. This paper presents a new set of foundation models, called Llama 3. It is a herd of language models that natively support multilinguality, coding, reasoning, and tool usage. Our largest model is a dense Transformer with 405B parameters and a context window of up to 128K tokens. This paper presents an extensive empirical evaluation of Llama 3. We find that Llama 3 delivers comparable quality to leading language...

Search

Technical methods

Publication year