Search

In authors or contributors: "Zhang"

Technical methods: model evaluation subgroup

Publication year: Between 2000 and 2025

4 resources

Abstracts

Evaluating Evaluation Metrics: A Framework for Analyzing NLG Evaluation Metrics using Measurement Theory

Ziang Xiao, Susu Zhang, Vivian Lai | Oct 22nd, 2023 | preprint

We address a fundamental challenge in Natural Language Generation (NLG) model evaluation -- the design and evaluation of evaluation metrics. Recognizing the limitations of existing automatic metrics and noises from how current human evaluation was conducted, we propose MetricEval, a framework informed by measurement theory, the foundation of educational test design, for conceptualizing and evaluating the reliability and validity of NLG evaluation metrics. The framework formalizes the source...
BERTScore: Evaluating Text Generation with BERT

Tianyi Zhang, Varsha Kishore, Felix Wu | Feb 24th, 2020 | preprint

We propose BERTScore, an automatic evaluation metric for text generation. Analogously to common metrics, BERTScore computes a similarity score for each token in the candidate sentence with each token in the reference sentence. However, instead of exact matches, we compute token similarity using contextual embeddings. We evaluate using the outputs of 363 machine translation and image captioning systems. BERTScore correlates better with human judgments and provides stronger model selection...
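
The token-level comparison described in this abstract can be illustrated with a minimal sketch. The abstract only states that every candidate token is scored against every reference token using contextual embeddings; taking the best match per token and combining the results into precision, recall, and F1 is an assumption made here for illustration, and the two-dimensional vectors below are hand-made stand-ins, not real BERT outputs. The names `cosine` and `greedy_match_scores` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_match_scores(candidate_vecs, reference_vecs):
    """Return (precision, recall, f1) from pairwise token similarities.

    Assumption: each token is matched to its most similar token on the
    other side, and the best-match similarities are averaged.
    """
    # sim[i][j] = similarity of candidate token i and reference token j
    sim = [[cosine(c, r) for r in reference_vecs] for c in candidate_vecs]

    # Precision: average best match for each candidate token.
    precision = sum(max(row) for row in sim) / len(candidate_vecs)

    # Recall: average best match for each reference token.
    recall = sum(
        max(sim[i][j] for i in range(len(candidate_vecs)))
        for j in range(len(reference_vecs))
    ) / len(reference_vecs)

    total = precision + recall
    f1 = 0.0 if total == 0 else 2 * precision * recall / total
    return precision, recall, f1


# Toy "embeddings" (assumed values): two candidate tokens, three reference tokens.
candidate = [[1.0, 0.0], [0.0, 1.0]]
reference = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

precision, recall, f1 = greedy_match_scores(candidate, reference)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In this toy case every candidate token has an exact counterpart in the reference, so precision is 1, while the third reference token is only partially matched and lowers recall. A real setup would replace the toy vectors with per-token outputs from a contextual encoder.
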
ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate

Chi-Min Chan, Weize Chen, Yusheng Su | Aug 14th, 2023 | preprint

Text evaluation has historically posed significant challenges, often demanding substantial labor and time cost. With the emergence of large language models (LLMs), researchers have explored LLMs' potential as alternatives for human evaluation. While these single-agent-based approaches show promise, experimental results suggest that further advancements are needed to bridge the gap between their current effectiveness and human-level evaluation quality. Recognizing that best practices of human...
The Llama 3 Herd of Models

Abhimanyu Dubey, Abhinav Jauhri, Abhinav... | Aug 15th, 2024 | preprint

Modern artificial intelligence (AI) systems are powered by foundation models. This paper presents a new set of foundation models, called Llama 3. It is a herd of language models that natively support multilinguality, coding, reasoning, and tool usage. Our largest model is a dense Transformer with 405B parameters and a context window of up to 128K tokens. This paper presents an extensive empirical evaluation of Llama 3. We find that Llama 3 delivers comparable quality to leading language...

Last update from database: 04/06/2025, 21:15 (UTC)