Benchmark Self-Evolving: A Multi-Agent Framework for Dynamic LLM Evaluation
Article Status
Published
Authors/contributors
- Wang, Siyuan (Author)
- Long, Zhuohan (Author)
- Fan, Zhihao (Author)
- Wei, Zhongyu (Author)
- Huang, Xuanjing (Author)
Title
Benchmark Self-Evolving: A Multi-Agent Framework for Dynamic LLM Evaluation
Abstract
This paper presents a benchmark self-evolving framework to dynamically evaluate rapidly advancing Large Language Models (LLMs), aiming for a more accurate assessment of their capabilities and limitations. We utilize a multi-agent system to manipulate the context or question of original instances, reframing new evolving instances with high confidence that dynamically extend existing benchmarks. Toward a more scalable, robust, and fine-grained evaluation, we implement six reframing operations to construct evolving instances that test LLMs against diverse queries and data noise and probe their problem-solving sub-abilities. With this framework, we extend the benchmark datasets of four tasks. Experimental results show a general performance decline in most LLMs relative to their original results. This decline under our scalable and robust evaluations, alongside our fine-grained evaluation, more accurately reflects the models' capabilities. Moreover, our framework widens performance discrepancies both between different models and within the same model across various tasks, facilitating more informed model selection for specific tasks (code and data are available at https://github.com/NanshineLoong/Self-Evolving-Benchmark).
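The abstract describes a multi-agent loop in which one agent reframes an original benchmark instance and the result is kept only when it can be verified with high confidence. The sketch below is an illustrative assumption of what such a loop could look like, not the authors' released implementation (see the GitHub repository linked above); the names ask_llm, reframe_instance, and verify are hypothetical placeholders.

```python
# Minimal sketch of a reframe-then-verify loop, assuming a generic LLM client.
# This is NOT the authors' code; all function names here are placeholders.

def ask_llm(prompt: str) -> str:
    """Stand-in for a call to an actual LLM API; replace with a real client."""
    return "stubbed response"

def reframe_instance(instance: dict, operation: str) -> dict:
    """Ask a 'creator' agent to apply one reframing operation
    (e.g., paraphrase the question or inject noise into the context)."""
    prompt = (
        f"Apply the '{operation}' operation to this benchmark instance.\n"
        f"Context: {instance['context']}\nQuestion: {instance['question']}\n"
        "Return the new question and its answer."
    )
    raw_reply = ask_llm(prompt)
    # Parsing of the agent's reply is elided: the raw reply is stored as the
    # new question and the original answer is kept as a placeholder.
    return {"context": instance["context"],
            "question": raw_reply,
            "answer": instance["answer"]}

def verify(original: dict, evolved: dict) -> bool:
    """A second 'reviewer' agent checks the evolved instance, so that only
    high-confidence instances extend the benchmark."""
    verdict = ask_llm(
        f"Original: {original}\nEvolved: {evolved}\n"
        "Is the evolved instance well-formed and correctly answered? yes/no"
    )
    return verdict.strip().lower().startswith("yes")

def evolve(instance: dict, operations: list[str]) -> list[dict]:
    """Apply each reframing operation and keep only verified instances."""
    evolved = []
    for op in operations:
        candidate = reframe_instance(instance, op)
        if verify(instance, candidate):
            evolved.append(candidate)
    return evolved
```

In practice, ask_llm would wrap whichever model API is used for the agents, and the list of operations would correspond to the paper's six reframing operations.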
Repository
arXiv
Archive ID
arXiv:2402.11443
Date
2024-02-17
Citation Key
wang2024d
Accessed
04/09/2024, 14:12
Short Title
Benchmark Self-Evolving
Library Catalogue
Extra
arXiv:2402.11443 [cs]
<Title>: Benchmark Self-Evolving: A Multi-Agent Framework for Dynamic LLM Evaluation
Citation
Wang, S., Long, Z., Fan, Z., Wei, Z., & Huang, X. (2024). Benchmark Self-Evolving: A Multi-Agent Framework for Dynamic LLM Evaluation (arXiv:2402.11443). arXiv. http://arxiv.org/abs/2402.11443