Benchmark Self-Evolving: A Multi-Agent Framework for Dynamic LLM Evaluation
Article Status
Published
Authors/contributors
- Wang, Siyuan (Author)
- Long, Zhuohan (Author)
- Fan, Zhihao (Author)
- Wei, Zhongyu (Author)
- Huang, Xuanjing (Author)
Title
Benchmark Self-Evolving: A Multi-Agent Framework for Dynamic LLM Evaluation
Abstract
This paper presents a benchmark self-evolving framework to dynamically evaluate rapidly advancing Large Language Models (LLMs), aiming for a more accurate assessment of their capabilities and limitations. We utilize a multi-agent system to manipulate the context or question of original instances, reframing new evolving instances with high confidence that dynamically extend existing benchmarks. Toward a more scalable, robust, and fine-grained evaluation, we implement six reframing operations to construct evolving instances that test LLMs against diverse queries and data noise and probe their problem-solving sub-abilities. With this framework, we extend the benchmark datasets of four tasks. Experimental results show a general performance decline in most LLMs relative to their original results. This decline under our scalable and robust evaluations, alongside our fine-grained evaluation, more accurately reflects the models' capabilities. Moreover, our framework widens performance discrepancies both between different models and within the same model across various tasks, facilitating more informed model selection for specific tasks (code and data are available at https://github.com/NanshineLoong/Self-Evolving-Benchmark).
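The abstract describes a multi-agent loop in which one agent reframes an original benchmark instance and the result is kept only when it can be verified with high confidence. The sketch below is an illustrative assumption of what such a loop could look like, not the authors' released implementation (see the GitHub repository linked above); the names ask_llm, reframe_instance, and verify are hypothetical placeholders.

```python
# Minimal sketch of a reframe-then-verify loop, assuming a generic LLM client.
# This is NOT the authors' code; all function names here are placeholders.

def ask_llm(prompt: str) -> str:
    """Stand-in for a call to an actual LLM API; replace with a real client."""
    return "stubbed response"

def reframe_instance(instance: dict, operation: str) -> dict:
    """Ask a 'creator' agent to apply one reframing operation
    (e.g., paraphrase the question or inject noise into the context)."""
    prompt = (
        f"Apply the '{operation}' operation to this benchmark instance.\n"
        f"Context: {instance['context']}\nQuestion: {instance['question']}\n"
        "Return the new question and its answer."
    )
    raw_reply = ask_llm(prompt)
    # Parsing of the agent's reply is elided: the raw reply is stored as the
    # new question and the original answer is kept as a placeholder.
    return {"context": instance["context"],
            "question": raw_reply,
            "answer": instance["answer"]}

def verify(original: dict, evolved: dict) -> bool:
    """A second 'reviewer' agent checks the evolved instance, so that only
    high-confidence instances extend the benchmark."""
    verdict = ask_llm(
        f"Original: {original}\nEvolved: {evolved}\n"
        "Is the evolved instance well-formed and correctly answered? yes/no"
    )
    return verdict.strip().lower().startswith("yes")

def evolve(instance: dict, operations: list[str]) -> list[dict]:
    """Apply each reframing operation and keep only verified instances."""
    evolved = []
    for op in operations:
        candidate = reframe_instance(instance, op)
        if verify(instance, candidate):
            evolved.append(candidate)
    return evolved
```

In practice, ask_llm would wrap whichever model API is used for the agents, and the list of operations would correspond to the paper's six reframing operations.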
Repository
arXiv
Archive ID
arXiv:2402.11443
Date
2024-02-17
Citation Key
wang2024d
Accessed
04/09/2024, 14:12
Short Title
Benchmark Self-Evolving
Library Catalogue
Extra
arXiv:2402.11443 [cs]
<Title>: Benchmark Self-Evolving: A Multi-Agent Framework for Dynamic LLM Evaluation
Citation
Wang, S., Long, Z., Fan, Z., Wei, Z., & Huang, X. (2024). Benchmark Self-Evolving: A Multi-Agent Framework for Dynamic LLM Evaluation (arXiv:2402.11443). arXiv. http://arxiv.org/abs/2402.11443