FlexEval: a customizable tool for chatbot performance evaluation and dialogue analysis

Article Status
Published
Authors/contributors
Christie, S., Moreau-Pernet, B., Tian, Y., & Whitmer, J.
Title
FlexEval: a customizable tool for chatbot performance evaluation and dialogue analysis
Abstract
Large language models (LLMs) are increasingly being deployed in user-facing applications in educational settings. Deployed applications often augment LLMs with fine-tuning, custom system prompts, and moderation layers to achieve particular goals. However, the behaviors of LLM-powered systems are difficult to guarantee, and most existing evaluations focus instead on the performance of unmodified 'foundation' models. Tools for evaluating such deployed systems are currently sparse, inflexible, or difficult to use. In this paper, we introduce an open-source tool called FlexEval. FlexEval extends OpenAI Evals to allow developers to construct customized, comprehensive automated evaluations of both pre-production and live conversational systems. FlexEval runs locally and can be easily modified to meet the needs of application developers. Developers can evaluate new LLM applications by creating function-based or machine-graded metrics and obtaining results for chat completions or entire conversations. To illustrate FlexEval's utility, we share two use cases involving content moderation and utterance classification. We built FlexEval to lower the effort required to implement automated testing and evaluation of LLM applications. The code is available on GitHub.
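To make the abstract's "function-based metric" idea concrete, a minimal sketch follows. It assumes, hypothetically, that a metric is a plain Python function that receives either a single chat completion string or a whole conversation and returns a number; the function names and signatures here are illustrative only and are not taken from FlexEval's documentation.

```python
# Sketch of two hypothetical function-based metrics, one operating on a
# single chat completion and one on an entire conversation. The
# convention that a metric is "a function returning a number" is an
# assumption for illustration, not FlexEval's actual API.

def utterance_word_count(completion: str) -> int:
    """Count the words in a single chatbot completion."""
    return len(completion.split())

def assistant_turn_count(conversation: list[dict]) -> int:
    """Count assistant turns in a conversation, where each turn is a
    dict like {"role": "assistant", "content": "..."}."""
    return sum(1 for turn in conversation if turn.get("role") == "assistant")

# Example usage:
print(utterance_word_count("Hello there, student!"))  # -> 3
convo = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello! How can I help?"},
]
print(assistant_turn_count(convo))  # -> 1
```

Separating per-completion metrics from per-conversation metrics mirrors the abstract's distinction between "results for chat completions or entire conversations."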
Date
2024-07-24
Short Title
FlexEval
Library Catalogue
ResearchGate
Citation
Christie, S., Moreau-Pernet, B., Tian, Y., & Whitmer, J. (2024, July 24). FlexEval: a customizable tool for chatbot performance evaluation and dialogue analysis. https://doi.org/10.5281/zenodo.12729993