FlexEval: a customizable tool for chatbot performance evaluation and dialogue analysis

Article Status
Published
Authors/contributors
Christie, S., Moreau-Pernet, B., Tian, Y., & Whitmer, J.
Title
FlexEval: a customizable tool for chatbot performance evaluation and dialogue analysis
Abstract
Large language models (LLMs) are increasingly being deployed in user-facing applications in educational settings. Deployed applications often augment LLMs with fine-tuning, custom system prompts, and moderation layers to achieve particular goals. However, the behaviors of LLM-powered systems are difficult to guarantee, and most existing evaluations focus instead on the performance of unmodified 'foundation' models. Tools for evaluating such deployed systems are currently sparse, inflexible, or difficult to use. In this paper, we introduce an open-source tool called FlexEval. FlexEval extends OpenAI Evals to allow developers to construct customized, comprehensive automated evaluations of both pre-production and live conversational systems. FlexEval runs locally and can be easily modified to meet the needs of application developers. Developers can evaluate new LLM applications by creating function-based or machine-graded metrics and obtaining results for chat completions or entire conversations. To illustrate FlexEval's utility, we share two use cases involving content moderation and utterance classification. We built FlexEval to lower the effort required to implement automated testing and evaluation of LLM applications. The code is available on GitHub.
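To make the abstract's "function-based metric" idea concrete, a minimal sketch follows. It assumes, hypothetically, that a metric is a plain Python function that receives either a single chat completion string or a whole conversation and returns a number; the function names and signatures here are illustrative only and are not taken from FlexEval's documentation.

```python
# Sketch of two hypothetical function-based metrics, one operating on a
# single chat completion and one on an entire conversation. The
# convention that a metric is "a function returning a number" is an
# assumption for illustration, not FlexEval's actual API.

def utterance_word_count(completion: str) -> int:
    """Count the words in a single chatbot completion."""
    return len(completion.split())

def assistant_turn_count(conversation: list[dict]) -> int:
    """Count assistant turns in a conversation, where each turn is a
    dict like {"role": "assistant", "content": "..."}."""
    return sum(1 for turn in conversation if turn.get("role") == "assistant")

# Example usage:
print(utterance_word_count("Hello there, student!"))  # -> 3
convo = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello! How can I help?"},
]
print(assistant_turn_count(convo))  # -> 1
```

Separating per-completion metrics from per-conversation metrics mirrors the abstract's distinction between "results for chat completions or entire conversations."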
Date
2024-07-24
Short Title
FlexEval
Library Catalogue
ResearchGate
Citation
Christie, S., Moreau-Pernet, B., Tian, Y., & Whitmer, J. (2024, July 24). FlexEval: a customizable tool for chatbot performance evaluation and dialogue analysis. https://doi.org/10.5281/zenodo.12729993