CodeBenchGen: Creating Scalable Execution-based Code Generation Benchmarks

Article Status
Published
Authors/contributors
Xie, Y., Xie, A., Sheth, D., Liu, P., Fried, D., & Rose, C.
Title
CodeBenchGen: Creating Scalable Execution-based Code Generation Benchmarks
Abstract
To facilitate the evaluation of code generation systems across diverse scenarios, we present CodeBenchGen, a framework for creating scalable execution-based benchmarks that requires only light guidance from humans. Specifically, we leverage a large language model (LLM) to convert an arbitrary piece of code into an evaluation example, including test cases for execution-based evaluation. We illustrate the usefulness of our framework by creating a dataset, Exec-CSN, which includes 1,931 examples involving 293 libraries, revised from code in 367 GitHub repositories taken from the CodeSearchNet dataset. To demonstrate the complexity and solvability of examples in Exec-CSN, we present a human study showing that 81.3% of the examples can be solved by humans and that 61% are rated as "requires effort to solve". We conduct code generation experiments on open-source and proprietary models and analyze the performance of both humans and models. We will release the code of both the framework and the dataset upon acceptance.
Repository
arXiv
Archive ID
arXiv:2404.00566
Date
2024-03-31
Accessed
2024-04-17, 22:14
Short Title
CodeBenchGen
Library Catalogue
Extra
arXiv:2404.00566 [cs]
Citation
Xie, Y., Xie, A., Sheth, D., Liu, P., Fried, D., & Rose, C. (2024). CodeBenchGen: Creating Scalable Execution-based Code Generation Benchmarks (arXiv:2404.00566). arXiv. http://arxiv.org/abs/2404.00566