Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations
Article Status
Published
Authors/contributors
- Wang, Peiyi (Author)
- Li, Lei (Author)
- Shao, Zhihong (Author)
- Xu, R. X. (Author)
- Dai, Damai (Author)
- Li, Yifei (Author)
- Chen, Deli (Author)
- Wu, Y. (Author)
- Sui, Zhifang (Author)
Title
Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations
Abstract
In this paper, we present an innovative process-oriented math process reward model called Math-Shepherd, which assigns a reward score to each step of math problem solutions. The training of Math-Shepherd is achieved using automatically constructed process-wise supervision data, breaking the bottleneck of heavy reliance on manual annotation in existing work. We explore the effectiveness of Math-Shepherd in two scenarios: 1) Verification: Math-Shepherd is utilized for reranking multiple outputs generated by Large Language Models (LLMs); 2) Reinforcement Learning: Math-Shepherd is employed to reinforce LLMs with step-by-step Proximal Policy Optimization (PPO). With Math-Shepherd, a series of open-source LLMs demonstrates exceptional performance. For instance, the step-by-step PPO with Math-Shepherd significantly improves the accuracy of Mistral-7B (77.9% → 84.1% on GSM8K and 28.6% → 33.0% on MATH). The accuracy can be further enhanced to 89.1% and 43.5% on GSM8K and MATH with the verification of Math-Shepherd, respectively. We believe that automatic process supervision holds significant potential for the future evolution of LLMs.
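Note: the verification scenario in the abstract (reranking several LLM outputs by their per-step reward scores) can be illustrated with a minimal sketch. The abstract does not say how step scores are combined into one solution score, so the use of the minimum step score below is an assumption, and the function name `rerank_by_step_rewards` and the example scores are hypothetical.

```python
from typing import Callable, Sequence


def rerank_by_step_rewards(
    step_scores: Sequence[Sequence[float]],
    aggregate: Callable[[Sequence[float]], float] = min,  # assumption: weakest step decides
) -> int:
    """Return the index of the candidate solution with the best aggregated score.

    step_scores[i] holds the per-step rewards a process reward model gave to
    candidate i. Every candidate must contain at least one step.
    """
    if not step_scores:
        raise ValueError("need at least one candidate solution")
    # Collapse each candidate's step rewards into a single solution-level score.
    solution_scores = [aggregate(steps) for steps in step_scores]
    # Pick the candidate with the highest solution-level score.
    return max(range(len(solution_scores)), key=solution_scores.__getitem__)


# Hypothetical step rewards for three sampled solutions to one problem.
candidates = [
    [0.9, 0.2, 0.8],   # one weak step in the middle
    [0.7, 0.6, 0.75],  # consistently reasonable steps
    [0.95, 0.9, 0.1],  # strong start, weak final step
]
best = rerank_by_step_rewards(candidates)
```

With these made-up scores the second candidate is selected, because its weakest step still scores higher than the weakest step of either alternative. Passing a different `aggregate` callable (for example a product or a mean) changes only how a solution-level score is formed, not the reranking itself.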
Repository
arXiv
Archive ID
arXiv:2312.08935
Date
2024-02-19
Citation Key
wang2024
Accessed
24/07/2024, 14:54
Short Title
Math-Shepherd
Library Catalogue
Extra
arXiv:2312.08935 [cs]
<AI Smry>: Math-Shepherd is a process reward model for math that assigns a reward score to each step of a problem solution; the authors argue that such automatic process supervision holds significant potential for the future evolution of LLMs.
<Title>: Math-Shepherd: Verifying and Reinforcing Large Language Models Step by Step Without Human Annotation
Citation
Wang, P., Li, L., Shao, Z., Xu, R. X., Dai, D., Li, Y., Chen, D., Wu, Y., & Sui, Z. (2024). Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations (arXiv:2312.08935). arXiv. https://doi.org/10.48550/arXiv.2312.08935