Evaluating and Optimizing Educational Content with Large Language Model Judgments

Article Status
Published
Authors/contributors
He-Yueya, J., Goodman, N. D., & Brunskill, E.
Title
Evaluating and Optimizing Educational Content with Large Language Model Judgments
Abstract
Creating effective educational materials generally requires expensive and time-consuming studies of student learning outcomes. To overcome this barrier, one idea is to build computational models of student learning and use them to optimize instructional materials. However, it is difficult to model the cognitive processes of learning dynamics. We propose an alternative approach that uses Language Models (LMs) as educational experts to assess the impact of various instructions on learning outcomes. Specifically, we use GPT-3.5 to evaluate the overall effect of instructional materials on different student groups and find that it can replicate well-established educational findings such as the Expertise Reversal Effect and the Variability Effect. This demonstrates the potential of LMs as reliable evaluators of educational content. Building on this insight, we introduce an instruction optimization approach in which one LM generates instructional materials using the judgments of another LM as a reward function. We apply this approach to create math word problem worksheets aimed at maximizing student learning gains. Human teachers' evaluations of these LM-generated worksheets show a significant alignment between the LM judgments and human teacher preferences. We conclude by discussing potential divergences between human and LM opinions and the resulting pitfalls of automating instructional design.
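The optimization approach the abstract describes (one LM generating instructional materials, another LM's judgments serving as the reward) can be sketched as a simple best-of-n selection loop. This is a hypothetical illustration, not the paper's implementation: `generate_variants` and `judge_score` are stubbed stand-ins for the generator and judge LMs, and all names are assumptions.

```python
import random
import zlib

def generate_variants(seed_prompt, n=4):
    """Stand-in for a generator LM: returns n candidate worksheets.
    A real system would sample n completions from an LM."""
    return [f"{seed_prompt} (variant {i})" for i in range(n)]

def judge_score(worksheet, student_group="novice"):
    """Stand-in for a judge LM: returns a scalar reward for a worksheet.
    A real system would prompt e.g. GPT-3.5 to rate the expected learning
    gain for the given student group. Here we derive a deterministic
    pseudo-random score so the sketch is runnable."""
    rng = random.Random(zlib.crc32(f"{worksheet}|{student_group}".encode()))
    return rng.random()

def optimize_worksheet(seed_prompt, student_group="novice", n=4):
    """Best-of-n selection: keep the candidate the judge scores highest,
    i.e. use the judge's output as the reward function."""
    candidates = generate_variants(seed_prompt, n)
    return max(candidates, key=lambda w: judge_score(w, student_group))

best = optimize_worksheet("Math word problems on fractions")
print(best)
```

Because the judge conditions on the student group, the same seed prompt can select different variants for novices versus experts, which is how a judge that reproduces the Expertise Reversal Effect would steer the generator toward group-appropriate materials.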
Repository
arXiv
Archive ID
arXiv:2403.02795
Date
May 6th, 2024
Accessed
June 20th, 2024, 08:21
Library Catalogue
Extra
arXiv:2403.02795 [cs]
Citation
He-Yueya, J., Goodman, N. D., & Brunskill, E. (2024). Evaluating and Optimizing Educational Content with Large Language Model Judgments (arXiv:2403.02795). arXiv. https://doi.org/10.48550/arXiv.2403.02795