Instruction‐Tuned Large‐Language Models for Quality Control in Automatic Item Generation: A Feasibility Study

Gorgun, Guher; Bulut, Okan

doi:10.1111/emip.12663

Article Status

Published

Authors/contributors

Gorgun, Guher (Author)
Bulut, Okan (Author)

Title

Instruction‐Tuned Large‐Language Models for Quality Control in Automatic Item Generation: A Feasibility Study

Abstract

Automatic item generation may supply many items instantly and efficiently to assessment and learning environments. Yet the evaluation of item quality remains a bottleneck for deploying generated items in learning and assessment settings. In this study, we investigated the utility of using large‐language models, specifically Llama 3‐8B, for evaluating automatically generated cloze items. The trained large‐language model was able to filter out the majority of good and bad items accurately. Evaluating items automatically with instruction‐tuned LLMs may help educators and test developers understand the quality of generated items in an efficient and scalable manner. The item evaluation process with LLMs may also act as an intermediate step between item creation and field testing to reduce the cost and time associated with multiple rounds of revision.

Publication

Educational Measurement: Issues and Practice

Date

2024-12-19

Volume

44

Issue

1

Pages

96-107

Journal Abbr

Educational Measurement

DOI

10.1111/emip.12663

Citation Key

gorgun2024

URL

https://onlinelibrary.wiley.com/doi/10.1111/emip.12663

Accessed

07/02/2025, 15:32

ISSN

0731-1745

Short Title

Instruction‐Tuned Large‐Language Models for Quality Control in Automatic Item Generation

Language

en

Library Catalogue

DOI.org (Crossref)

Extra

<Title>: Instruction‐Tuned Large‐Language Models for Quality Control in Automatic Item Generation: A Feasibility Study
<AI Smry>: The utility of using large‐language models, specifically Llama 3‐8B, for evaluating automatically generated cloze items was investigated, and the trained large‐language model was able to filter out the majority of good and bad items accurately.
Read_Status: New
Read_Status_Date: 2026-01-26T11:33:53.529Z

Citation

Gorgun, G., & Bulut, O. (2024). Instruction‐Tuned Large‐Language Models for Quality Control in Automatic Item Generation: A Feasibility Study. Educational Measurement: Issues and Practice, 44(1), 96–107. https://doi.org/10.1111/emip.12663

Link to this record

https://aievidencehub.org/lib/CYJH453R