Can ChatGPT and Bard generate aligned assessment items? A reliability analysis against human performance

Abdolvahab Khademi, University of Massachusetts, Am

doi:10.37074/jalt.2023.6.1.28

Article Status

Published

Author/contributor

Abdolvahab Khademi, University of Massachusetts, Am (Author)

Title

Can ChatGPT and Bard generate aligned assessment items? A reliability analysis against human performance

Abstract

ChatGPT and Bard are AI chatbots based on Large Language Models (LLMs) that promise applications in diverse areas. In education, these AI technologies have been tested for applications in assessment and teaching. In assessment, AI has long been used in automated essay scoring and automated item generation. One psychometric property that these tools must have to assist or replace humans in assessment is high reliability, in terms of agreement between AI scores and human raters. In this paper, we measure the reliability of the OpenAI ChatGPT and Google Bard LLM tools against experienced and trained humans in perceiving and rating the complexity of writing prompts. Using the intraclass correlation (ICC) as a performance metric, we found that the inter-rater reliability of both OpenAI ChatGPT and Google Bard was low against the gold standard of human ratings.
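
The abstract names the intraclass correlation (ICC) as its agreement metric but does not say which ICC form was used. As a rough illustration only, the sketch below computes one common variant, ICC(2,1) (two-way random effects, absolute agreement, single rater), from a table of ratings. The choice of ICC(2,1), the function name `icc2_1`, and the sample ratings are all assumptions made for this example; they are not taken from the paper.

```python
def icc2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    ratings: list of rows, one row per rated item (e.g. a writing prompt),
    one column per rater (e.g. a human rater, ChatGPT, Bard).
    Assumed form for illustration; the paper may have used another ICC variant.
    """
    n = len(ratings)                      # number of rated items
    k = len(ratings[0]) if n else 0       # number of raters
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need a complete table with >= 2 items and >= 2 raters")

    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares (no replication).
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)               # between-item mean square
    msc = ss_cols / (k - 1)               # between-rater mean square
    mse = ss_err / ((n - 1) * (k - 1))    # residual mean square

    denom = msr + (k - 1) * mse + k * (msc - mse) / n
    if denom == 0:
        raise ValueError("ICC is undefined when all ratings are identical")
    return (msr - mse) / denom


# Hypothetical complexity ratings: columns are (human, model), rows are prompts.
example = [[3, 2], [4, 4], [2, 3], [5, 3], [1, 2]]
print(round(icc2_1(example), 3))
```

Values near 1 indicate close agreement between raters; values near 0 indicate that the raters' scores carry little shared information about the items.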

Publication

Journal of Applied Learning & Teaching

Date

2023-05-10

Volume

6

Issue

1

Journal Abbr

JALT

DOI

10.37074/jalt.2023.6.1.28

Citation Key

khademikhademiuniversityofmassachusettsam2023

URL

https://journals.sfu.ca/jalt/index.php/jalt/article/view/783

Accessed

17/09/2025, 23:57

ISSN

2591-801X

Short Title

Can ChatGPT and Bard generate aligned assessment items?

Language

en

Library Catalogue

DOI.org (Crossref)

Citation

Khademi, A. (2023). Can ChatGPT and Bard generate aligned assessment items? A reliability analysis against human performance. Journal of Applied Learning & Teaching, 6(1). https://doi.org/10.37074/jalt.2023.6.1.28

Link to this record

https://aievidencehub.org/lib/FJYAVJSB