Examining Bias in Automated Scoring of Reading Comprehension Items
Article Status
Published
Authors/contributors
- Lottridge, Susan (Author)
- Young, Mackenzie (Author)
Title
Examining Bias in Automated Scoring of Reading Comprehension Items
Abstract
The use of automated scoring (AS) of constructed responses has become increasingly common in K-12 formative, interim, and summative assessment programs. AS has been shown to perform well in essay writing, reading comprehension, and mathematics. However, less is known about how automated scoring engines perform for key subgroups such as gender, race/ethnicity, English proficiency status, disability status, and economic status. Bias evaluations have focused primarily on mean score differences that may occur between the engine scores and the human scores within a subgroup, and most published work has been on international examinees using ETS' e-rater system. Gregg et al. (2021) recently examined bias for limited English proficient examinees in United States K-12 assessments on writing and mathematics items, examining engine performance for the presence of bias and examining which elements of the engine processing steps may contribute to bias. This study builds upon the Gregg study by extending the examination to a broader set of subgroups, including gender (Male/Female), economic status (Title I/non-Title I), special education status (SPED/non-SPED), and race/ethnicity (Asian, Black, Hispanic, White). The study examined the performance of the engine on 24 reading comprehension items in grades 3-8 and 11. Bias was examined on subgroups using the full set of data and when groups were matched using propensity score matching on ability and subgroup covariates. In addition to the usual agreement metrics of quadratic weighted kappa (QWK), exact agreement, and standardized mean difference (SMD), the study also used agreement matrices to examine results for possible bias in patterns of rubric score applications, both for human raters and for the automated scoring engine. These matrices show the level of agreement between the second human rater and the first human rater, which can be compared to the level of agreement between the engine and the first human rater. The value of such analyses is that they help to illustrate where in the rubric scale any errors or bias may appear, and they also help to better understand the pattern of score distributions across the scale. The results of the study indicated that, across methods, the engine showed little evidence of bias for most subgroups on the full sample but showed more evidence of bias when groups were matched. Additionally, the engine showed some bias, as indicated by lower agreements when conditioned on the human score, compared to other subgroups or a matched counterpart at some, but not all, points in the rubric scale. Future work will examine matched responses where the engine does appear to behave differently at locations of the rubric scale (e.g., score point 2 for Black, White, non-Title I, and SPED examinees). Such a review may help to identify whether there are patterns in responses that may be causing issues in how the engine is modeling the responses. This review may also uncover any trends in human scoring.
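The abstract describes comparing subgroups on the full sample and again after propensity score matching on ability and subgroup covariates. A minimal sketch of that general approach appears below, assuming a pandas DataFrame with hypothetical columns (a binary subgroup indicator plus ability and other covariates); it is not the authors' implementation.

```python
# Hypothetical sketch of 1:1 nearest-neighbor propensity score matching.
# Column names (group indicator, ability, covariates) are illustrative
# assumptions, not the variables used in the study.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def match_groups(df: pd.DataFrame, group_col: str, covariates: list[str]) -> pd.DataFrame:
    """Match each focal-group member (group_col == 1) to its nearest
    comparison-group member on the estimated propensity score."""
    X = df[covariates].to_numpy()
    y = df[group_col].to_numpy()

    # Estimate the propensity of focal-group membership from the covariates.
    propensity = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]
    df = df.assign(propensity=propensity)

    focal = df[df[group_col] == 1]
    comparison = df[df[group_col] == 0].copy()

    matched_rows = []
    for _, row in focal.iterrows():
        if comparison.empty:
            break
        # Greedy nearest-neighbor match without replacement.
        idx = (comparison["propensity"] - row["propensity"]).abs().idxmin()
        matched_rows.extend([row, comparison.loc[idx]])
        comparison = comparison.drop(index=idx)

    return pd.DataFrame(matched_rows)

# Example usage with made-up columns:
# matched = match_groups(responses, group_col="sped", covariates=["ability", "grade"])
```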
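The evaluation relies on quadratic weighted kappa (QWK), exact agreement, standardized mean difference (SMD), and agreement matrices that place engine-versus-first-human agreement alongside second-human-versus-first-human agreement. A short sketch of how those quantities could be computed for one item follows; the function, argument names, and pooled-SD form of the SMD are illustrative assumptions rather than the study's code.

```python
# Illustrative computation of QWK, exact agreement, SMD, and an agreement
# matrix for a single item; not the authors' implementation.
import numpy as np
from sklearn.metrics import cohen_kappa_score, confusion_matrix

def agreement_summary(human1, human2, engine, score_points):
    """Compare human2-vs-human1 agreement to engine-vs-human1 agreement."""
    labels = list(score_points)

    def metrics(rater, reference):
        rater, reference = np.asarray(rater), np.asarray(reference)
        qwk = cohen_kappa_score(reference, rater, weights="quadratic", labels=labels)
        exact = float(np.mean(rater == reference))
        # Standardized mean difference: (rater mean - reference mean) / pooled SD.
        pooled_sd = np.sqrt((rater.var(ddof=1) + reference.var(ddof=1)) / 2)
        smd = float((rater.mean() - reference.mean()) / pooled_sd)
        # Rows = first-human score, columns = rater score; row-normalizing the
        # counts gives agreement conditioned on the human score point.
        matrix = confusion_matrix(reference, rater, labels=labels)
        return {"qwk": qwk, "exact": exact, "smd": smd, "matrix": matrix}

    return {"human2_vs_human1": metrics(human2, human1),
            "engine_vs_human1": metrics(engine, human1)}

# Example with made-up scores on a 0-2 rubric:
# summary = agreement_summary(h1_scores, h2_scores, engine_scores, score_points=range(3))
```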
Date
2022
Conference Name
NCME
Extra
Citation Key: lottridge2022
Citation
Lottridge, S., & Young, M. (2022). Examining Bias in Automated Scoring of Reading Comprehension Items. NCME.