Examining Bias in Automated Scoring of Reading Comprehension Items
Article Status
Published
Authors/contributors
- Lottridge, Susan (Author)
- Young, Mackenzie (Author)
Title
Examining Bias in Automated Scoring of Reading Comprehension Items
Abstract
The use of automated scoring (AS) of constructed responses has become increasingly common in K-12 formative, interim, and summative assessment programs. AS has been shown to perform well in essay writing, reading comprehension, and mathematics. However, less is known about how automated scoring engines perform for key subgroups such as gender, race/ethnicity, English proficiency status, disability status, and economic status. Bias evaluations have focused primarily on mean score differences that may occur between the engine scores and the human scores within a subgroup, and most published work has been on international examinees using ETS' e-rater system. Gregg et al. (2021) recently examined bias for limited English proficient examinees in United States K-12 assessments on writing and mathematics items, examining engine performance for the presence of bias and examining which elements of the engine processing steps may contribute to bias. This study builds upon the Gregg study by extending the examination to a broader set of subgroups, including gender (Male/Female), economic status (Title I/non-Title I), special education status (SPED/non-SPED), and race/ethnicity (Asian, Black, Hispanic, White). The study examined the performance of the engine on 24 reading comprehension items in grades 3-8 and 11. Bias was examined on subgroups using the full set of data and when groups were matched using propensity score matching on ability and subgroup covariates. In addition to the usual agreement metrics of quadratic weighted kappa (QWK), exact agreement, and standardized mean difference (SMD), the study also used agreement matrices to examine results for possible bias in patterns of rubric score applications, both for human raters and for the automated scoring engine. These matrices show the level of agreement between the second human rater and the first human rater, which can be compared to the level of agreement between the engine and the first human rater. The value of such analyses is that they help to illustrate where in the rubric scale any errors or bias may appear, and they also help to better understand the pattern of score distributions across the scale. The results of the study indicated that, across methods, the engine showed little evidence of bias for most subgroups on the full sample but showed more evidence of bias when groups were matched. Additionally, the engine showed some bias, as indicated by lower agreements when conditioned on the human score, compared to other subgroups or a matched counterpart at some, but not all, points in the rubric scale. Future work will examine matched responses where the engine does appear to behave differently at locations of the rubric scale (e.g., score point 2 for Black, White, non-Title I, and SPED examinees). Such a review may help to identify whether there are patterns in responses that may be causing issues in how the engine is modeling the responses. This review may also uncover any trends in human scoring.
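The abstract describes comparing subgroups on the full sample and again after propensity score matching on ability and subgroup covariates. A minimal sketch of that general approach appears below, assuming a pandas DataFrame with hypothetical columns (a binary subgroup indicator plus ability and other covariates); it is not the authors' implementation.

```python
# Hypothetical sketch of 1:1 nearest-neighbor propensity score matching.
# Column names (group indicator, ability, covariates) are illustrative
# assumptions, not the variables used in the study.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def match_groups(df: pd.DataFrame, group_col: str, covariates: list[str]) -> pd.DataFrame:
    """Match each focal-group member (group_col == 1) to its nearest
    comparison-group member on the estimated propensity score."""
    X = df[covariates].to_numpy()
    y = df[group_col].to_numpy()

    # Estimate the propensity of focal-group membership from the covariates.
    propensity = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]
    df = df.assign(propensity=propensity)

    focal = df[df[group_col] == 1]
    comparison = df[df[group_col] == 0].copy()

    matched_rows = []
    for _, row in focal.iterrows():
        if comparison.empty:
            break
        # Greedy nearest-neighbor match without replacement.
        idx = (comparison["propensity"] - row["propensity"]).abs().idxmin()
        matched_rows.extend([row, comparison.loc[idx]])
        comparison = comparison.drop(index=idx)

    return pd.DataFrame(matched_rows)

# Example usage with made-up columns:
# matched = match_groups(responses, group_col="sped", covariates=["ability", "grade"])
```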
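The evaluation relies on quadratic weighted kappa (QWK), exact agreement, standardized mean difference (SMD), and agreement matrices that place engine-versus-first-human agreement alongside second-human-versus-first-human agreement. A short sketch of how those quantities could be computed for one item follows; the function, argument names, and pooled-SD form of the SMD are illustrative assumptions rather than the study's code.

```python
# Illustrative computation of QWK, exact agreement, SMD, and an agreement
# matrix for a single item; not the authors' implementation.
import numpy as np
from sklearn.metrics import cohen_kappa_score, confusion_matrix

def agreement_summary(human1, human2, engine, score_points):
    """Compare human2-vs-human1 agreement to engine-vs-human1 agreement."""
    labels = list(score_points)

    def metrics(rater, reference):
        rater, reference = np.asarray(rater), np.asarray(reference)
        qwk = cohen_kappa_score(reference, rater, weights="quadratic", labels=labels)
        exact = float(np.mean(rater == reference))
        # Standardized mean difference: (rater mean - reference mean) / pooled SD.
        pooled_sd = np.sqrt((rater.var(ddof=1) + reference.var(ddof=1)) / 2)
        smd = float((rater.mean() - reference.mean()) / pooled_sd)
        # Rows = first-human score, columns = rater score; row-normalizing the
        # counts gives agreement conditioned on the human score point.
        matrix = confusion_matrix(reference, rater, labels=labels)
        return {"qwk": qwk, "exact": exact, "smd": smd, "matrix": matrix}

    return {"human2_vs_human1": metrics(human2, human1),
            "engine_vs_human1": metrics(engine, human1)}

# Example with made-up scores on a 0-2 rubric:
# summary = agreement_summary(h1_scores, h2_scores, engine_scores, score_points=range(3))
```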
Date
2022
Conference Name
NCME
Extra
Citation Key: lottridge2022
Citation
Lottridge, S., & Young, M. (2022). Examining Bias in Automated Scoring of Reading Comprehension Items. NCME.