32 resources

  • Fan Zhang, Joshua Wilson | Dec 14th, 2025 | conferencePaper
  • LiQin Wu, Yong Wu, XiangYang Zhang | Sep 26th, 2021 | journalArticle

    Although the study of artificial intelligence (AI) in language teaching and learning is increasingly prevalent, research on second-language (L2) learners' cognitive and psychological factors regarding AI writing corrective feedback (WCF) is scarce. This paper explores L2 learners' cognitive psychology of pigai, an AI evaluation system for English writing in China, from the perspectives of perception, noticing, uptake, initiative, retention, and emotion. It investigates the consistency between learner...

  • Mengxue Zhang, Neil Heffernan, Andrew La... | Jun 1st, 2023 | preprint

    Automated scoring of student responses to open-ended questions, including short-answer questions, has great potential to scale to a large number of responses. Recent approaches for automated scoring rely on supervised learning, i.e., training classifiers or fine-tuning language models on a small number of responses with human-provided score labels. However, since scoring is a subjective process, these human scores are noisy and can be highly variable, depending on the scorer. In this paper,...
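
    The supervised route described here amounts to training a text classifier on human-scored responses. A minimal sketch in Python (the data, score scale, and TF-IDF/logistic-regression choice are illustrative assumptions, not details from the paper):

        # Minimal supervised short-answer scorer: TF-IDF features plus
        # logistic regression over hypothetical human-labeled responses.
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.linear_model import LogisticRegression
        from sklearn.pipeline import make_pipeline

        responses = [
            "photosynthesis converts light energy into chemical energy",
            "plants eat sunlight",
            "the mitochondria is the powerhouse of the cell",
            "light energy becomes glucose via photosynthesis",
        ]
        human_scores = [2, 1, 0, 2]  # noisy human labels, scale 0-2

        scorer = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
        scorer.fit(responses, human_scores)
        print(scorer.predict(["photosynthesis turns light into chemical energy"]))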

  • Randy Elliot Bennett, Mo Zhang, Sandip S... | Dec 14th, 2021 | journalArticle

    This study examined differences in the composition processes used by educationally at-risk males and females who wrote essays as part of a high-school equivalency examination. Over 30,000 individuals were assessed, each taking one of 12 forms of the examination’s language arts writing subtest in 23 US states. Writing processes were inferred using features extracted from keystroke logs and aggregated into seven composite indicators. Results showed that females earned higher essay and total...
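
    Process indicators of this kind are typically derived from inter-keystroke intervals. A sketch of the idea (the log format and the two-second pause threshold are assumptions, not details from the study):

        # Simple writing-process features from a hypothetical keystroke log
        # of (timestamp_seconds, key) events.
        log = [(0.0, "T"), (0.2, "h"), (0.35, "e"), (3.1, " "), (3.3, "c"), (8.0, "a")]

        timestamps = [t for t, _ in log]
        intervals = [b - a for a, b in zip(timestamps, timestamps[1:])]

        features = {
            "total_time_s": timestamps[-1] - timestamps[0],
            "mean_interval_s": sum(intervals) / len(intervals),
            # long pauses are often read as planning or revision episodes
            "long_pauses": sum(1 for d in intervals if d >= 2.0),
        }
        print(features)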

  • Yi Zheng, Steven Nydick, Sijia Huang | Apr 12th, 2023 | conferencePaper

    The recent surge of machine learning (ML) has impacted many disciplines, including educational and psychological measurement (hereafter shortened as measurement, “M”). The measurement literature has seen a rapid growth in studies that explore using ML methods to solve measurement problems. However, there exist gaps between the typical paradigm of ML and fundamental principles of measurement. The MxML project was created to explore how the measurement community might potentially redefine the...

  • Ziang Xiao, Susu Zhang, Vivian Lai | Oct 22nd, 2023 | preprint

    We address a fundamental challenge in Natural Language Generation (NLG) model evaluation -- the design and evaluation of evaluation metrics. Recognizing the limitations of existing automatic metrics and the noise introduced by how current human evaluation is conducted, we propose MetricEval, a framework informed by measurement theory, the foundation of educational test design, for conceptualizing and evaluating the reliability and validity of NLG evaluation metrics. The framework formalizes the source...
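
    In measurement terms, validity evidence for a metric is often summarized as its correlation with human judgments, and reliability as the stability of scores across repeated ratings. A toy illustration of those two checks (generic measurement practice, not the MetricEval framework itself; the numbers are invented):

        # Toy validity/reliability check for an NLG metric against human ratings.
        from scipy.stats import pearsonr, spearmanr

        metric_scores = [0.81, 0.42, 0.65, 0.90, 0.33]  # metric on 5 outputs
        human_round_1 = [4.5, 2.0, 3.5, 5.0, 1.5]       # mean human ratings
        human_round_2 = [4.0, 2.5, 3.5, 5.0, 2.0]       # second rating round

        validity_r, _ = pearsonr(metric_scores, human_round_1)    # convergent validity
        validity_rho, _ = spearmanr(metric_scores, human_round_1) # rank agreement
        reliability_r, _ = pearsonr(human_round_1, human_round_2) # test-retest proxy
        print(f"validity r={validity_r:.2f}, rho={validity_rho:.2f}; "
              f"human test-retest r={reliability_r:.2f}")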

  • Cong Wang, Xiufeng Liu, Lei Wang | Apr 14th, 2021 | journalArticle
  • Cong Wang, Xiufeng Liu, Lei Wang | Sep 9th, 2020 | journalArticle
  • Rose E. Wang, Qingyang Zhang, Carly Robi... | Dec 14th, 2024 | preprint

    Scaling high-quality tutoring remains a major challenge in education. Due to growing demand, many platforms employ novice tutors who, unlike experienced educators, struggle to address student mistakes and thus fail to seize prime learning opportunities. Our work explores the potential of large language models (LLMs) to close the novice-expert knowledge gap in remediating math mistakes. We contribute Bridge, a method that uses cognitive task analysis to translate an expert's latent thought...

  • Tianyi Zhang, Varsha Kishore, Felix Wu | Feb 24th, 2020 | preprint

    We propose BERTScore, an automatic evaluation metric for text generation. Analogously to common metrics, BERTScore computes a similarity score for each token in the candidate sentence with each token in the reference sentence. However, instead of exact matches, we compute token similarity using contextual embeddings. We evaluate using the outputs of 363 machine translation and image captioning systems. BERTScore correlates better with human judgments and provides stronger model selection...
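
    The core computation is a greedy soft match over contextual token embeddings. A self-contained sketch of that mechanic (random vectors stand in for BERT embeddings; the paper's idf weighting and baseline rescaling are omitted):

        # BERTScore-style greedy matching over token embeddings.
        import numpy as np

        rng = np.random.default_rng(0)
        cand = rng.normal(size=(5, 768))  # candidate sentence: 5 token embeddings
        ref = rng.normal(size=(6, 768))   # reference sentence: 6 token embeddings

        # cosine similarity between every candidate/reference token pair
        cand_n = cand / np.linalg.norm(cand, axis=1, keepdims=True)
        ref_n = ref / np.linalg.norm(ref, axis=1, keepdims=True)
        sim = cand_n @ ref_n.T  # shape (5, 6)

        precision = sim.max(axis=1).mean()  # each candidate token's best match
        recall = sim.max(axis=0).mean()     # each reference token's best match
        f1 = 2 * precision * recall / (precision + recall)
        print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")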

  • Han Yang, Mingchen Li, Huixue Zhou | Dec 24th, 2023 | preprint

    To enhance the accuracy and reliability of diverse medical question-answering (QA) tasks and to investigate efficient approaches for deploying large language model (LLM) technologies, we developed a novel ensemble learning pipeline built on state-of-the-art LLMs, focusing on improving performance across diverse medical QA datasets. Materials and Methods: Our study employs three medical QA datasets: PubMedQA, MedQA-USMLE, and MedMCQA, each presenting unique challenges in biomedical...
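
    For multiple-choice sets like MedQA-USMLE and MedMCQA, one common ensembling rule is a majority vote over the answers of several models. A minimal sketch (the voting rule is a generic choice, not necessarily the paper's pipeline; the answers are invented):

        # Majority-vote ensemble over per-model answers to multiple-choice items.
        from collections import Counter

        model_answers = {
            "llm_a": ["A", "C", "B", "D"],
            "llm_b": ["A", "B", "B", "D"],
            "llm_c": ["C", "C", "B", "A"],
        }

        ensembled = []
        for per_item in zip(*model_answers.values()):
            # most_common(1) breaks ties by first-seen order
            ensembled.append(Counter(per_item).most_common(1)[0][0])
        print(ensembled)  # ['A', 'C', 'B', 'D']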

  • Changrong Xiao, Wenxing Ma, Qingping Son... | Mar 3rd, 2025 | preprint

    Receiving timely and personalized feedback is essential for second-language learners, especially when human instructors are unavailable. This study explores the effectiveness of Large Language Models (LLMs), including both proprietary and open-source models, for Automated Essay Scoring (AES). Through extensive experiments with public and private datasets, we find that while LLMs do not surpass conventional state-of-the-art (SOTA) grading models in performance, they exhibit notable...
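
    Such comparisons between LLM graders and conventional AES models are commonly reported with quadratic weighted kappa (QWK), which penalizes disagreements by their squared distance on the score scale. A compact implementation (standard AES practice; the paper's exact evaluation setup is not shown here):

        # Quadratic weighted kappa between human and model essay scores.
        import numpy as np

        def quadratic_weighted_kappa(a, b, n_classes):
            a, b = np.asarray(a), np.asarray(b)
            observed = np.zeros((n_classes, n_classes))
            for i, j in zip(a, b):
                observed[i, j] += 1
            expected = np.outer(np.bincount(a, minlength=n_classes),
                                np.bincount(b, minlength=n_classes)) / len(a)
            idx = np.arange(n_classes)
            weights = (idx[:, None] - idx[None, :]) ** 2 / (n_classes - 1) ** 2
            return 1 - (weights * observed).sum() / (weights * expected).sum()

        human = [2, 3, 1, 4, 2, 3]
        model = [2, 3, 2, 4, 1, 3]
        print(round(quadratic_weighted_kappa(human, model, n_classes=5), 3))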

  • Yu Li, Shenyu Zhang, Rui Wu | Dec 14th, 2024 | preprint

    Recent advancements in generative large language models (LLMs) have been remarkable; however, the quality of the text generated by these models often reveals persistent issues. Evaluating the quality of text generated by these models, especially open-ended text, has consistently presented a significant challenge. Addressing this, recent work has explored the possibility of using LLMs as evaluators. While using a single LLM as an evaluation agent shows potential, it is filled with...

  • Chi-Min Chan, Weize Chen, Yusheng Su | Aug 14th, 2023 | preprint

    Text evaluation has historically posed significant challenges, often demanding substantial labor and time costs. With the emergence of large language models (LLMs), researchers have explored LLMs' potential as alternatives for human evaluation. While these single-agent-based approaches show promise, experimental results suggest that further advancements are needed to bridge the gap between their current effectiveness and human-level evaluation quality. Recognizing that best practices of human...
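
    The multi-agent recipe in this line of work gives several LLM "referees" distinct personas and aggregates their verdicts. A schematic sketch (call_llm is a hypothetical stand-in for any chat-completion client; the personas and the averaging rule are illustrative only):

        # Schematic multi-agent text evaluation with persona-conditioned judges.
        def call_llm(prompt: str) -> str:
            # hypothetical stub; replace with a real LLM client call
            raise NotImplementedError

        PERSONAS = ["a strict copy editor", "a subject-matter expert", "a casual reader"]

        def evaluate(text: str) -> float:
            scores = []
            for persona in PERSONAS:
                prompt = (f"You are {persona}. Rate the following text from 1 (poor) "
                          f"to 5 (excellent). Reply with a single number.\n\n{text}")
                scores.append(float(call_llm(prompt)))
            return sum(scores) / len(scores)  # simple average of judge verdicts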

  • Joshua Wilson, Fan Zhang, Corey Palermo,... | Apr 14th, 2024 | journalArticle

    This study examined middle school students' perceptions of an automated writing evaluation (AWE) system, MI Write. We summarize students' perceptions of MI Write's usability, usefulness, and desirability both quantitatively and qualitatively. We then estimate hierarchical entry regression models that account for district context, classroom climate, demographic factors (i.e., gender, special education status, limited English proficiency status, socioeconomic status, grade), students'...
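
    Hierarchical (blockwise) entry regression adds predictor blocks in a fixed order and reads off the explained variance gained at each step. A generic sketch with synthetic data (block names are placeholders, not the study's measures):

        # Blockwise "hierarchical entry" regression: track R^2 gained per block.
        import numpy as np
        from sklearn.linear_model import LinearRegression

        rng = np.random.default_rng(1)
        n = 200
        district = rng.normal(size=(n, 2))      # block 1: district context
        demographics = rng.normal(size=(n, 3))  # block 2: demographic factors
        y = district @ [0.5, -0.3] + demographics @ [0.2, 0.0, 0.4] + rng.normal(size=n)

        X, prev_r2 = None, 0.0
        for name, block in [("district", district), ("demographics", demographics)]:
            X = block if X is None else np.hstack([X, block])
            r2 = LinearRegression().fit(X, y).score(X, y)
            print(f"+ {name}: R^2 = {r2:.3f} (delta = {r2 - prev_r2:.3f})")
            prev_r2 = r2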

  • Xiner Liu, Andrés Zambrano, Ryan Baker | Mar 5th, 2025 | journalArticle

    This study explores the potential of the large language model GPT-4 as an automated tool for qualitative data analysis by educational researchers, exploring which techniques are most successful for different types of constructs. Specifically, we assess three different prompt engineering strategies (Zero-shot, Few-shot, and Few-shot with contextual information) as well as the use of embeddings. We do so in the context of qualitatively coding three distinct educational datasets: Algebra I...
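
    The three strategies differ only in how much the prompt carries: a bare code definition (Zero-shot), labeled examples (Few-shot), or examples plus surrounding context. A sketch of the prompt construction (the construct and examples are invented for illustration):

        # Zero-shot vs. few-shot prompts for qualitative coding of an utterance.
        DEFINITION = "Code 'help-seeking' if the student asks for assistance."

        EXAMPLES = [
            ("Can you show me how to factor this?", "help-seeking"),
            ("I already solved part b.", "not help-seeking"),
        ]

        def zero_shot(utterance: str) -> str:
            return f"{DEFINITION}\n\nUtterance: {utterance}\nLabel:"

        def few_shot(utterance: str, context: str = "") -> str:
            shots = "\n".join(f"Utterance: {u}\nLabel: {l}" for u, l in EXAMPLES)
            ctx = f"Context: {context}\n" if context else ""
            return f"{DEFINITION}\n\n{shots}\n\n{ctx}Utterance: {utterance}\nLabel:"

        print(few_shot("What do I do next?", context="Algebra I tutoring session"))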
