-
Fan Zhang, Joshua Wilson | Dec 14th, 2025 | conference paper
-
L2 Learner Cognitive Psychological Factors About Artificial Intelligence Writing Corrective Feedback | LiQin Wu, Yong Wu, XiangYang Zhang | Sep 26th, 2021 | journal article
Although research on artificial intelligence (AI) in language teaching and learning is increasingly prevalent, research on second-language (L2) learner cognitive psychological factors regarding AI writing corrective feedback (WCF) is scarce. This paper explores L2 learner cognitive psychology of pigai, an AI-based evaluation system for English writing in China, from the perspectives of perception, noticing, uptake, initiative, retention, and emotion. It investigates the consistency between learner...
-
Mengxue Zhang, Neil Heffernan, Andrew La... | Jun 1st, 2023 | preprint
Automated scoring of student responses to open-ended questions, including short-answer questions, has great potential to scale to a large number of responses. Recent approaches for automated scoring rely on supervised learning, i.e., training classifiers or fine-tuning language models on a small number of responses with human-provided score labels. However, since scoring is a subjective process, these human scores are noisy and can be highly variable, depending on the scorer. In this paper,...
-
Randy Elliot Bennett, Mo Zhang, Sandip S... | Dec 14th, 2021 | journal article
This study examined differences in the composition processes used by educationally at-risk males and females who wrote essays as part of a high-school equivalency examination. Over 30,000 individuals were assessed, each taking one of 12 forms of the examination’s language arts writing subtest in 23 US states. Writing processes were inferred using features extracted from keystroke logs and aggregated into seven composite indicators. Results showed that females earned higher essay and total...
-
Yi Zheng, Steven Nydick, Sijia Huang | Apr 12th, 2023 | conference paper
The recent surge of machine learning (ML) has impacted many disciplines, including educational and psychological measurement (hereafter shortened as measurement, “M”). The measurement literature has seen a rapid growth in studies that explore using ML methods to solve measurement problems. However, there exist gaps between the typical paradigm of ML and fundamental principles of measurement. The MxML project was created to explore how the measurement community might potentially redefine the...
-
Ziang Xiao, Susu Zhang, Vivian Lai | Oct 22nd, 2023 | preprint
We address a fundamental challenge in Natural Language Generation (NLG) model evaluation: the design and evaluation of evaluation metrics. Recognizing the limitations of existing automatic metrics and the noise in how current human evaluation is conducted, we propose MetricEval, a framework informed by measurement theory, the foundation of educational test design, for conceptualizing and evaluating the reliability and validity of NLG evaluation metrics. The framework formalizes the source...
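The reliability and validity framing in this entry maps onto familiar statistics. As a hedged illustration (a generic sketch of those measurement concepts, not the MetricEval framework itself), the snippet below treats validity as the correlation between an automatic metric's scores and human judgments, and reliability as the agreement between repeated runs of the metric on the same outputs; all data are placeholder values.

```python
# Generic illustration of reliability and validity checks for an NLG metric.
# This is NOT the MetricEval framework, only the textbook quantities
# (correlation-based validity, test-retest reliability) it builds on.
import numpy as np

rng = np.random.default_rng(0)

human = rng.uniform(1, 5, size=50)                   # placeholder human judgments
metric_run1 = human + rng.normal(0, 0.5, size=50)    # placeholder metric scores, run 1
metric_run2 = metric_run1 + rng.normal(0, 0.1, 50)   # same metric re-run on the same outputs

# Validity: does the metric track human judgments?
validity = np.corrcoef(metric_run1, human)[0, 1]

# Reliability: does the metric agree with itself across runs?
reliability = np.corrcoef(metric_run1, metric_run2)[0, 1]

print(f"validity (r with human): {validity:.3f}")
print(f"test-retest reliability: {reliability:.3f}")
```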
-
Cong Wang, Xiufeng Liu, Lei Wang | Apr 14th, 2021 | journal article
-
Cong Wang, Xiufeng Liu, Lei Wang | Sep 9th, 2020 | journal article
-
Rose E. Wang, Qingyang Zhang, Carly Robi... | Dec 14th, 2024 | preprint
Scaling high-quality tutoring remains a major challenge in education. Due to growing demand, many platforms employ novice tutors who, unlike experienced educators, struggle to address student mistakes and thus fail to seize prime learning opportunities. Our work explores the potential of large language models (LLMs) to close the novice-expert knowledge gap in remediating math mistakes. We contribute Bridge, a method that uses cognitive task analysis to translate an expert's latent thought...
-
Tianyi Zhang, Varsha Kishore, Felix Wu | Feb 24th, 2020 | preprint
We propose BERTScore, an automatic evaluation metric for text generation. Analogously to common metrics, BERTScore computes a similarity score for each token in the candidate sentence with each token in the reference sentence. However, instead of exact matches, we compute token similarity using contextual embeddings. We evaluate using the outputs of 363 machine translation and image captioning systems. BERTScore correlates better with human judgments and provides stronger model selection...
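The matching scheme described in this abstract (each candidate token aligned to its most similar reference token, and vice versa) can be sketched directly. The snippet below uses placeholder embedding matrices rather than actual BERT contextual embeddings and omits refinements such as IDF weighting and baseline rescaling, so treat it as an illustration of the greedy-matching idea rather than the released BERTScore implementation.

```python
# Greedy-matching similarity in the spirit of BERTScore, on placeholder embeddings.
# Real BERTScore uses contextual BERT embeddings, IDF weighting, and rescaling.
import numpy as np

def greedy_match_f1(cand_emb: np.ndarray, ref_emb: np.ndarray) -> float:
    """cand_emb: (n_cand, d), ref_emb: (n_ref, d); rows are token embeddings."""
    # Normalize rows so dot products are cosine similarities.
    cand = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    ref = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = cand @ ref.T                   # (n_cand, n_ref) cosine similarities

    precision = sim.max(axis=1).mean()   # each candidate token -> best reference token
    recall = sim.max(axis=0).mean()      # each reference token -> best candidate token
    return 2 * precision * recall / (precision + recall)

rng = np.random.default_rng(0)
cand_emb = rng.normal(size=(7, 768))     # placeholder "contextual" embeddings
ref_emb = rng.normal(size=(9, 768))
print(f"greedy-match F1: {greedy_match_f1(cand_emb, ref_emb):.3f}")
```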
-
Han Yang, Mingchen Li, Huixue Zhou | Dec 24th, 2023 | preprint
To enhance the accuracy and reliability of diverse medical question-answering (QA) tasks and to investigate efficient approaches for deploying Large Language Model (LLM) technologies, we developed a novel ensemble learning pipeline utilizing state-of-the-art LLMs, focusing on improving performance on diverse medical QA datasets. Materials and Methods: Our study employs three medical QA datasets: PubMedQA, MedQA-USMLE, and MedMCQA, each presenting unique challenges in biomedical...
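One common way to ensemble several LLMs on multiple-choice QA, and a plausible baseline for a pipeline like the one this entry describes, is majority voting over the answers each model returns. The sketch below assumes the per-model answers have already been collected; the model names and answers are placeholders, not the paper's actual configuration.

```python
# Hedged sketch: majority-vote ensembling of multiple-choice answers from several LLMs.
# The answer letters below are placeholders; in practice each would come from a model call.
from collections import Counter

def majority_vote(answers: dict[str, str]) -> str:
    """answers maps model name -> chosen option letter; ties broken by first seen."""
    counts = Counter(answers.values())
    return counts.most_common(1)[0][0]

per_model_answers = {
    "model_a": "C",   # hypothetical outputs for one MedMCQA-style question
    "model_b": "C",
    "model_c": "B",
}
print("ensemble answer:", majority_vote(per_model_answers))  # -> "C"
```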
-
Changrong Xiao, Wenxing Ma, Qingping Son... | Mar 3rd, 2025 | preprint
Receiving timely and personalized feedback is essential for second-language learners, especially when human instructors are unavailable. This study explores the effectiveness of Large Language Models (LLMs), including both proprietary and open-source models, for Automated Essay Scoring (AES). Through extensive experiments with public and private datasets, we find that while LLMs do not surpass conventional state-of-the-art (SOTA) grading models in performance, they exhibit notable...
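A minimal prompt-based scoring setup of the kind such AES studies evaluate can be sketched as follows. The rubric text, score range, and the call_llm function are placeholders for whatever model and prompts the study actually used; only the parsing of a numeric score from the reply is shown concretely.

```python
# Hedged sketch of prompt-based essay scoring with an LLM.
# `call_llm` is a placeholder for a real model call (proprietary or open-source).
import re

RUBRIC = "Score the essay from 1 (poor) to 6 (excellent) for content, organization, and language."

def build_prompt(essay: str) -> str:
    return f"{RUBRIC}\n\nEssay:\n{essay}\n\nReply with a single integer score."

def parse_score(reply: str) -> int | None:
    # Extract the first digit in the allowed range from the model's reply.
    match = re.search(r"\b([1-6])\b", reply)
    return int(match.group(1)) if match else None

def call_llm(prompt: str) -> str:
    # Placeholder: substitute an actual API or local model here.
    return "4"

essay_text = "Example essay text..."
print("predicted score:", parse_score(call_llm(build_prompt(essay_text))))
```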
-
Yu Li, Shenyu Zhang, Rui Wu | Dec 14th, 2024 | preprint
Recent advancements in generative Large Language Models (LLMs) have been remarkable; however, the quality of the text generated by these models often reveals persistent issues. Evaluating the quality of text generated by these models, especially open-ended text, has consistently presented a significant challenge. Addressing this, recent work has explored the possibility of using LLMs as evaluators. While using a single LLM as an evaluation agent shows potential, it is filled with...
-
Chi-Min Chan, Weize Chen, Yusheng Su | Aug 14th, 2023 | preprint
Text evaluation has historically posed significant challenges, often demanding substantial labor and time cost. With the emergence of large language models (LLMs), researchers have explored LLMs' potential as alternatives for human evaluation. While these single-agent-based approaches show promise, experimental results suggest that further advancements are needed to bridge the gap between their current effectiveness and human-level evaluation quality. Recognizing that best practices of human...
-
Joshua Wilson, Fan Zhang, Corey Palermo, ... | Apr 14th, 2024 | journal article
This study examined middle school students' perceptions of an automated writing evaluation (AWE) system, MI Write. We summarize students' perceptions of MI Write's usability, usefulness, and desirability both quantitatively and qualitatively. We then estimate hierarchical entry regression models that account for district context, classroom climate, demographic factors (i.e., gender, special education status, limited English proficiency status, socioeconomic status, grade), students'...
-
Xiner Liu, Andrés Zambrano, Ryan Baker | Mar 5th, 2025 | journal article
This study explores the potential of the large language model GPT-4 as an automated tool for qualitative data analysis by educational researchers, examining which techniques are most successful for different types of constructs. Specifically, we assess three different prompt engineering strategies (Zero-shot, Few-shot, and Few-shot with contextual information) as well as the use of embeddings. We do so in the context of qualitatively coding three distinct educational datasets: Algebra I...
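The three prompt strategies named in this entry differ mainly in how much guidance the prompt carries. The sketch below shows one plausible way to assemble zero-shot, few-shot, and few-shot-with-context prompts for coding a single utterance; the codebook, examples, and context strings are invented placeholders, not the study's materials.

```python
# Hedged sketch of the three prompting strategies named in the abstract:
# zero-shot, few-shot, and few-shot with contextual information.
CODEBOOK = "Label the utterance as ON_TASK, OFF_TASK, or HELP_SEEKING."

FEW_SHOT_EXAMPLES = [
    ("Can you explain step 2 again?", "HELP_SEEKING"),   # placeholder examples
    ("I finished the worksheet.", "ON_TASK"),
]

def zero_shot(utterance: str) -> str:
    return f"{CODEBOOK}\nUtterance: {utterance}\nLabel:"

def few_shot(utterance: str) -> str:
    shots = "\n".join(f"Utterance: {u}\nLabel: {c}" for u, c in FEW_SHOT_EXAMPLES)
    return f"{CODEBOOK}\n{shots}\nUtterance: {utterance}\nLabel:"

def few_shot_with_context(utterance: str, context: str) -> str:
    # Prepend contextual information (e.g., lesson or session metadata) to the few-shot prompt.
    return f"Context: {context}\n{few_shot(utterance)}"

print(few_shot_with_context("What does x stand for?", "Algebra I tutoring session"))
```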