32 resources

  • Fan Zhang, Joshua Wilson | Dec 14th, 2025 | conferencePaper
  • LiQin Wu, Yong Wu, XiangYang Zhang | Sep 26th, 2021 | journalArticle

    Although the study of artificial intelligence (AI) in language teaching and learning is increasingly prevalent, research on second-language (L2) learners' cognitive and psychological factors regarding AI writing corrective feedback (WCF) is scarce. This paper explores L2 learners' cognitive psychology of pigai, an AI evaluation system for English writing in China, from the perspectives of perception, noticing, uptake, initiative, retention, and emotion. It investigates the consistency between learner...

  • Mengxue Zhang, Neil Heffernan, Andrew La... | Jun 1st, 2023 | preprint

    Automated scoring of student responses to open-ended questions, including short-answer questions, has great potential to scale to a large number of responses. Recent approaches for automated scoring rely on supervised learning, i.e., training classifiers or fine-tuning language models on a small number of responses with human-provided score labels. However, since scoring is a subjective process, these human scores are noisy and can be highly variable, depending on the scorer. In this paper,...
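
    The supervised route described here amounts to training a text classifier on human-scored responses. A minimal sketch in Python (the data, score scale, and TF-IDF/logistic-regression choice are illustrative assumptions, not details from the paper):

        # Minimal supervised short-answer scorer: TF-IDF features plus
        # logistic regression over hypothetical human-labeled responses.
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.linear_model import LogisticRegression
        from sklearn.pipeline import make_pipeline

        responses = [
            "photosynthesis converts light energy into chemical energy",
            "plants eat sunlight",
            "the mitochondria is the powerhouse of the cell",
            "light energy becomes glucose via photosynthesis",
        ]
        human_scores = [2, 1, 0, 2]  # noisy human labels, scale 0-2

        scorer = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
        scorer.fit(responses, human_scores)
        print(scorer.predict(["photosynthesis turns light into chemical energy"]))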

  • Randy Elliot Bennett, Mo Zhang, Sandip S... | Dec 14th, 2021 | journalArticle

    This study examined differences in the composition processes used by educationally at-risk males and females who wrote essays as part of a high-school equivalency examination. Over 30,000 individuals were assessed, each taking one of 12 forms of the examination’s language arts writing subtest in 23 US states. Writing processes were inferred using features extracted from keystroke logs and aggregated into seven composite indicators. Results showed that females earned higher essay and total...
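
    Process indicators of this kind are typically derived from inter-keystroke intervals. A sketch of the idea (the log format and the two-second pause threshold are assumptions, not details from the study):

        # Simple writing-process features from a hypothetical keystroke log
        # of (timestamp_seconds, key) events.
        log = [(0.0, "T"), (0.2, "h"), (0.35, "e"), (3.1, " "), (3.3, "c"), (8.0, "a")]

        timestamps = [t for t, _ in log]
        intervals = [b - a for a, b in zip(timestamps, timestamps[1:])]

        features = {
            "total_time_s": timestamps[-1] - timestamps[0],
            "mean_interval_s": sum(intervals) / len(intervals),
            # long pauses are often read as planning or revision episodes
            "long_pauses": sum(1 for d in intervals if d >= 2.0),
        }
        print(features)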

  • Yi Zheng, Steven Nydick, Sijia Huang | Apr 12th, 2023 | conferencePaper

    The recent surge of machine learning (ML) has impacted many disciplines, including educational and psychological measurement (hereafter shortened as measurement, “M”). The measurement literature has seen a rapid growth in studies that explore using ML methods to solve measurement problems. However, there exist gaps between the typical paradigm of ML and fundamental principles of measurement. The MxML project was created to explore how the measurement community might potentially redefine the...

  • Ziang Xiao, Susu Zhang, Vivian Lai | Oct 22nd, 2023 | preprint

    We address a fundamental challenge in Natural Language Generation (NLG) model evaluation -- the design and evaluation of evaluation metrics. Recognizing the limitations of existing automatic metrics and the noise introduced by how current human evaluation is conducted, we propose MetricEval, a framework informed by measurement theory, the foundation of educational test design, for conceptualizing and evaluating the reliability and validity of NLG evaluation metrics. The framework formalizes the source...
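
    In measurement terms, validity evidence for a metric is often summarized as its correlation with human judgments, and reliability as the stability of scores across repeated ratings. A toy illustration of those two checks (generic measurement practice, not the MetricEval framework itself; the numbers are invented):

        # Toy validity/reliability check for an NLG metric against human ratings.
        from scipy.stats import pearsonr, spearmanr

        metric_scores = [0.81, 0.42, 0.65, 0.90, 0.33]  # metric on 5 outputs
        human_round_1 = [4.5, 2.0, 3.5, 5.0, 1.5]       # mean human ratings
        human_round_2 = [4.0, 2.5, 3.5, 5.0, 2.0]       # second rating round

        validity_r, _ = pearsonr(metric_scores, human_round_1)    # convergent validity
        validity_rho, _ = spearmanr(metric_scores, human_round_1) # rank agreement
        reliability_r, _ = pearsonr(human_round_1, human_round_2) # test-retest proxy
        print(f"validity r={validity_r:.2f}, rho={validity_rho:.2f}; "
              f"human test-retest r={reliability_r:.2f}")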

  • Cong Wang, Xiufeng Liu, Lei Wang | Apr 14th, 2021 | journalArticle
  • Cong Wang, Xiufeng Liu, Lei Wang | Sep 9th, 2020 | journalArticle
  • Rose E. Wang, Qingyang Zhang, Carly Robi... | Dec 14th, 2024 | preprint

    Scaling high-quality tutoring remains a major challenge in education. Due to growing demand, many platforms employ novice tutors who, unlike experienced educators, struggle to address student mistakes and thus fail to seize prime learning opportunities. Our work explores the potential of large language models (LLMs) to close the novice-expert knowledge gap in remediating math mistakes. We contribute Bridge, a method that uses cognitive task analysis to translate an expert's latent thought...

  • Tianyi Zhang, Varsha Kishore, Felix Wu | Feb 24th, 2020 | preprint

    We propose BERTScore, an automatic evaluation metric for text generation. Analogously to common metrics, BERTScore computes a similarity score for each token in the candidate sentence with each token in the reference sentence. However, instead of exact matches, we compute token similarity using contextual embeddings. We evaluate using the outputs of 363 machine translation and image captioning systems. BERTScore correlates better with human judgments and provides stronger model selection...
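
    The core computation is a greedy soft match over contextual token embeddings. A self-contained sketch of that mechanic (random vectors stand in for BERT embeddings; the paper's idf weighting and baseline rescaling are omitted):

        # BERTScore-style greedy matching over token embeddings.
        import numpy as np

        rng = np.random.default_rng(0)
        cand = rng.normal(size=(5, 768))  # candidate sentence: 5 token embeddings
        ref = rng.normal(size=(6, 768))   # reference sentence: 6 token embeddings

        # cosine similarity between every candidate/reference token pair
        cand_n = cand / np.linalg.norm(cand, axis=1, keepdims=True)
        ref_n = ref / np.linalg.norm(ref, axis=1, keepdims=True)
        sim = cand_n @ ref_n.T  # shape (5, 6)

        precision = sim.max(axis=1).mean()  # each candidate token's best match
        recall = sim.max(axis=0).mean()     # each reference token's best match
        f1 = 2 * precision * recall / (precision + recall)
        print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")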

  • Han Yang, Mingchen Li, Huixue Zhou | Dec 24th, 2023 | preprint

    To enhance the accuracy and reliability of diverse medical question-answering (QA) tasks and to investigate efficient approaches for deploying large language model (LLM) technologies, we developed a novel ensemble learning pipeline built on state-of-the-art LLMs, focusing on improving performance across diverse medical QA datasets. Materials and Methods: Our study employs three medical QA datasets: PubMedQA, MedQA-USMLE, and MedMCQA, each presenting unique challenges in biomedical...
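
    For multiple-choice sets like MedQA-USMLE and MedMCQA, one common ensembling rule is a majority vote over the answers of several models. A minimal sketch (the voting rule is a generic choice, not necessarily the paper's pipeline; the answers are invented):

        # Majority-vote ensemble over per-model answers to multiple-choice items.
        from collections import Counter

        model_answers = {
            "llm_a": ["A", "C", "B", "D"],
            "llm_b": ["A", "B", "B", "D"],
            "llm_c": ["C", "C", "B", "A"],
        }

        ensembled = []
        for per_item in zip(*model_answers.values()):
            # most_common(1) breaks ties by first-seen order
            ensembled.append(Counter(per_item).most_common(1)[0][0])
        print(ensembled)  # ['A', 'C', 'B', 'D']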

  • Changrong Xiao, Wenxing Ma, Qingping Son... | Mar 3rd, 2025 | preprint

    Receiving timely and personalized feedback is essential for second-language learners, especially when human instructors are unavailable. This study explores the effectiveness of Large Language Models (LLMs), including both proprietary and open-source models, for Automated Essay Scoring (AES). Through extensive experiments with public and private datasets, we find that while LLMs do not surpass conventional state-of-the-art (SOTA) grading models in performance, they exhibit notable...
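
    Such comparisons between LLM graders and conventional AES models are commonly reported with quadratic weighted kappa (QWK), which penalizes disagreements by their squared distance on the score scale. A compact implementation (standard AES practice; the paper's exact evaluation setup is not shown here):

        # Quadratic weighted kappa between human and model essay scores.
        import numpy as np

        def quadratic_weighted_kappa(a, b, n_classes):
            a, b = np.asarray(a), np.asarray(b)
            observed = np.zeros((n_classes, n_classes))
            for i, j in zip(a, b):
                observed[i, j] += 1
            expected = np.outer(np.bincount(a, minlength=n_classes),
                                np.bincount(b, minlength=n_classes)) / len(a)
            idx = np.arange(n_classes)
            weights = (idx[:, None] - idx[None, :]) ** 2 / (n_classes - 1) ** 2
            return 1 - (weights * observed).sum() / (weights * expected).sum()

        human = [2, 3, 1, 4, 2, 3]
        model = [2, 3, 2, 4, 1, 3]
        print(round(quadratic_weighted_kappa(human, model, n_classes=5), 3))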

  • Yu Li, Shenyu Zhang, Rui Wu | Dec 14th, 2024 | preprint

    Recent advancements in generative large language models (LLMs) have been remarkable; however, the quality of the text generated by these models often reveals persistent issues. Evaluating the quality of text generated by these models, especially open-ended text, has consistently presented a significant challenge. Addressing this, recent work has explored the possibility of using LLMs as evaluators. While using a single LLM as an evaluation agent shows potential, it is filled with...

  • Chi-Min Chan, Weize Chen, Yusheng Su | Aug 14th, 2023 | preprint

    Text evaluation has historically posed significant challenges, often demanding substantial labor and time costs. With the emergence of large language models (LLMs), researchers have explored LLMs' potential as alternatives for human evaluation. While these single-agent-based approaches show promise, experimental results suggest that further advancements are needed to bridge the gap between their current effectiveness and human-level evaluation quality. Recognizing that best practices of human...
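
    The multi-agent recipe in this line of work gives several LLM "referees" distinct personas and aggregates their verdicts. A schematic sketch (call_llm is a hypothetical stand-in for any chat-completion client; the personas and the averaging rule are illustrative only):

        # Schematic multi-agent text evaluation with persona-conditioned judges.
        def call_llm(prompt: str) -> str:
            # hypothetical stub; replace with a real LLM client call
            raise NotImplementedError

        PERSONAS = ["a strict copy editor", "a subject-matter expert", "a casual reader"]

        def evaluate(text: str) -> float:
            scores = []
            for persona in PERSONAS:
                prompt = (f"You are {persona}. Rate the following text from 1 (poor) "
                          f"to 5 (excellent). Reply with a single number.\n\n{text}")
                scores.append(float(call_llm(prompt)))
            return sum(scores) / len(scores)  # simple average of judge verdicts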

  • Joshua Wilson, Fan Zhang, Corey Palermo,... | Apr 14th, 2024 | journalArticle

    This study examined middle school students' perceptions of an automated writing evaluation (AWE) system, MI Write. We summarize students' perceptions of MI Write's usability, usefulness, and desirability both quantitatively and qualitatively. We then estimate hierarchical entry regression models that account for district context, classroom climate, demographic factors (i.e., gender, special education status, limited English proficiency status, socioeconomic status, grade), students'...
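
    Hierarchical (blockwise) entry regression adds predictor blocks in a fixed order and reads off the explained variance gained at each step. A generic sketch with synthetic data (block names are placeholders, not the study's measures):

        # Blockwise "hierarchical entry" regression: track R^2 gained per block.
        import numpy as np
        from sklearn.linear_model import LinearRegression

        rng = np.random.default_rng(1)
        n = 200
        district = rng.normal(size=(n, 2))      # block 1: district context
        demographics = rng.normal(size=(n, 3))  # block 2: demographic factors
        y = district @ [0.5, -0.3] + demographics @ [0.2, 0.0, 0.4] + rng.normal(size=n)

        X, prev_r2 = None, 0.0
        for name, block in [("district", district), ("demographics", demographics)]:
            X = block if X is None else np.hstack([X, block])
            r2 = LinearRegression().fit(X, y).score(X, y)
            print(f"+ {name}: R^2 = {r2:.3f} (delta = {r2 - prev_r2:.3f})")
            prev_r2 = r2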

  • Xiner Liu, Andrés Zambrano, Ryan Baker | Mar 5th, 2025 | journalArticle

    This study explores the potential of the large language model GPT-4 as an automated tool for qualitative data analysis by educational researchers, exploring which techniques are most successful for different types of constructs. Specifically, we assess three different prompt engineering strategies (Zero-shot, Few-shot, and Few-shot with contextual information) as well as the use of embeddings. We do so in the context of qualitatively coding three distinct educational datasets: Algebra I...
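
    The three strategies differ only in how much the prompt carries: a bare code definition (Zero-shot), labeled examples (Few-shot), or examples plus surrounding context. A sketch of the prompt construction (the construct and examples are invented for illustration):

        # Zero-shot vs. few-shot prompts for qualitative coding of an utterance.
        DEFINITION = "Code 'help-seeking' if the student asks for assistance."

        EXAMPLES = [
            ("Can you show me how to factor this?", "help-seeking"),
            ("I already solved part b.", "not help-seeking"),
        ]

        def zero_shot(utterance: str) -> str:
            return f"{DEFINITION}\n\nUtterance: {utterance}\nLabel:"

        def few_shot(utterance: str, context: str = "") -> str:
            shots = "\n".join(f"Utterance: {u}\nLabel: {l}" for u, l in EXAMPLES)
            ctx = f"Context: {context}\n" if context else ""
            return f"{DEFINITION}\n\n{shots}\n\n{ctx}Utterance: {utterance}\nLabel:"

        print(few_shot("What do I do next?", context="Algebra I tutoring session"))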
