11 resources
-
Using Multi-label Classification Neural Network to Detect Intersectional DIF with Small Sample Sizes
Yale Quan, Chun Wang|Jan 22nd, 2025|journalArticle
This study introduces InterDIFNet, a multilabel classification neural network for detecting intersectional differential item functioning (DIF) in educational and psychological assessments, with a focus on small sample sizes. Unlike traditional marginal DIF methods, which often fail to capture the effects of intersecting identities and require large samples, InterDIFNet models uniform and non-uniform DIF across multiple intersectional groups simultaneously. The method utilizes an optimized...
-
Jin Wang, Wenxiang Fan|May 6th, 2025|journalArticle
-
Yu Wang, Madhumitha Gopalakrishnan, Yoav...|Jan 22nd, 2025|conferencePaper
-
Ting Wang, Ying Du, Karen Hoeve|Jan 22nd, 2025|presentation
-
Ting Wang, Ying Du, Karen Hoeve|Jan 22nd, 2025|conferencePaper
-
Yu Wang, Madhu Gopalakrishnan, Yoav Berg...|Jan 22nd, 2025|presentation
-
Changrong Xiao, Wenxing Ma, Qingping Son...|Mar 3rd, 2025|preprint
Receiving timely and personalized feedback is essential for second-language learners, especially when human instructors are unavailable. This study explores the effectiveness of Large Language Models (LLMs), including both proprietary and open-source models, for Automated Essay Scoring (AES). Through extensive experiments with public and private datasets, we find that while LLMs do not surpass conventional state-of-the-art (SOTA) grading models in performance, they exhibit notable...
-
Changrong Xiao, Wenxing Ma, Qingping Son...|Mar 3rd, 2025|preprint
Receiving timely and personalized feedback is essential for second-language learners, especially when human instructors are unavailable. This study explores the effectiveness of Large Language Models (LLMs), including both proprietary and open-source models, for Automated Essay Scoring (AES). Through extensive experiments with public and private datasets, we find that while LLMs do not surpass conventional state-of-the-art (SOTA) grading models in performance, they exhibit notable...
-
Valentin Hofmann, David Heineman, Ian Ma...|Sep 14th, 2025|preprint
Language model (LM) benchmarking faces several challenges: comprehensive evaluations are costly, benchmarks often fail to measure the intended capabilities, and evaluation quality can degrade due to labeling errors and benchmark saturation. Although various strategies have been proposed to mitigate these issues, they tend to address individual aspects in isolation, neglecting broader questions about overall evaluation quality. Here, we introduce FLUID BENCHMARKING, a new evaluation approach...
-
Lei Huang, Weijiang Yu, Weitao Ma|Jan 24th, 2025|preprint
The emergence of large language models (LLMs) has marked a significant breakthrough in natural language processing (NLP), fueling a paradigm shift in information acquisition. Nevertheless, LLMs are prone to hallucination, generating plausible yet nonfactual content. This phenomenon raises significant concerns over the reliability of LLMs in real-world information retrieval (IR) systems and has attracted intensive research to detect and mitigate such hallucinations. Given the open-ended...
-
Tejal Patwardhan, Rachel Dias, Elizabeth...|Oct 5th, 2025|preprint
We introduce GDPval, a benchmark evaluating AI model capabilities on real-world economically valuable tasks. GDPval covers the majority of U.S. Bureau of Labor Statistics Work Activities for 44 occupations across the top 9 sectors contributing to U.S. GDP (Gross Domestic Product). Tasks are constructed from the representative work of industry professionals with an average of 14 years of experience. We find that frontier model performance on GDPval is improving roughly linearly over time, and...