702 resources
-
Nigel Fernandez, Aritra Ghosh, Naiming L... | Oct 28th, 2022 | preprint
Automated scoring of open-ended student responses has the potential to significantly reduce human grader effort. Recent advances in automated scoring often leverage textual representations based on pre-trained language models such as BERT and GPT as input to scoring models. Most existing approaches train a separate model for each item/question, which is suitable for scenarios such as essay scoring where items can be quite different from one another. However, these approaches have two...
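A minimal sketch of the general setup this abstract describes: a pre-trained language model representation of a student response fed into a per-item scoring head. The model checkpoint, pooling choice, and head architecture below are illustrative assumptions, not the authors' implementation.

```python
# Illustrative only: encode a student response with a pre-trained LM and score
# it with a per-item regression head. Model name, pooling, and head are
# assumptions; the head would need to be trained on human-scored responses.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

# One scoring head per item/question, matching the per-item training setup
# the abstract refers to.
item_heads = {"item_1": torch.nn.Linear(encoder.config.hidden_size, 1)}

def score_response(item_id: str, response: str) -> float:
    inputs = tokenizer(response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state   # (1, seq_len, hidden_dim)
    pooled = hidden[:, 0, :]                           # [CLS] token representation
    return item_heads[item_id](pooled).item()          # untrained head, so an arbitrary value

print(score_response("item_1", "Photosynthesis converts light into chemical energy."))
```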
-
Steve Ferrara, Saed Qunbar | Sep 28th, 2022 | journal article
In this article, we argue that automated scoring engines should be transparent and construct relevant—that is, as much as is currently feasible. Many current automated scoring engines cannot achieve high degrees of scoring accuracy without allowing in some features that may not be easily explained and understood and may not be obviously and directly relevant to the target assessment construct. We address the current limitations on evidence and validity arguments for...
-
Yong He, Shumin Jing, Y Lu | Oct 28th, 2022 | conference paper
-
A. Corinne Huggins-Manley, Brandon M. Bo... | Sep 28th, 2022 | journal article
The field of educational measurement places validity and fairness as central concepts of assessment quality. Prior research has proposed embedding fairness arguments within argument-based validity processes, particularly when fairness is conceived as comparability in assessment properties across groups. However, we argue that a more flexible approach to fairness arguments that occurs outside of and complementary to validity arguments is required to address many of the...
-
Matthew S. Johnson, Xiang Liu, Daniel F.... | Sep 28th, 2022 | journal article
-
Susan Lottridge, Mackenzie Young | Oct 28th, 2022 | conference paper
The use of automated scoring (AS) of constructed responses has become increasingly common in K-12 formative, interim, and summative assessment programs. AS has been shown to perform well in essay writing, reading comprehension, and mathematics. However, less is known about how automated scoring engines perform for key subgroups such as gender, race/ethnicity, English proficiency status, disability status, and economic status. Bias evaluations have focused primarily on mean score...
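A small sketch of one way such a subgroup check can be run, comparing machine and human scores by subgroup. The standardized-mean-difference metric and the column names below are assumptions for illustration, not the paper's exact procedure.

```python
# Illustrative only: mean machine-human score gap per subgroup, expressed in
# pooled standard-deviation units. Metric choice and column names are assumed.
import pandas as pd

def standardized_mean_difference(df: pd.DataFrame, group_col: str) -> pd.Series:
    """Mean (machine - human) score difference per subgroup, scaled by the pooled SD."""
    diff = df["machine_score"] - df["human_score"]
    return diff.groupby(df[group_col]).mean() / diff.std(ddof=1)

scores = pd.DataFrame({
    "human_score":   [2, 3, 1, 4, 2, 3],
    "machine_score": [2, 3, 2, 4, 1, 3],
    "gender":        ["F", "F", "F", "M", "M", "M"],
})
print(standardized_mean_difference(scores, "gender"))
```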
-
Christopher Ormerod | Oct 28th, 2022 | journal article
We introduce a regression-based framework to explore the dependence that global features have on score predictions from pretrained transformer-based language models used for Automated Essay Scoring (AES). We demonstrate that neural networks use approximations of rubric-relevant global features to determine a score prediction. By considering linear models on the hidden states, we can approximate global features and measure their importance to score predictions. This study uses DeBERTa models...
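A rough sketch of the underlying idea of fitting a linear model on hidden states to approximate a global feature (word count is used here). The checkpoint, pooling, and choice of feature are assumptions; this is not the paper's regression-based framework.

```python
# Illustrative only: a linear probe from transformer hidden states to a global
# feature (word count). The checkpoint, pooling, and feature are assumptions.
import numpy as np
import torch
from sklearn.linear_model import LinearRegression
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-base")
model = AutoModel.from_pretrained("microsoft/deberta-base")

essays = [
    "Short answer.",
    "A somewhat longer response with a few more words in it.",
    "An even longer response that keeps going in order to vary the length feature.",
]

# Mean-pooled hidden states as the representation to probe.
reps = []
for text in essays:
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        reps.append(model(**inputs).last_hidden_state.mean(dim=1).squeeze(0).numpy())

word_counts = np.array([len(e.split()) for e in essays])     # the "global feature"
probe = LinearRegression().fit(np.stack(reps), word_counts)
print("Probe R^2 for word count:", probe.score(np.stack(reps), word_counts))
```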
-
Maria Mercedes Rodrigo, Noburu Matsuda, ... | Oct 28th, 2022 | book
-
Shiki Sato, Yosuke Kishinami, Hiroaki Su... | Oct 28th, 2022 | conference paper
Automation of dialogue system evaluation is a driving force for the efficient development of dialogue systems. This paper introduces the bipartite-play method, a dialogue collection method for automating dialogue system evaluation. It addresses the limitations of existing dialogue collection methods: (i) inability to compare with systems that are not publicly available, and (ii) vulnerability to cheating by intentionally selecting systems to be compared. Experimental results show that the...
-
Zachari Swiecki, Hassan Khosravi, Guanli... | Oct 28th, 2022 | journal article
-
Shunya Takano, Osamu Ichikawa | Oct 28th, 2022 | conference paper
-
Ruben van Genugten, Daniel L Schacter | Oct 28th, 2022 | journal article
-
Kafeng Wang, Pengyang Wang, Chengzhong x... | Oct 28th, 2022 | journal article
Automated Feature Engineering (AFE) refers to automatically generating and selecting optimal feature sets for downstream tasks, and it has achieved great success in real-world applications. Current AFE methods mainly focus on improving the effectiveness of the produced features but ignore the low-efficiency issue for large-scale deployment. Therefore, in this work, we propose a generic framework to improve the efficiency of AFE. Specifically, we construct the AFE pipeline based on reinforcement...
-
Zichao Wang, Jakob Valdez, Debshila Basu... | Oct 28th, 2022 | book section
-
Ying Xu, Dakuo Wang, Mo Yu | Oct 28th, 2022 | journal article
Question answering (QA) is a fundamental means to facilitate assessment and training of narrative comprehension skills for both machines and young children, yet there is a scarcity of high-quality QA datasets carefully designed to serve this purpose. In particular, existing datasets rarely distinguish fine-grained reading skills, such as the understanding of varying narrative elements. Drawing on reading education research, we introduce FairytaleQA, a dataset focusing on narrative...
-
Ming Zhong, Yang Liu, Da Yin | Oct 28th, 2022 | preprint
Multi-dimensional evaluation is the dominant paradigm for human evaluation in Natural Language Generation (NLG), i.e., evaluating the generated text from multiple explainable dimensions, such as coherence and fluency. However, automatic evaluation in NLG is still dominated by similarity-based metrics, and we lack a reliable framework for a more comprehensive evaluation of advanced models. In this paper, we propose a unified multi-dimensional evaluator UniEval for NLG. We re-frame NLG...
-
Masaki Uto, Masashi Okano | Dec 1st, 2021 | journal article