Full Library – Evidence Library – Artificial Intelligence in Measurement and Education

Building Better Open-Source Tools to Support Fairness in Automated Scoring

Nitin Madnani, Anastassia Loukina, Alina... | Mar 17th, 2017 | conferencePaper

Exploring Automated Essay Scoring for Nonnative English Speakers

Amber Nigam | Mar 17th, 2017 | conferencePaper

Automated Essay Scoring (AES) has been quite popular and is being widely used. However, lack of appropriate methodology for rating nonnative English speakers' essays has meant a lopsided advancement in this field. In this paper, we report initial results of our experiments with nonnative AES that learns from manual evaluation of nonnative essays. For this purpose, we conducted an exercise in which essays written by nonnative English speakers in test environment were rated both manually and...

Investigating neural architectures for short answer scoring

Brian Riordan, Andrea Horbach, Aoife Cah... | Mar 17th, 2017 | conferencePaper

How Does Predicate Invention Affect Human Comprehensibility?

Ute Schmid, Christina Zeller, Tarek Beso... | Mar 17th, 2017 | bookSection

Associated effects of automated essay evaluation software on growth in writing quality for students with and without disabilities

Joshua Wilson | Apr 17th, 2017 | journalArticle

Combining Multiple Corpora for Readability Assessment for People with Cognitive Disabilities

Victoria Yaneva, Constantin Orasan, Rich... | Mar 17th, 2017 | conferencePaper

Monitoring the performance of human and automated scores for spoken responses

Zhen Wang, Klaus Zechner, Yu Sun | Dec 19th, 2016 | journalArticle

As automated scoring systems for spoken responses are increasingly used in language assessments, testing organizations need to analyze their performance, as compared to human raters, across several dimensions, for example, on individual items or based on subgroups of test takers. In addition, there is a need in testing organizations to establish rigorous procedures for monitoring the performance of both human and automated scoring processes during operational administrations. This paper...

Comparing Human and Automated Essay Scoring for Prospective Graduate Students With Learning Disabilities and/or ADHD

Heather Buzick, Maria Elena Oliveri, Yig... | Jul 2nd, 2016 | journalArticle

Creating a Next‐Generation System of K–12 English Learner Language Proficiency Assessments

Maurice Cogan Hauck, Mikyung Kim Wolf, R... | Apr 4th, 2016 | journalArticle

This paper is the first in a series from Educational Testing Service (ETS) concerning English language proficiency (ELP) assessments for K–12 English learners (ELs). The goal of this paper, and the series, is to present research‐based ideas, principles, and recommendations for consideration by those who are conceptualizing, developing, and implementing ELP assessments for K–12 ELs and by all stakeholders in their education and assessment. We also hope to contribute to the active current...

Analyzing the Behavior of Visual Question Answering Models

Aishwarya Agrawal, Dhruv Batra, Devi Par... | Mar 17th, 2016 | journalArticle

Recently, a number of deep-learning based models have been proposed for the task of Visual Question Answering (VQA). The performance of most models is clustered around 60-70%. In this paper we propose systematic methods to analyze the behavior of these models as a first step towards recognizing their strengths and weaknesses, and identifying the most fruitful directions for progress. We analyze two models, one each from two major classes of VQA models -- with-attention and without-attention...

Automatic Generation of Context-Based Fill-in-the-Blank Exercises Using Co-occurrence Likelihoods and Google n-grams

Jennifer Hill, Rahul Simha | Mar 17th, 2016 | conferencePaper

Automated Essay Evaluation for English Language Learners: A Case Study of MY Access

Giang Thi Linh Hoang, Antony John Kunnan... | Oct 17th, 2016 | journalArticle

Human-level concept learning through probabilistic program induction

Brenden M. Lake, Ruslan Salakhutdinov, J... | Dec 11th, 2015 | journalArticle

Handwritten characters drawn by a model. Not only do children learn effortlessly, they do so quickly and with a remarkable ability to use what they have learned as the raw material for creating new stuff. Lake et al. describe a computational model that learns in a similar fashion and does so better than current deep learning algorithms. The model classifies, parses, and recreates handwritten characters, and can generate new letters of the...

Comparing the Effectiveness of Self‐Paced and Collaborative Frame‐of‐Reference Training on Rater Accuracy in a Large‐Scale Writing Assessment

Kevin R. Raczynski, Allan S. Cohen, Geor... | Sep 17th, 2015 | journalArticle

There is a large body of research on the effectiveness of rater training methods in the industrial and organizational psychology literature. Less has been reported in the measurement literature on large‐scale writing assessments. This study compared the effectiveness of two widely used rater training methods—self‐paced and collaborative frame‐of‐reference training—in the context of a large‐scale writing assessment. Sixty‐six raters were randomly assigned to the training methods. After...

Using Technology-Enhanced Processes to Generate Test Items in Multiple Languages

Mark J. Gierl, Hollis Lai, Karen Fung | Aug 20th, 2015 | bookSection

CIDEr: Consensus-based Image Description Evaluation

Ramakrishna Vedantam, C. Lawrence Zitnic... | Jun 17th, 2015 | preprint

Automatically describing an image with a sentence is a long-standing challenge in computer vision and natural language processing. Due to recent progress in object detection, attribute classification, action recognition, etc., there is renewed interest in this area. However, evaluating the quality of descriptions has proven to be challenging. We propose a novel paradigm for evaluating image descriptions that uses human consensus. This paradigm consists of three main parts: a new...

The Impact of Training Data on Automated Short Answer Scoring Performance

Michael Heilman, Nitin Madnani | Mar 17th, 2015 | conferencePaper

LeBLEU: N-gram-based Translation Evaluation Score for Morphologically Complex Languages

Sami Virpioja, Stig-Arne Grönroos | Mar 17th, 2015 | conferencePaper

The Eras and Trends of Automatic Short Answer Grading

Steven Burrows, Iryna Gurevych, Benno St... | Oct 23rd, 2014 | journalArticle

Automated versus human scoring: A case study in an EFL context

S.-J Huang | Jul 17th, 2014 | journalArticle
