Results – Evidence Library – Artificial Intelligence in Measurement and Education

Interaction Challenges in AI Equipped Environments Built to Teach Foreign Languages Through Dialogue and Task-Completion

Rahul R. Divekar, Jaimie Drozdal, Yalun ... | Jun 8th, 2018 | conferencePaper

A comparison of grammatical proficiency measures in the automated assessment of spontaneous speech

Su-Youn Yoon, Suma Bhat | May 24th, 2018 | journalArticle

Understanding Mean Score Differences Between the e‐rater® Automated Scoring Engine and Humans for Demographically Based Groups in the GRE® General Test

Chaitanya Ramineni, David Williamson | Apr 27th, 2018 | journalArticle

Notable mean score differences for the e‐rater® automated scoring engine and for humans for essays from certain demographic groups were observed for the GRE® General Test in use before the major revision of 2012, called rGRE. The use of e‐rater as a check‐score model with discrepancy thresholds prevented an adverse impact on the examinee score at the item or test level. Despite this control, there remains a need to understand the root causes of these demographically based score differences...

ROUGE 2.0: Updated and Improved Measures for Evaluation of Summarization Tasks

Kavita Ganesan | Mar 5th, 2018 | preprint

Evaluation of summarization tasks is extremely crucial to determining the quality of machine generated summaries. Over the last decade, ROUGE has become the standard automatic evaluation measure for evaluating summarization tasks. While ROUGE has been shown to be effective in capturing n-gram overlap between system and human composed summaries, there are several limitations with the existing ROUGE measures in terms of capturing synonymous concepts and coverage of topics. Thus, often times...

The Influence of Rater Effects in Training Sets on the Psychometric Quality of Automated Scoring for Writing Assessments

Stefanie A. Wind, Edward W. Wolfe, Georg... | Jan 2nd, 2018 | journalArticle

Automated Essay Scoring in the Presence of Biased Ratings

Evelin Amorim, Marcia Cançado, Adriano V... | Apr 24th, 2018 | conferencePaper

CNN for Text-Based Multiple Choice Question Answering

Akshay Chaturvedi, Onkar Pandit, Utpal G... | Apr 24th, 2018 | conferencePaper

Yaliang Li, Liuyi Yao, Nan Du | Apr 24th, 2018 | journalArticle

The past few years have witnessed the flourishing of crowdsourced medical question answering (Q&A) websites. Patients who have medical information demands tend to post questions about their health conditions on these crowdsourced Q&A websites and get answers from other users. However, we observe that a large portion of new medical questions cannot be answered in time or receive only few answers from these websites. On the other hand, we notice that solved questions have great...

Towards Controllable Story Generation

Nanyun Peng, Marjan Ghazvininejad, Jonat... | Apr 24th, 2018 | conferencePaper

Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead

Cynthia Rudin | Apr 24th, 2018 | journalArticle

Black box machine learning models are currently being used for high stakes decision-making throughout society, causing problems throughout healthcare, criminal justice, and in other domains. People have hoped that creating methods for explaining these black box models will alleviate some of these problems, but trying to explain black box models, rather than creating models that are interpretable in the first place, is likely to perpetuate bad practices and can potentially...

Use of Automated Scoring Features to Generate Hypotheses Regarding Language-Based DIF

Mark D. Shermis, Liyang Mao, Matthew Mul... | Oct 2nd, 2017 | journalArticle

Approaches to Automated Scoring of Speaking for K–12 English Language Proficiency Assessments

Keelan Evanini, Maurice Cogan Hauck, Ken... | May 2nd, 2017 | journalArticle

This report is the fifth in a series concerning English language proficiency (ELP) assessments for English learners (ELs) in kindergarten through 12th grade in the United States. The series, produced by Educational Testing Service (ETS), is intended to provide theory‐ and evidence‐based principles and recommendations for improving next‐generation ELP assessment systems, policies, and practices and to stimulate discussion on better serving K–12 EL students. The first report articulated a...

Use of Automated Scoring Features to Generate Hypotheses Regarding Language-Based DIF

Mark D. Shermis, Liyang Mao, Matthew Mul... | May 24th, 2017 | journalArticle

A Simple but Tough-to-Beat Baseline for Sentence Embeddings

Sanjeev Arora, Yingyu Liang, Tengyu Ma | Apr 24th, 2017 | conferencePaper

Towards an Automatic Turing Test: Learning to Evaluate Dialogue Responses

Ryan Lowe, Michael Noseworthy, Iulian Vl... | Apr 24th, 2017 | conferencePaper

Building Better Open-Source Tools to Support Fairness in Automated Scoring

Nitin Madnani, Anastassia Loukina, Alina... | Apr 24th, 2017 | conferencePaper

Exploring Automated Essay Scoring for Nonnative English Speakers

Amber Nigam | Apr 24th, 2017 | conferencePaper

Automated Essay Scoring (AES) has been quite popular and is being widely used. However, lack of appropriate methodology for rating nonnative English speakers' essays has meant a lopsided advancement in this field. In this paper, we report initial results of our experiments with nonnative AES that learns from manual evaluation of nonnative essays. For this purpose, we conducted an exercise in which essays written by nonnative English speakers in test environment were rated both manually and...

Investigating neural architectures for short answer scoring

Brian Riordan, Andrea Horbach, Aoife Cah... | Apr 24th, 2017 | conferencePaper

How Does Predicate Invention Affect Human Comprehensibility?

Ute Schmid, Christina Zeller, Tarek Beso... | Apr 24th, 2017 | bookSection

Associated effects of automated essay evaluation software on growth in writing quality for students with and without disabilities

Joshua Wilson | Apr 24th, 2017 | journalArticle
