Full Library – Evidence Library – Artificial Intelligence in Measurement and Education

Rubric development for AI-enabled scoring of three-dimensional constructed-response assessment aligned to NGSS learning progression

Leonora Kaldaras, Nicholas R. Yoshida, K... | Nov 25th, 2022 | journalArticle

The Framework for K-12 Science Education (the Framework) and the Next Generation Science Standards (NGSS) define three dimensions of science: disciplinary core ideas, scientific and engineering practices, and crosscutting concepts, and emphasize the integration of the three dimensions (3D) to reflect deep science understanding. The Framework also emphasizes the importance of using learning progressions (LPs) as roadmaps to guide assessment development. These assessments capable of measuring...

A scoping review on the use of speech-to-text technology for adolescents with learning difficulties in secondary education

Marianne Engen Matre, David Lansing Came... | Nov 25th, 2022 | journalArticle

To identify and describe the aims, methodological approaches, and major findings of studies on the use of STT among secondary pupils (age 12–18) with learning difficulties published from January 2000 to April 2022. This scoping review includes empirical studies published in peer-reviewed journals and grey literature between January 2000 and April 2022. Searches were conducted in April 2022 in three databases: ERIC, PsycINFO and Scopus. In addition, related reviews were manually screened for...

Estimating the Carbon Footprint of BLOOM, a 176B Parameter Language Model

Alexandra Sasha Luccioni, Sylvain Viguie... | Nov 3rd, 2022 | preprint

Progress in machine learning (ML) comes with a cost to the environment, given that training ML models requires significant computational resources, energy and materials. In the present article, we aim to quantify the carbon footprint of BLOOM, a 176-billion parameter language model, across its life cycle. We estimate that BLOOM's final training emitted approximately 24.7 tonnes of CO2eq if we consider only the dynamic power consumption, and 50.5 tonnes if we account for all processes...

Validity of Chatbot Use for Mental Health Assessment: Experimental Study

Anita Schick, Jasper Feine, Stefan Moran... | Oct 31st, 2022 | journalArticle

Mental disorders in adolescence and young adulthood are major public health concerns. Digital tools such as text-based conversational agents (ie, chatbots) are a promising technology for facilitating mental health assessment. However, the human-like interaction style of chatbots may induce potential biases, such as socially desirable responding (SDR), and may require further effort to complete assessments.

Best Practices for Constructed‐Response Scoring

Daniel F. McCaffrey, Jodi M. Casabianca,... | Oct 22nd, 2022 | journalArticle

This document describes a set of best practices for developing, implementing, and maintaining the critical process of scoring constructed‐response tasks. These practices address both the use of human raters and automated scoring systems as part of the scoring process and cover the scoring of written, spoken, performance, or multimodal responses. Best Practices for Constructed‐Response Scoring is designed not to act as an independent guide, but rather to be used in conjunction with other ETS...

An Evaluation of Automatic Item Generation: A Case Study of Weak Theory Approach

Yanyan Fu, Edison M. Choe, Hwanggyu Lim,... | Oct 6th, 2022 | journalArticle

This case study applied the weak theory of Automatic Item Generation (AIG) to generate isomorphic item instances (i.e., unique but psychometrically equivalent items) for a large‐scale assessment. Three representative instances were selected from each item template (i.e., model) and pilot‐tested. In addition, a new analytical framework, differential child item functioning (DCIF) analysis, based on the existing differential item functioning statistics, was applied to evaluate the psychometric...

The Intertwined Histories of Artificial Intelligence and Education

Shayan Doroudi | Oct 4th, 2022 | journalArticle

In this paper, I argue that the fields of artificial intelligence (AI) and education have been deeply intertwined since the early days of AI. Specifically, I show that many of the early pioneers of AI were cognitive scientists who also made pioneering and impactful contributions to the field of education. These researchers saw AI as a tool for thinking about human learning and used their understanding of how people learn to further AI. Furthermore, I trace two distinct approaches to thinking...

Of Human Criteria and Automatic Metrics: A Benchmark of the Evaluation of Story Generation

Cyril Chhun, Pierre Colombo, Chloé Clave... | Sep 15th, 2022 | preprint

Research on Automatic Story Generation (ASG) relies heavily on human and automatic evaluation. However, there is no consensus on which human evaluation criteria to use, and no analysis of how well automatic criteria correlate with them. In this paper, we propose to re-evaluate ASG evaluation. We introduce a set of 6 orthogonal and comprehensive human criteria, carefully motivated by the social sciences literature. We also present HANNA, an annotated dataset of 1,056 stories produced by 10...

Semi‐automatic coding of open‐ended text responses in large‐scale assessments

Nico Andersen, Fabian Zehner, Frank Gold... | Sep 11th, 2022 | journalArticle

In the context of large‐scale educational assessments, the effort required to code open‐ended text responses is considerably more expensive and time‐consuming than the evaluation of multiple‐choice responses because it requires trained personnel and long manual coding sessions. Aim: Our semi‐supervised coding method eco (exploring coding assistant) dynamically supports human raters by automatically coding a subset of the responses. Method: We map normalized response texts into a semantic space and...

Comparing Measures From Computer-Administered and Examiner-Administered Narrative Retells in Spanish: A Pilot Study

John Heilmann, Denise Finneran, Maura Mo... | Sep 7th, 2022 | journalArticle

Narrative language sample analysis (LSA) is a recommended best practice for the assessment of monolingual and bilingual children. With business-as-usual narrative LSA, examiners are actively involved in all aspects of the elicitation. Software advancements have shown multiple benefits of computer-administered language assessments, some of which may be beneficial for narrative assessments, particularly for bilingual children. The goal of this pilot study was to test the feasibility of...

Using Chatbots as AI Conversational Partners in Language Learning

Jose Belda-Medina, José Ramón Calvo-Ferr... | Aug 24th, 2022 | journalArticle

Recent advances in Artificial Intelligence (AI) and machine learning have paved the way for the increasing adoption of chatbots in language learning. Research published to date has mostly focused on chatbot accuracy and chatbot–human communication from students’ or in-service teachers’ perspectives. This study aims to examine the knowledge, level of satisfaction and perceptions concerning the integration of conversational AI in language learning among future educators. In this mixed method...

Using HeuristicsMiner to Analyze Problem-Solving Processes: Exemplary Use Case of a Productive-Failure Study

Christian Hartmann, Nikol Rummel, Maria ... | Aug 9th, 2022 | journalArticle

This paper presents a fine-grained process analysis of 22 students in a classroom-based learning setting. The students engaged (and failed) in problem-solving attempts prior to instruction (i.e., the Productive-Failure approach). We used the HeuristicsMiner algorithm to analyze the data of a quasi-experimental study. The applied algorithm allowed us to investigate temporally structured think-aloud data, to outline productive and unproductive problem-solving strategies. Our analyses and...

The Machines Take Over: A Comparison of Various Supervised Learning Approaches for Automated Scoring of Divergent Thinking Tasks

Philip Buczak, He Huang, Boris Forthmann... | Aug 8th, 2022 | journalArticle

Traditionally, researchers employ human raters for scoring responses to creative thinking tasks. Apart from the associated costs, this approach entails two potential risks. First, human raters can be subjective in their scoring behavior (inter‐rater‐variance). Second, individual raters are prone to inconsistent scoring patterns (intra‐rater‐variance). In light of these issues, we present an approach for automated scoring of Divergent Thinking (DT) Tasks. We implemented a pipeline aiming to...

Automatic grading for Arabic short answer questions using optimized deep learning model

Mustafa Abdul Salam, Mohamed Abd El-Fata... | Aug 2nd, 2022 | journalArticle

Auto-grading of short answer questions is considered a challenging problem in the processing of natural language. It requires a system to comprehend the free text answers to automatically assign a grade for a student answer compared to one or more model answers. This paper suggests an optimized deep learning model for grading short-answer questions automatically by using various sizes of datasets collected in the Science subject for students in seventh grade in Egypt. The proposed system is...

A neural network solves, explains, and generates university math problems by program synthesis and few-shot learning at human level

Iddo Drori, Sarah Zhang, Reece Shuttlewo... | Aug 2nd, 2022 | journalArticle

We demonstrate that a neural network pretrained on text and fine-tuned on code solves mathematics course problems, explains solutions, and generates questions at a human level. We automatically synthesize programs using few-shot learning and OpenAI’s Codex transformer and execute them to solve course problems at 81% automatic accuracy. We curate a dataset of questions from Massachusetts Institute of Technology (MIT)’s largest mathematics courses (Single Variable and Multivariable Calculus,...

The interactive reading task: Transformer-based automatic item generation

Yigal Attali, Andrew Runge, Geoffrey T. ... | Jul 22nd, 2022 | journalArticle

Automatic item generation (AIG) has the potential to greatly expand the number of items for educational assessments, while simultaneously allowing for a more construct-driven approach to item development. However, the traditional item modeling approach in AIG is limited in scope to content areas that are relatively easy to model (such as math problems), and depends on highly skilled content experts to create each model. In this paper we describe the interactive reading task, a...

On the Opportunities and Risks of Foundation Models

Rishi Bommasani, Drew A. Hudson, Ehsan A... | Jul 12th, 2022 | preprint

AI is undergoing a paradigm shift with the rise of models (e.g., BERT, DALL-E, GPT-3) that are trained on broad data at scale and are adaptable to a wide range of downstream tasks. We call these models foundation models to underscore their critically central yet incomplete character. This report provides a thorough account of the opportunities and risks of foundation models, ranging from their capabilities (e.g., language, vision, robotics, reasoning, human interaction) and technical...

InfoLM: A New Metric to Evaluate Summarization & Data2Text Generation

Pierre Jean A. Colombo, Chloé Clavel, Pa... | Jun 28th, 2022 | journalArticle

Assessing the quality of natural language generation (NLG) systems through human annotation is very expensive. Additionally, human annotation campaigns are time-consuming and include non-reusable human labour. In practice, researchers rely on automatic metrics as a proxy of quality. In the last decade, many string-based metrics (e.g., BLEU or ROUGE) have been introduced. However, such metrics usually rely on exact matches and thus, do not robustly handle synonyms. In this paper, we introduce...
