Results – Evidence Library – Artificial Intelligence in Measurement and Education

Towards a Unified Multi-Dimensional Evaluator for Text Generation

Ming Zhong, Yang Liu, Da Yin | Apr 24th, 2022 | preprint

Multi-dimensional evaluation is the dominant paradigm for human evaluation in Natural Language Generation (NLG), i.e., evaluating the generated text from multiple explainable dimensions, such as coherence and fluency. However, automatic evaluation in NLG is still dominated by similarity-based metrics, and we lack a reliable framework for a more comprehensive evaluation of advanced models. In this paper, we propose a unified multi-dimensional evaluator UniEval for NLG. We re-frame NLG...

Psychometric Methods to Evaluate Measurement and Algorithmic Bias in Automated Scoring

Matthew S. Johnson, Xiang Liu, Daniel F.... | Sep 24th, 2022 | journalArticle

Fully Adaptive Framework: Neural Computerized Adaptive Testing for Online Education

Yan Zhuang, Qi Liu, Zhenya Huang | Jun 28th, 2022 | journalArticle

Computerized Adaptive Testing (CAT) refers to an efficient and personalized test mode in online education, aiming to accurately measure student proficiency level on the required subject/domain. The key component of CAT is the "adaptive" question selection algorithm, which automatically selects the best suited question for student based on his/her current estimated proficiency, reducing test length. Existing algorithms rely on some manually designed and pre-fixed informativeness/uncertainty...

Automated Scoring for Reading Comprehension via In-context BERT Tuning

Nigel Fernandez, Aritra Ghosh, Naiming L... | Apr 24th, 2022 | preprint

Automated scoring of open-ended student responses has the potential to significantly reduce human grader effort. Recent advances in automated scoring often leverage textual representations based on pre-trained language models such as BERT and GPT as input to scoring models. Most existing approaches train a separate model for each item/question, which is suitable for scenarios such as essay scoring where items can be quite different from one another. However, these approaches have two...

A neural network solves, explains, and generates university math problems by program synthesis and few-shot learning at human level

Iddo Drori, Sarah Zhang, Reece Shuttlewo... | Aug 2nd, 2022 | journalArticle

We demonstrate that a neural network pretrained on text and fine-tuned on code solves mathematics course problems, explains solutions, and generates questions at a human level. We automatically synthesize programs using few-shot learning and OpenAI’s Codex transformer and execute them to solve course problems at 81% automatic accuracy. We curate a dataset of questions from Massachusetts Institute of Technology (MIT)’s largest mathematics courses (Single Variable and Multivariable Calculus,...
