Results – Evidence Library – Artificial Intelligence in Measurement and Education

Applying large language models for automated essay scoring for non-native Japanese

Wenchao Li, Haitao Liu

|

Jun 3rd, 2024

|

journalArticle

Wenchao Li, Haitao Liu

Jun 3rd, 2024

Recent advancements in artificial intelligence (AI) have led to an increased use of large language models (LLMs) for language assessment tasks such as automated essay scoring (AES), automated listening tests, and automated oral proficiency assessments. The application of LLMs for AES in the context of non-native Japanese, however, remains limited. This study explores the potential of LLM-based AES by comparing the efficiency of different models, i.e. two conventional machine training...

Annotation Guidelines-Based Knowledge Augmentation: Towards Enhancing Large Language Models for Educational Text Classification

Liu

|

Aug 22nd, 2024

|

conferencePaper

Liu

Aug 22nd, 2024

BARTScore: Evaluating Generated Text as Text Generation

Weizhe Yuan, Graham Neubig, Pengfei Liu,...

|

Jan 22nd, 2021

|

journalArticle

Weizhe Yuan, Graham Neubig, Pengfei Liu,...

Jan 22nd, 2021

A wide variety of NLP applications, such as machine translation, summarization, and dialog, involve text generation. One major challenge for these applications is how to evaluate whether such generated texts are actually fluent, accurate, or effective. In this work, we conceptualize the evaluation of generated text as a text generation problem, modeled using pre-trained sequence-to-sequence models. The general idea is that models trained to convert the generated text to/from a reference...

BARTScore: Evaluating Generated Text as Text Generation

Weizhe Yuan, Graham Neubig, Pengfei Liu,...

|

Jan 22nd, 2021

|

journalArticle

Weizhe Yuan, Graham Neubig, Pengfei Liu,...

Jan 22nd, 2021

A wide variety of NLP applications, such as machine translation, summarization, and dialog, involve text generation. One major challenge for these applications is how to evaluate whether such generated texts are actually fluent, accurate, or effective. In this work, we conceptualize the evaluation of generated text as a text generation problem, modeled using pre-trained sequence-to-sequence models. The general idea is that models trained to convert the generated text to/from a reference...

An Evaluation of Automated Item Scoring in the Context of Inter-Rater Reliability

Kate Nolan, Youngsoon Kang, Ran Liu

|

Jan 22nd, 2025

|

presentation

Kate Nolan, Youngsoon Kang, Ran Liu

Jan 22nd, 2025

Leveraging LLM respondents for item evaluation: A psychometric analysis

Yunting Liu, Shreya Bhandari, Zachary A....

|

Feb 24th, 2025

|

journalArticle

Yunting Liu, Shreya Bhandari, Zachary A....

Feb 24th, 2025

Effective educational measurement relies heavily on the curation of well‐designed item pools. However, item calibration is time consuming and costly, requiring a sufficient number of respondents to estimate the psychometric properties of items. In this study, we explore the potential of six different large language models (LLMs; GPT‐3.5, GPT‐4, Llama 2, Llama 3, Gemini‐Pro and Cohere Command R Plus) to generate responses with psychometric properties comparable to those of human respondents....

Leveraging LLM respondents for item evaluation: A psychometric analysis

Yunting Liu, Shreya Bhandari, Zachary A....

|

Feb 24th, 2025

|

journalArticle

Yunting Liu, Shreya Bhandari, Zachary A....

Feb 24th, 2025

Effective educational measurement relies heavily on the curation of well‐designed item pools. However, item calibration is time consuming and costly, requiring a sufficient number of respondents to estimate the psychometric properties of items. In this study, we explore the potential of six different large language models (LLMs; GPT‐3.5, GPT‐4, Llama 2, Llama 3, Gemini‐Pro and Cohere Command R Plus) to generate responses with psychometric properties comparable to those of human respondents....

Towards a Unified Multi-Dimensional Evaluator for Text Generation

Ming Zhong, Yang Liu, Da Yin

|

Jan 22nd, 2022

|

preprint

Ming Zhong, Yang Liu, Da Yin

Jan 22nd, 2022

Multi-dimensional evaluation is the dominant paradigm for human evaluation in Natural Language Generation (NLG), i.e., evaluating the generated text from multiple explainable dimensions, such as coherence and fluency. However, automatic evaluation in NLG is still dominated by similarity-based metrics, and we lack a reliable framework for a more comprehensive evaluation of advanced models. In this paper, we propose a unified multi-dimensional evaluator UniEval for NLG. We re-frame NLG...

Towards a Unified Multi-Dimensional Evaluator for Text Generation

Ming Zhong, Yang Liu, Da Yin

|

Jan 22nd, 2022

|

preprint

Ming Zhong, Yang Liu, Da Yin

Jan 22nd, 2022

Multi-dimensional evaluation is the dominant paradigm for human evaluation in Natural Language Generation (NLG), i.e., evaluating the generated text from multiple explainable dimensions, such as coherence and fluency. However, automatic evaluation in NLG is still dominated by similarity-based metrics, and we lack a reliable framework for a more comprehensive evaluation of advanced models. In this paper, we propose a unified multi-dimensional evaluator UniEval for NLG. We re-frame NLG...

Psychometric Methods to Evaluate Measurement and Algorithmic Bias in Automated Scoring

Matthew S. Johnson, Xiang Liu, Daniel F....

|

Sep 22nd, 2022

|

journalArticle

Matthew S. Johnson, Xiang Liu, Daniel F....

Sep 22nd, 2022

CLASS: A Design Framework for Building Intelligent Tutoring Systems Based on Learning Science principles

Shashank Sonkar, Naiming Liu, Debshila M...

|

Jan 22nd, 2023

|

conferencePaper

Shashank Sonkar, Naiming Liu, Debshila M...

Jan 22nd, 2023

An Early Review of Generative Language Models in Automated Writing Evaluation: Advancements, Challenges, and Future Directions for Automated Essay Scoring and Feedback Generation

Yue Huang, Corey Palermo, Ruitao Liu

|

Aug 27th, 2025

|

journalArticle

Yue Huang, Corey Palermo, Ruitao Liu

Aug 27th, 2025

Navigate through Enigmatic Labyrinth A Survey of Chain of Thought Reasoning: Advances, Frontiers and Future

Zheng Chu, Jingchang Chen, Qianglong Che...

|

Jan 22nd, 2024

|

preprint

Zheng Chu, Jingchang Chen, Qianglong Che...

Jan 22nd, 2024

Reasoning, a fundamental cognitive process integral to human intelligence, has garnered substantial interest within artificial intelligence. Notably, recent studies have revealed that chain-of-thought prompting significantly enhances LLM's reasoning capabilities, which attracts widespread attention from both academics and industry. In this paper, we systematically investigate relevant research, summarizing advanced methods through a meticulous taxonomy that offers novel perspectives....

GPTScore: Evaluate as You Desire

Jinlan Fu, See-Kiong Ng, Zhengbao Jiang,...

|

Jan 22nd, 2023

|

journalArticle

Jinlan Fu, See-Kiong Ng, Zhengbao Jiang,...

Jan 22nd, 2023

Generative Artificial Intelligence (AI) has enabled the development of sophisticated models that are capable of producing high-caliber text, images, and other outputs through the utilization of large pre-trained models. Nevertheless, assessing the quality of the generation is an even more arduous task than the generation itself, and this issue has not been given adequate consideration recently. This paper proposes a novel evaluation framework, GPTScore, which utilizes the emergent abilities...

GPTScore: Evaluate as You Desire

Jinlan Fu, See-Kiong Ng, Zhengbao Jiang,...

|

Jan 22nd, 2023

|

journalArticle

Jinlan Fu, See-Kiong Ng, Zhengbao Jiang,...

Jan 22nd, 2023

Generative Artificial Intelligence (AI) has enabled the development of sophisticated models that are capable of producing high-caliber text, images, and other outputs through the utilization of large pre-trained models. Nevertheless, assessing the quality of the generation is an even more arduous task than the generation itself, and this issue has not been given adequate consideration recently. This paper proposes a novel evaluation framework, GPTScore, which utilizes the emergent abilities...

Automated Scoring of Chinese Grades 7–9 Students’ Competence in Interpreting and Arguing from Evidence

Cong Wang, Xiufeng Liu, Lei Wang

|

Apr 22nd, 2021

|

journalArticle

Cong Wang, Xiufeng Liu, Lei Wang

Apr 22nd, 2021

Automated Scoring of Chinese Grades 7–9 Students’ Competence in Interpreting and Arguing from Evidence

Cong Wang, Xiufeng Liu, Lei Wang

|

Sep 9th, 2020

|

journalArticle

Cong Wang, Xiufeng Liu, Lei Wang

Sep 9th, 2020

Applying Large Language Models and Chain-of-Thought for Automatic Scoring

Gyeong-Geon Lee, Ehsan Latif, Xuansheng ...

|

Jan 22nd, 2023

|

journalArticle

Gyeong-Geon Lee, Ehsan Latif, Xuansheng ...

Jan 22nd, 2023

This study investigates the application of large language models (LLMs), specifically GPT-3.5 and GPT-4, with Chain-of-Though (CoT) in the automatic scoring of student-written responses to science assessments. We focused on overcoming the challenges of accessibility, technical complexity, and lack of explainability that have previously limited the use of artificial intelligence-based automatic scoring tools among researchers and educators. With a testing dataset comprising six assessment...

Can Automated Feedback Improve Teachers’ Uptake of Student Ideas? Evidence From a Randomized Controlled Trial In a Large-Scale Online Course

Annenberg Institute at Brown...

|

Jun 3rd, 2023

|

report

Annenberg Institute at Brown...

Jun 3rd, 2023

Providing consistent, individualized feedback to teachers is essential for improving instruction but can be prohibitively resource-intensive in most educational contexts. We develop M-Powering Teachers, an automated tool based on natural language processing to give teachers feedback on their uptake of student contributions, a high-leverage dialogic teaching practice that makes students feel heard. We conduct a randomized controlled trial in an online computer science course (n=1,136...

Applying Large Language Models and Chain-of-Thought for Automatic Scoring

Gyeong-Geon Lee, Ehsan Latif, Xuansheng ...

|

Jun 22nd, 2024

|

preprint

Gyeong-Geon Lee, Ehsan Latif, Xuansheng ...

Jun 22nd, 2024

This study investigates the application of large language models (LLMs), specifically GPT-3.5 and GPT-4, with Chain-of-Though (CoT) in the automatic scoring of student-written responses to science assessments. We focused on overcoming the challenges of accessibility, technical complexity, and lack of explainability that have previously limited the use of artificial intelligence-based automatic scoring tools among researchers and educators. With a testing dataset comprising six assessment...

Search

Publication year