28 resources
- Shashank Sonkar, Naiming Liu, Debshila M... | Jan 22nd, 2023 | conference paper
- Jinlan Fu, See-Kiong Ng, Zhengbao Jiang,... | Jan 22nd, 2023 | journal article
Generative Artificial Intelligence (AI) has enabled the development of sophisticated models that are capable of producing high-caliber text, images, and other outputs through the utilization of large pre-trained models. Nevertheless, assessing the quality of the generation is an even more arduous task than the generation itself, and this issue has not been given adequate consideration recently. This paper proposes a novel evaluation framework, GPTScore, which utilizes the emergent abilities...
- Gyeong-Geon Lee, Ehsan Latif, Xuansheng ... | Jan 22nd, 2023 | journal article
This study investigates the application of large language models (LLMs), specifically GPT-3.5 and GPT-4, with Chain-of-Thought (CoT) in the automatic scoring of student-written responses to science assessments. We focused on overcoming the challenges of accessibility, technical complexity, and lack of explainability that have previously limited the use of artificial intelligence-based automatic scoring tools among researchers and educators. With a testing dataset comprising six assessment...
- Annenberg Institute at Brown... | Jun 3rd, 2023 | report
Providing consistent, individualized feedback to teachers is essential for improving instruction but can be prohibitively resource-intensive in most educational contexts. We develop M-Powering Teachers, an automated tool based on natural language processing to give teachers feedback on their uptake of student contributions, a high-leverage dialogic teaching practice that makes students feel heard. We conduct a randomized controlled trial in an online computer science course (n=1,136...
- Yang Liu, Dan Iter, Yichong Xu | Jan 22nd, 2023 | preprint
The quality of texts generated by natural language generation (NLG) systems is hard to measure automatically. Conventional reference-based metrics, such as BLEU and ROUGE, have been shown to have relatively low correlation with human judgments, especially for tasks that require creativity and diversity. Recent studies suggest using large language models (LLMs) as reference-free metrics for NLG evaluation, which have the benefit of being applicable to new tasks that lack human references....
- Chi-Min Chan, Weize Chen, Yusheng Su | Aug 14th, 2023 | preprint
Text evaluation has historically posed significant challenges, often demanding substantial labor and time cost. With the emergence of large language models (LLMs), researchers have explored LLMs' potential as alternatives for human evaluation. While these single-agent-based approaches show promise, experimental results suggest that further advancements are needed to bridge the gap between their current effectiveness and human-level evaluation quality. Recognizing that best practices of human...
- Steven Moore, John Stamper, Richard Tong... | Jul 7th, 2023 | conference paper
- Andrew M. Olney, Steven Moore, John Stam... | Jul 7th, 2023 | conference paper
- Matyáš Boháček, Steven Moore, John Stamp... | Jul 7th, 2023 | conference paper
- Shashank Sonkar, Richard G. Baraniuk, St... | Jul 7th, 2023 | conference paper
- Md Rayhan Kabir, Fuhua Lin, Steven Moore... | Jul 7th, 2023 | conference paper
- Gautam Yadav, Ying-Jui Tseng, Xiaolin Ni... | Jul 7th, 2023 | conference paper
- Qianou Christina Ma, Sherry Tongshuang W... | Jul 7th, 2023 | conference paper
- Benjamin D. Nye, Dillon Mee, Mark G. Cor... | Jul 7th, 2023 | conference paper
- Shouvik Ahmed Antu, Haiyan Chen, Cindy K... | Jul 7th, 2023 | conference paper
- Bor-Chen Kuo, Frederic T. Y. Chang, Zong... | Jul 7th, 2023 | conference paper
- Daniel Leiker, Sara Finnigan, Ashley Ric... | Jul 7th, 2023 | conference paper