702 resources
- weval-org | Oct 11th, 2025 | webpage
Contribute to weval-org/app development by creating an account on GitHub.
- Tejal Patwardhan, Rachel Dias, Elizabeth... | Oct 5th, 2025 | preprint
We introduce GDPval, a benchmark evaluating AI model capabilities on real-world economically valuable tasks. GDPval covers the majority of U.S. Bureau of Labor Statistics Work Activities for 44 occupations across the top 9 sectors contributing to U.S. GDP (Gross Domestic Product). Tasks are constructed from the representative work of industry professionals with an average of 14 years of experience. We find that frontier model performance on GDPval is improving roughly linearly over time, and...
- Justin Reich | Oct 3rd, 2025 | webpage
It can take years to collect evidence that shows effective uses of new technologies in schools. Unfortunately, early guesses sometimes go seriously wrong.
- Rogers Kaliisa, Kamila Misiejuk, Sonsole... | Sep 28th, 2025 | journalArticle
This exploratory meta-analysis synthesises current research on the effectiveness of Artificial Intelligence (AI)-generated feedback compared to traditional human-provided feedback. Drawing on 41 studies involving a total of 4813 students, the findings reveal no statistically significant differences in learning performance between students who received AI-generated feedback and those who received human-provided feedback. The pooled effect size was small and statistically insignificant...
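As a point of reference for the pooled result mentioned in the abstract, the sketch below shows how an inverse-variance pooled effect size and its confidence interval are typically computed; the numbers are made up for illustration, and the meta-analysis itself presumably uses a random-effects model and Hedges' g rather than this simplified fixed-effect version.

```python
# Hypothetical illustration of inverse-variance pooling (not the study's data).
import numpy as np

g = np.array([0.12, -0.05, 0.30, 0.02])        # made-up per-study effect sizes
v = np.array([0.04, 0.02, 0.09, 0.03])         # made-up sampling variances

w = 1.0 / v                                    # inverse-variance weights
pooled = np.sum(w * g) / np.sum(w)             # fixed-effect pooled estimate
se = np.sqrt(1.0 / np.sum(w))                  # standard error of the pooled estimate
ci = (pooled - 1.96 * se, pooled + 1.96 * se)  # 95% confidence interval

print(f"pooled g = {pooled:.3f}, 95% CI = [{ci[0]:.3f}, {ci[1]:.3f}]")
# A confidence interval that spans zero corresponds to "no statistically
# significant difference" between AI-generated and human-provided feedback.
```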
- Valentin Hofmann, David Heineman, Ian Ma... | Sep 14th, 2025 | preprint
Language model (LM) benchmarking faces several challenges: comprehensive evaluations are costly, benchmarks often fail to measure the intended capabilities, and evaluation quality can degrade due to labeling errors and benchmark saturation. Although various strategies have been proposed to mitigate these issues, they tend to address individual aspects in isolation, neglecting broader questions about overall evaluation quality. Here, we introduce FLUID BENCHMARKING, a new evaluation approach...
- Huaiyuan Yao, Wanpeng Xu, Justin Turnau,... | Sep 1st, 2025 | preprint
Preparing high-quality instructional materials remains a labor-intensive process that often requires extensive coordination among teaching faculty, instructional designers, and teaching assistants. In this work, we present Instructional Agents, a multi-agent large language model (LLM) framework designed to automate end-to-end course material generation, including syllabus creation, lecture scripts, LaTeX-based slides, and assessments. Unlike existing AI-assisted educational tools that focus...
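The abstract describes a pipeline of coordinated LLM roles. The sketch below is a minimal, generic illustration of that pattern, not the paper's Instructional Agents implementation; `call_llm` is a hypothetical helper standing in for whatever chat-completion API is used, and the role prompts are invented.

```python
# Minimal sketch of a role-based pipeline for course-material drafting.
# `call_llm` is a hypothetical helper, and the roles/prompts are illustrative.

def call_llm(system: str, prompt: str) -> str:
    """Placeholder: send `prompt` to an LLM with a role-setting system message."""
    raise NotImplementedError("wire this to your LLM provider of choice")

def generate_course(topic: str) -> dict:
    syllabus = call_llm("You are a syllabus designer.",
                        f"Draft a 12-week syllabus for a course on {topic}.")
    slides = call_llm("You are a slide author who writes LaTeX/Beamer.",
                      f"Write Beamer slides for week 1 of this syllabus:\n{syllabus}")
    quiz = call_llm("You are an assessment writer.",
                    f"Write 5 quiz questions aligned with these slides:\n{slides}")
    return {"syllabus": syllabus, "slides": slides, "quiz": quiz}
```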
- Yue Huang, Corey Palermo, Ruitao Liu | Aug 27th, 2025 | journalArticle
- Danielle R. Thomas, Conrad Borchers, Jio... | Jun 20th, 2025 | preprint
Tutoring improves student achievement, but identifying and studying what tutoring actions are most associated with student learning at scale based on audio transcriptions is an open research problem. This present study investigates the feasibility and scalability of using generative AI to identify and evaluate specific tutor moves in real-life math tutoring. We analyze 50 randomly selected transcripts of college-student remote tutors assisting middle school students in mathematics. Using...
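A rough illustration of the general approach of coding transcripts with an LLM follows; the move labels, prompt, and `call_llm` helper are all placeholders rather than the study's actual taxonomy or protocol.

```python
# Illustrative sketch: label each tutor utterance with a coarse "tutor move".
# The label set and prompt are invented; `call_llm` is a hypothetical helper.
MOVES = ["praise_effort", "give_hint", "ask_question", "explain", "other"]

def call_llm(prompt: str) -> str:
    raise NotImplementedError("wire this to your LLM provider of choice")

def label_tutor_moves(transcript: list[str]) -> list[str]:
    labels = []
    for utterance in transcript:
        prompt = (
            "Classify the tutor utterance into exactly one of: "
            f"{', '.join(MOVES)}.\nUtterance: {utterance}\nLabel:"
        )
        raw = call_llm(prompt).strip().lower()
        labels.append(raw if raw in MOVES else "other")  # guard against free-form output
    return labels
```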
- Leonie V. D. E. Vogelsmeier, Eduardo Oli... | Jun 16th, 2025 | preprint
Large language models (LLMs) offer the potential to simulate human-like responses and behaviors, creating new opportunities for psychological science. In the context of self-regulated learning (SRL), if LLMs can reliably simulate survey responses at scale and speed, they could be used to test intervention scenarios, refine theoretical models, augment sparse datasets, and represent hard-to-reach populations. However, the validity of LLM-generated survey responses remains uncertain, with...
- Reva Schwartz, Rumman Chowdhury, Akash K... | May 24th, 2025 | preprint
Conventional AI evaluation approaches concentrated within the AI stack exhibit systemic limitations for exploring, navigating and resolving the human and societal factors that play out in real world deployment such as in education, finance, healthcare, and employment sectors. AI capability evaluations can capture detail about first-order effects, such as whether immediate system outputs are accurate, or contain toxic, biased or stereotypical content, but AI's second-order effects, i.e. any...
- Martin Elias De Simone, Federico Hernan T... | May 20th, 2025 | webpage
From Chalkboards to Chatbots: Evaluating the Impact of Generative AI on Learning Outcomes in Nigeria (English)
- Andy Clark | May 19th, 2025 | journalArticle
As human-AI collaborations become the norm, we should remind ourselves that it is our basic nature to build hybrid thinking systems – ones that fluidly incorporate non-biological resources. Recognizing this invites us to change the way we think about both the threats and promises of the coming age.
- Jadon Geathers, Yann Hicke, Colleen Chan... | May 15th, 2025 | preprint
Objective Structured Clinical Examinations (OSCEs) are widely used to assess medical students' communication skills, but scoring interview-based assessments is time-consuming and potentially subject to human bias. This study explored the potential of large language models (LLMs) to automate OSCE evaluations using the Master Interview Rating Scale (MIRS). We compared the performance of four state-of-the-art LLMs (GPT-4o, Claude 3.5, Llama 3.1, and Gemini 1.5 Pro) in evaluating OSCE...
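For readers unfamiliar with LLM-as-rater setups, the sketch below shows one minimal way to elicit rubric scores from a transcript; the item names, scale, prompt wording, and `call_llm` helper are assumptions, not the MIRS instrument or the paper's pipeline.

```python
# Hedged sketch of LLM-based rubric scoring, loosely in the spirit of the abstract.
import re

def call_llm(prompt: str) -> str:
    raise NotImplementedError("wire this to your LLM provider of choice")

EXAMPLE_ITEMS = ["opening the interview", "empathy and rapport", "summarizing"]

def score_transcript(transcript: str, items: list[str]) -> dict[str, int]:
    scores = {}
    for item in items:
        prompt = (
            f"Rate the student's '{item}' in the interview below on a 1-5 scale "
            "(1 = poor, 5 = excellent). Reply with a single integer.\n\n"
            f"{transcript}"
        )
        match = re.search(r"[1-5]", call_llm(prompt))
        scores[item] = int(match.group()) if match else 0  # 0 = unparseable reply
    return scores
```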
- Jin Wang, Wenxiang Fan | May 6th, 2025 | journalArticle
- Sahar Yarmohammadtoosky, Yiyun Zhou, Vic... | Apr 30th, 2025 | preprint
This study examines vulnerabilities in transformer-based automated short-answer grading systems used in medical education, with a focus on how these systems can be manipulated through adversarial gaming strategies. Our research identifies three main types of gaming strategies that exploit the system's weaknesses, potentially leading to false positives. To counteract these vulnerabilities, we implement several adversarial training methods designed to enhance the systems' robustness. Our...
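One generic way to operationalize adversarial training against such gaming strategies is to augment the grader's training data with deliberately gamed answers labeled as incorrect; the sketch below illustrates that idea with an invented `keyword_stuff` strategy and is not the paper's method.

```python
# Conceptual sketch of adversarial augmentation for a short-answer grader.
# Gamed variants of incorrect answers are added with the "incorrect" label so
# the grader learns not to reward keyword stuffing or verbosity.

def keyword_stuff(answer: str, rubric_keywords: list[str]) -> str:
    """Append rubric keywords without real content: one simple gaming strategy."""
    return answer + " " + " ".join(rubric_keywords)

def augment(dataset: list[tuple[str, int]], rubric_keywords: list[str]):
    """dataset: (answer, label) pairs with label 1 = correct, 0 = incorrect."""
    augmented = list(dataset)
    for answer, label in dataset:
        if label == 0:
            augmented.append((keyword_stuff(answer, rubric_keywords), 0))
    return augmented
```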
- Apr 26th, 2025 | webpage
- Apr 26th, 2025 | webpage
AI systems are no longer just specialized research tools: they’re everyday academic companions. As AIs integrate more deeply into educational environments, we need to consider important questions about learning, assessment, and skill development. Until now, most discussions have relied on surveys and controlled experiments rather than direct evidence of how students naturally integrate AI into their academic work in real settings.
- The Learning Partnership | Apr 26th, 2025 | report
- Pooya Razavi, Sonya J. Powers | Apr 9th, 2025 | preprint
Estimating item difficulty through field-testing is often resource-intensive and time-consuming. As such, there is strong motivation to develop methods that can predict item difficulty at scale using only the item content. Large Language Models (LLMs) represent a new frontier for this goal. The present research examines the feasibility of using an LLM to predict item difficulty for K-5 mathematics and reading assessment items (N = 5170). Two estimation approaches were implemented: (a) a...
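The abstract is truncated before it names the two estimation approaches, so the sketch below is only a generic illustration of eliciting a difficulty estimate from an LLM and correlating it with field-test values; the prompt, scale, and `call_llm` helper are assumptions rather than the paper's design.

```python
# Rough sketch: predict item difficulty with an LLM, then compare to field-test data.
import numpy as np

def call_llm(prompt: str) -> str:
    raise NotImplementedError("wire this to your LLM provider of choice")

def predict_difficulty(item_text: str) -> float:
    prompt = (
        "Estimate the proportion of grade-level students who would answer this "
        f"item correctly, as a number between 0 and 1.\n\nItem: {item_text}"
    )
    try:
        return float(call_llm(prompt).strip())
    except ValueError:
        return float("nan")  # unparseable reply

def evaluate(predicted: list[float], field_test: list[float]) -> float:
    """Pearson correlation between predicted and observed item p-values."""
    return float(np.corrcoef(predicted, field_test)[0, 1])
```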
- Elisabetta Mazzullo, Okan Bulut, Cole Wa... | Mar 31st, 2025 | conferencePaper