702 resources
- weval-org | Oct 11th, 2025 | webpage
Contribute to weval-org/app development by creating an account on GitHub.
- Tejal Patwardhan, Rachel Dias, Elizabeth... | Oct 5th, 2025 | preprint
We introduce GDPval, a benchmark evaluating AI model capabilities on real-world economically valuable tasks. GDPval covers the majority of U.S. Bureau of Labor Statistics Work Activities for 44 occupations across the top 9 sectors contributing to U.S. GDP (Gross Domestic Product). Tasks are constructed from the representative work of industry professionals with an average of 14 years of experience. We find that frontier model performance on GDPval is improving roughly linearly over time, and...
- Justin Reich | Oct 3rd, 2025 | webpage
It can take years to collect evidence that shows effective uses of new technologies in schools. Unfortunately, early guesses sometimes go seriously wrong.
- Rogers Kaliisa, Kamila Misiejuk, Sonsole... | Sep 28th, 2025 | journalArticle
This exploratory meta-analysis synthesises current research on the effectiveness of Artificial Intelligence (AI)-generated feedback compared to traditional human-provided feedback. Drawing on 41 studies involving a total of 4813 students, the findings reveal no statistically significant differences in learning performance between students who received AI-generated feedback and those who received human-provided feedback. The pooled effect size was small and statistically insignificant...
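As a point of reference for the pooled result mentioned in the abstract, the sketch below shows how an inverse-variance pooled effect size and its confidence interval are typically computed; the numbers are made up for illustration, and the meta-analysis itself presumably uses a random-effects model and Hedges' g rather than this simplified fixed-effect version.

```python
# Hypothetical illustration of inverse-variance pooling (not the study's data).
import numpy as np

g = np.array([0.12, -0.05, 0.30, 0.02])        # made-up per-study effect sizes
v = np.array([0.04, 0.02, 0.09, 0.03])         # made-up sampling variances

w = 1.0 / v                                    # inverse-variance weights
pooled = np.sum(w * g) / np.sum(w)             # fixed-effect pooled estimate
se = np.sqrt(1.0 / np.sum(w))                  # standard error of the pooled estimate
ci = (pooled - 1.96 * se, pooled + 1.96 * se)  # 95% confidence interval

print(f"pooled g = {pooled:.3f}, 95% CI = [{ci[0]:.3f}, {ci[1]:.3f}]")
# A confidence interval that spans zero corresponds to "no statistically
# significant difference" between AI-generated and human-provided feedback.
```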
- Valentin Hofmann, David Heineman, Ian Ma... | Sep 14th, 2025 | preprint
Language model (LM) benchmarking faces several challenges: comprehensive evaluations are costly, benchmarks often fail to measure the intended capabilities, and evaluation quality can degrade due to labeling errors and benchmark saturation. Although various strategies have been proposed to mitigate these issues, they tend to address individual aspects in isolation, neglecting broader questions about overall evaluation quality. Here, we introduce FLUID BENCHMARKING, a new evaluation approach...
- Huaiyuan Yao, Wanpeng Xu, Justin Turnau,... | Sep 1st, 2025 | preprint
Preparing high-quality instructional materials remains a labor-intensive process that often requires extensive coordination among teaching faculty, instructional designers, and teaching assistants. In this work, we present Instructional Agents, a multi-agent large language model (LLM) framework designed to automate end-to-end course material generation, including syllabus creation, lecture scripts, LaTeX-based slides, and assessments. Unlike existing AI-assisted educational tools that focus...
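The abstract describes a pipeline of coordinated LLM roles. The sketch below is a minimal, generic illustration of that pattern, not the paper's Instructional Agents implementation; `call_llm` is a hypothetical helper standing in for whatever chat-completion API is used, and the role prompts are invented.

```python
# Minimal sketch of a role-based pipeline for course-material drafting.
# `call_llm` is a hypothetical helper, and the roles/prompts are illustrative.

def call_llm(system: str, prompt: str) -> str:
    """Placeholder: send `prompt` to an LLM with a role-setting system message."""
    raise NotImplementedError("wire this to your LLM provider of choice")

def generate_course(topic: str) -> dict:
    syllabus = call_llm("You are a syllabus designer.",
                        f"Draft a 12-week syllabus for a course on {topic}.")
    slides = call_llm("You are a slide author who writes LaTeX/Beamer.",
                      f"Write Beamer slides for week 1 of this syllabus:\n{syllabus}")
    quiz = call_llm("You are an assessment writer.",
                    f"Write 5 quiz questions aligned with these slides:\n{slides}")
    return {"syllabus": syllabus, "slides": slides, "quiz": quiz}
```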
- Yue Huang, Corey Palermo, Ruitao Liu | Aug 27th, 2025 | journalArticle
- Danielle R. Thomas, Conrad Borchers, Jio... | Jun 20th, 2025 | preprint
Tutoring improves student achievement, but identifying and studying what tutoring actions are most associated with student learning at scale based on audio transcriptions is an open research problem. This present study investigates the feasibility and scalability of using generative AI to identify and evaluate specific tutor moves in real-life math tutoring. We analyze 50 randomly selected transcripts of college-student remote tutors assisting middle school students in mathematics. Using...
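A rough illustration of the general approach of coding transcripts with an LLM follows; the move labels, prompt, and `call_llm` helper are all placeholders rather than the study's actual taxonomy or protocol.

```python
# Illustrative sketch: label each tutor utterance with a coarse "tutor move".
# The label set and prompt are invented; `call_llm` is a hypothetical helper.
MOVES = ["praise_effort", "give_hint", "ask_question", "explain", "other"]

def call_llm(prompt: str) -> str:
    raise NotImplementedError("wire this to your LLM provider of choice")

def label_tutor_moves(transcript: list[str]) -> list[str]:
    labels = []
    for utterance in transcript:
        prompt = (
            "Classify the tutor utterance into exactly one of: "
            f"{', '.join(MOVES)}.\nUtterance: {utterance}\nLabel:"
        )
        raw = call_llm(prompt).strip().lower()
        labels.append(raw if raw in MOVES else "other")  # guard against free-form output
    return labels
```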
- Leonie V. D. E. Vogelsmeier, Eduardo Oli... | Jun 16th, 2025 | preprint
Large language models (LLMs) offer the potential to simulate human-like responses and behaviors, creating new opportunities for psychological science. In the context of self-regulated learning (SRL), if LLMs can reliably simulate survey responses at scale and speed, they could be used to test intervention scenarios, refine theoretical models, augment sparse datasets, and represent hard-to-reach populations. However, the validity of LLM-generated survey responses remains uncertain, with...
- Reva Schwartz, Rumman Chowdhury, Akash K... | May 24th, 2025 | preprint
Conventional AI evaluation approaches concentrated within the AI stack exhibit systemic limitations for exploring, navigating and resolving the human and societal factors that play out in real world deployment such as in education, finance, healthcare, and employment sectors. AI capability evaluations can capture detail about first-order effects, such as whether immediate system outputs are accurate, or contain toxic, biased or stereotypical content, but AI's second-order effects, i.e. any...
- Martin Elias De Simone, Federico Hernan T... | May 20th, 2025 | webpage
From Chalkboards to Chatbots: Evaluating the Impact of Generative AI on Learning Outcomes in Nigeria (English)
- Andy Clark | May 19th, 2025 | journalArticle
As human-AI collaborations become the norm, we should remind ourselves that it is our basic nature to build hybrid thinking systems – ones that fluidly incorporate non-biological resources. Recognizing this invites us to change the way we think about both the threats and promises of the coming age.
- Jadon Geathers, Yann Hicke, Colleen Chan... | May 15th, 2025 | preprint
Objective Structured Clinical Examinations (OSCEs) are widely used to assess medical students' communication skills, but scoring interview-based assessments is time-consuming and potentially subject to human bias. This study explored the potential of large language models (LLMs) to automate OSCE evaluations using the Master Interview Rating Scale (MIRS). We compared the performance of four state-of-the-art LLMs (GPT-4o, Claude 3.5, Llama 3.1, and Gemini 1.5 Pro) in evaluating OSCE...
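For readers unfamiliar with LLM-as-rater setups, the sketch below shows one minimal way to elicit rubric scores from a transcript; the item names, scale, prompt wording, and `call_llm` helper are assumptions, not the MIRS instrument or the paper's pipeline.

```python
# Hedged sketch of LLM-based rubric scoring, loosely in the spirit of the abstract.
import re

def call_llm(prompt: str) -> str:
    raise NotImplementedError("wire this to your LLM provider of choice")

EXAMPLE_ITEMS = ["opening the interview", "empathy and rapport", "summarizing"]

def score_transcript(transcript: str, items: list[str]) -> dict[str, int]:
    scores = {}
    for item in items:
        prompt = (
            f"Rate the student's '{item}' in the interview below on a 1-5 scale "
            "(1 = poor, 5 = excellent). Reply with a single integer.\n\n"
            f"{transcript}"
        )
        match = re.search(r"[1-5]", call_llm(prompt))
        scores[item] = int(match.group()) if match else 0  # 0 = unparseable reply
    return scores
```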
- Jin Wang, Wenxiang Fan | May 6th, 2025 | journalArticle
- Sahar Yarmohammadtoosky, Yiyun Zhou, Vic... | Apr 30th, 2025 | preprint
This study examines vulnerabilities in transformer-based automated short-answer grading systems used in medical education, with a focus on how these systems can be manipulated through adversarial gaming strategies. Our research identifies three main types of gaming strategies that exploit the system's weaknesses, potentially leading to false positives. To counteract these vulnerabilities, we implement several adversarial training methods designed to enhance the systems' robustness. Our...
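One generic way to operationalize adversarial training against such gaming strategies is to augment the grader's training data with deliberately gamed answers labeled as incorrect; the sketch below illustrates that idea with an invented `keyword_stuff` strategy and is not the paper's method.

```python
# Conceptual sketch of adversarial augmentation for a short-answer grader.
# Gamed variants of incorrect answers are added with the "incorrect" label so
# the grader learns not to reward keyword stuffing or verbosity.

def keyword_stuff(answer: str, rubric_keywords: list[str]) -> str:
    """Append rubric keywords without real content: one simple gaming strategy."""
    return answer + " " + " ".join(rubric_keywords)

def augment(dataset: list[tuple[str, int]], rubric_keywords: list[str]):
    """dataset: (answer, label) pairs with label 1 = correct, 0 = incorrect."""
    augmented = list(dataset)
    for answer, label in dataset:
        if label == 0:
            augmented.append((keyword_stuff(answer, rubric_keywords), 0))
    return augmented
```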
- Apr 26th, 2025 | webpage
- Apr 26th, 2025 | webpage
AI systems are no longer just specialized research tools: they’re everyday academic companions. As AIs integrate more deeply into educational environments, we need to consider important questions about learning, assessment, and skill development. Until now, most discussions have relied on surveys and controlled experiments rather than direct evidence of how students naturally integrate AI into their academic work in real settings.
- The Learning Partnership | Apr 26th, 2025 | report
- Pooya Razavi, Sonya J. Powers | Apr 9th, 2025 | preprint
Estimating item difficulty through field-testing is often resource-intensive and time-consuming. As such, there is strong motivation to develop methods that can predict item difficulty at scale using only the item content. Large Language Models (LLMs) represent a new frontier for this goal. The present research examines the feasibility of using an LLM to predict item difficulty for K-5 mathematics and reading assessment items (N = 5170). Two estimation approaches were implemented: (a) a...
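The abstract is truncated before it names the two estimation approaches, so the sketch below is only a generic illustration of eliciting a difficulty estimate from an LLM and correlating it with field-test values; the prompt, scale, and `call_llm` helper are assumptions rather than the paper's design.

```python
# Rough sketch: predict item difficulty with an LLM, then compare to field-test data.
import numpy as np

def call_llm(prompt: str) -> str:
    raise NotImplementedError("wire this to your LLM provider of choice")

def predict_difficulty(item_text: str) -> float:
    prompt = (
        "Estimate the proportion of grade-level students who would answer this "
        f"item correctly, as a number between 0 and 1.\n\nItem: {item_text}"
    )
    try:
        return float(call_llm(prompt).strip())
    except ValueError:
        return float("nan")  # unparseable reply

def evaluate(predicted: list[float], field_test: list[float]) -> float:
    """Pearson correlation between predicted and observed item p-values."""
    return float(np.corrcoef(predicted, field_test)[0, 1])
```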
- Elisabetta Mazzullo, Okan Bulut, Cole Wa... | Mar 31st, 2025 | conferencePaper