
63 resources

  • Jacob Steiss, Tamara Tate, Steve Graham,... | Sep 7th, 2023 | preprint

    Offering students formative feedback on drafts of their writing is an effective way to facilitate writing development. This study examined the ability of generative AI (i.e., ChatGPT) to provide formative feedback on students’ compositions. We compared the quality of human and AI feedback by scoring the feedback each provided on secondary student essays (n=200) on five measures of feedback quality: the degree to which feedback (a) was criteria-based, (b) provided clear directions for...

  • Melissa Bond, Hassan Khosravi, Maarten D... | Jan 22nd, 2023 | journalArticle
  • Zichao Wang, Jakob Valdez, Debshila Basu... | Jan 22nd, 2022 | bookSection
  • Peiyi Wang, Lei Li, Zhihong Shao | Jan 22nd, 2024 | preprint

    In this paper, we present an innovative process-oriented math process reward model called Math-Shepherd, which assigns a reward score to each step of math problem solutions. The training of Math-Shepherd is achieved using automatically constructed process-wise supervision data, breaking the bottleneck of heavy reliance on manual annotation in existing work. We explore the effectiveness of Math-Shepherd in two scenarios: 1) Verification: Math-Shepherd is utilized for...

  • Peiyi Wang, Lei Li, Zhihong Shao | Feb 19th, 2024 | preprint

    In this paper, we present an innovative process-oriented math process reward model called Math-Shepherd, which assigns a reward score to each step of math problem solutions. The training of Math-Shepherd is achieved using automatically constructed process-wise supervision data, breaking the bottleneck of heavy reliance on manual annotation in existing work. We explore the effectiveness of Math-Shepherd in two scenarios: 1) Verification: Math-Shepherd is utilized for...

  • Zheng Chu, Jingchang Chen, Qianglong Che... | Jan 22nd, 2024 | preprint

    Reasoning, a fundamental cognitive process integral to human intelligence, has garnered substantial interest within artificial intelligence. Notably, recent studies have revealed that chain-of-thought prompting significantly enhances LLM's reasoning capabilities, which attracts widespread attention from both academics and industry. In this paper, we systematically investigate relevant research, summarizing advanced methods through a meticulous taxonomy that offers novel perspectives....

  • Jacob Steiss, Tamara Tate, Steve Graham,... | Jun 22nd, 2024 | journalArticle

    Background: Offering students formative feedback on their writing is an effective way to facilitate writing development. Recent advances in AI (i.e., ChatGPT) may function as an automated writing evaluation tool, increasing the amount of feedback students receive and diminishing the burden on teachers to provide frequent feedback to large classes. Aims: We examined the ability of generative AI (ChatGPT) to provide formative feedback. We compared the quality of human and AI...

  • Valentin Hofmann, David Heineman, Ian Ma... | Sep 14th, 2025 | preprint

    Language model (LM) benchmarking faces several challenges: comprehensive evaluations are costly, benchmarks often fail to measure the intended capabilities, and evaluation quality can degrade due to labeling errors and benchmark saturation. Although various strategies have been proposed to mitigate these issues, they tend to address individual aspects in isolation, neglecting broader questions about overall evaluation quality. Here, we introduce FLUID BENCHMARKING, a new evaluation approach...

  • Lei Huang, Weijiang Yu, Weitao Ma | Jan 24th, 2025 | preprint

    The emergence of large language models (LLMs) has marked a significant breakthrough in natural language processing (NLP), fueling a paradigm shift in information acquisition. Nevertheless, LLMs are prone to hallucination, generating plausible yet nonfactual content. This phenomenon raises significant concerns over the reliability of LLMs in real-world information retrieval (IR) systems and has attracted intensive research to detect and mitigate such hallucinations. Given the open-ended...

  • Billy Ho Hung Cheung, Gary Kui Kai Lau, ... | Aug 29th, 2023 | journalArticle

    Large language models, in particular ChatGPT, have showcased remarkable language processing capabilities. Given the substantial workload of university medical staff, this study aims to assess the quality of multiple-choice questions (MCQs) produced by ChatGPT for use in graduate medical examinations, compared to questions written by university professoriate staffs based on standard medical textbooks.

  • Jacob Doughty, Zipiao Wan, Anishka Bompe... | Jan 29th, 2024 | conferencePaper
  • Susan Lottridge, Amy Burkhardt, Christop... | journalArticle

    Every year, millions of middle-school students write argumentative essays that are evaluated against a scoring rubric. However, the scores they receive don’t necessarily offer clear guidance on how to improve their essay or what they’ve done well. With advancements in natural language processing technology, we now have the capability to provide more detailed feedback. At this juncture, we’ve developed an artificial intelligence-supported editing tool to assist students in revising their...

  • Iddo Drori, Sarah Zhang, Reece Shuttlewo... | Aug 2nd, 2022 | journalArticle

    We demonstrate that a neural network pretrained on text and fine-tuned on code solves mathematics course problems, explains solutions, and generates questions at a human level. We automatically synthesize programs using few-shot learning and OpenAI’s Codex transformer and execute them to solve course problems at 81% automatic accuracy. We curate a dataset of questions from Massachusetts Institute of Technology (MIT)’s largest mathematics courses (Single Variable and Multivariable Calculus,...

  • Ying Xu, Dakuo Wang, Mo Yu | Jan 22nd, 2022 | journalArticle

    Question answering (QA) is a fundamental means to facilitate assessment and training of narrative comprehension skills for both machines and young children, yet there is scarcity of high-quality QA datasets carefully designed to serve this purpose. In particular, existing datasets rarely distinguish fine-grained reading skills, such as the understanding of varying narrative elements. Drawing on the reading education research, we introduce FairytaleQA, a dataset focusing on narrative...

  • Tejal Patwardhan, Rachel Dias, Elizabeth... | Oct 5th, 2025 | preprint

    We introduce GDPval, a benchmark evaluating AI model capabilities on real-world economically valuable tasks. GDPval covers the majority of U.S. Bureau of Labor Statistics Work Activities for 44 occupations across the top 9 sectors contributing to U.S. GDP (Gross Domestic Product). Tasks are constructed from the representative work of industry professionals with an average of 14 years of experience. We find that frontier model performance on GDPval is improving roughly linearly over time, and...

  • Rohan Anil, Andrew M. Dai, Orhan Firat | May 17th, 2023 | preprint

    We introduce PaLM 2, a new state-of-the-art language model that has better multilingual and reasoning capabilities and is more compute-efficient than its predecessor PaLM. PaLM 2 is a Transformer-based model trained using a mixture of objectives. Through extensive evaluations on English and multilingual language, and reasoning tasks, we demonstrate that PaLM 2 has significantly improved quality on downstream tasks across different model sizes, while simultaneously exhibiting faster and more...

  • Abhimanyu Dubey, Abhinav Jauhri, Abhinav... | Aug 15th, 2024 | preprint

    Modern artificial intelligence (AI) systems are powered by foundation models. This paper presents a new set of foundation models, called Llama 3. It is a herd of language models that natively support multilinguality, coding, reasoning, and tool usage. Our largest model is a dense Transformer with 405B parameters and a context window of up to 128K tokens. This paper presents an extensive empirical evaluation of Llama 3. We find that Llama 3 delivers comparable quality to leading language...

  • Rishi Bommasani, Drew A. Hudson, Ehsan A... | Jan 22nd, 2021 | journalArticle

    AI is undergoing a paradigm shift with the rise of models (e.g., BERT, DALL-E, GPT-3) that are trained on broad data at scale and are adaptable to a wide range of downstream tasks. We call these models foundation models to underscore their critically central yet incomplete character. This report provides a thorough account of the opportunities and risks of foundation models, ranging from their capabilities (e.g., language, vision, robotics, reasoning, human interaction) and technical...

Last update from database: 22/01/2026, 14:15 (UTC)