60 resources
-
Zichao Wang, Jakob Valdez, Debshila Basu... | Jan 22nd, 2022 | bookSection
-
Peiyi Wang, Lei Li, Zhihong Shao | Jan 22nd, 2024 | preprint
In this paper, we present an innovative process-oriented math process reward model called Math-Shepherd, which assigns a reward score to each step of math problem solutions. The training of Math-Shepherd is achieved using automatically constructed process-wise supervision data, breaking the bottleneck of heavy reliance on manual annotation in existing work. We explore the effectiveness of Math-Shepherd in two scenarios: 1) Verification: Math-Shepherd is utilized for...
-
Zheng Chu, Jingchang Chen, Qianglong Che... | Jan 22nd, 2024 | preprint
Reasoning, a fundamental cognitive process integral to human intelligence, has garnered substantial interest within artificial intelligence. Notably, recent studies have revealed that chain-of-thought prompting significantly enhances LLM's reasoning capabilities, which attracts widespread attention from both academics and industry. In this paper, we systematically investigate relevant research, summarizing advanced methods through a meticulous taxonomy that offers novel perspectives....
-
Jacob Steiss, Tamara Tate, Steve Graham,... | Jun 22nd, 2024 | journalArticle
Background: Offering students formative feedback on their writing is an effective way to facilitate writing development. Recent advances in AI (i.e., ChatGPT) may function as an automated writing evaluation tool, increasing the amount of feedback students receive and diminishing the burden on teachers to provide frequent feedback to large classes. Aims: We examined the ability of generative AI (ChatGPT) to provide formative feedback. We compared the quality of human and AI...
-
Valentin Hofmann, David Heineman, Ian Ma... | Sep 14th, 2025 | preprint
Language model (LM) benchmarking faces several challenges: comprehensive evaluations are costly, benchmarks often fail to measure the intended capabilities, and evaluation quality can degrade due to labeling errors and benchmark saturation. Although various strategies have been proposed to mitigate these issues, they tend to address individual aspects in isolation, neglecting broader questions about overall evaluation quality. Here, we introduce FLUID BENCHMARKING, a new evaluation approach...
-
Lei Huang, Weijiang Yu, Weitao Ma | Jan 24th, 2025 | preprint
The emergence of large language models (LLMs) has marked a significant breakthrough in natural language processing (NLP), fueling a paradigm shift in information acquisition. Nevertheless, LLMs are prone to hallucination, generating plausible yet nonfactual content. This phenomenon raises significant concerns over the reliability of LLMs in real-world information retrieval (IR) systems and has attracted intensive research to detect and mitigate such hallucinations. Given the open-ended...
-
Billy Ho Hung Cheung, Gary Kui Kai Lau, ... | Aug 29th, 2023 | journalArticle
Large language models, in particular ChatGPT, have showcased remarkable language processing capabilities. Given the substantial workload of university medical staff, this study aims to assess the quality of multiple-choice questions (MCQs) produced by ChatGPT for use in graduate medical examinations, compared to questions written by university professoriate staffs based on standard medical textbooks.
-
Jacob Doughty, Zipiao Wan, Anishka Bompe... | Jan 29th, 2024 | conferencePaper
-
Iddo Drori, Sarah Zhang, Reece Shuttlewo... | Aug 2nd, 2022 | journalArticle
We demonstrate that a neural network pretrained on text and fine-tuned on code solves mathematics course problems, explains solutions, and generates questions at a human level. We automatically synthesize programs using few-shot learning and OpenAI’s Codex transformer and execute them to solve course problems at 81% automatic accuracy. We curate a dataset of questions from Massachusetts Institute of Technology (MIT)’s largest mathematics courses (Single Variable and Multivariable Calculus,...
-
Ying Xu, Dakuo Wang, Mo Yu | Jan 22nd, 2022 | journalArticle
Question answering (QA) is a fundamental means to facilitate assessment and training of narrative comprehension skills for both machines and young children, yet there is scarcity of high-quality QA datasets carefully designed to serve this purpose. In particular, existing datasets rarely distinguish fine-grained reading skills, such as the understanding of varying narrative elements. Drawing on the reading education research, we introduce FairytaleQA, a dataset focusing on narrative...
-
Tejal Patwardhan, Rachel Dias, Elizabeth... | Oct 5th, 2025 | preprint
We introduce GDPval, a benchmark evaluating AI model capabilities on real-world economically valuable tasks. GDPval covers the majority of U.S. Bureau of Labor Statistics Work Activities for 44 occupations across the top 9 sectors contributing to U.S. GDP (Gross Domestic Product). Tasks are constructed from the representative work of industry professionals with an average of 14 years of experience. We find that frontier model performance on GDPval is improving roughly linearly over time, and...
-
Rohan Anil, Andrew M. Dai, Orhan Firat | May 17th, 2023 | preprint
We introduce PaLM 2, a new state-of-the-art language model that has better multilingual and reasoning capabilities and is more compute-efficient than its predecessor PaLM. PaLM 2 is a Transformer-based model trained using a mixture of objectives. Through extensive evaluations on English and multilingual language, and reasoning tasks, we demonstrate that PaLM 2 has significantly improved quality on downstream tasks across different model sizes, while simultaneously exhibiting faster and more...
-
Abhimanyu Dubey, Abhinav Jauhri, Abhinav... | Aug 15th, 2024 | preprint
Modern artificial intelligence (AI) systems are powered by foundation models. This paper presents a new set of foundation models, called Llama 3. It is a herd of language models that natively support multilinguality, coding, reasoning, and tool usage. Our largest model is a dense Transformer with 405B parameters and a context window of up to 128K tokens. This paper presents an extensive empirical evaluation of Llama 3. We find that Llama 3 delivers comparable quality to leading language...
-
Rishi Bommasani, Drew A. Hudson, Ehsan A... | Jan 22nd, 2021 | journalArticle
AI is undergoing a paradigm shift with the rise of models (e.g., BERT, DALL-E, GPT-3) that are trained on broad data at scale and are adaptable to a wide range of downstream tasks. We call these models foundation models to underscore their critically central yet incomplete character. This report provides a thorough account of the opportunities and risks of foundation models, ranging from their capabilities (e.g., language, vision, robotics, reasoning, human interaction) and technical...
-
Irina Jurenka, Markus Kunesch, Kevin McK... | May 14th, 2024 | report