Results – Evidence Library – Artificial Intelligence in Measurement and Education

Valentin Hofmann, David Heineman, Ian Ma... | Sep 14th, 2025 | preprint

Language model (LM) benchmarking faces several challenges: comprehensive evaluations are costly, benchmarks often fail to measure the intended capabilities, and evaluation quality can degrade due to labeling errors and benchmark saturation. Although various strategies have been proposed to mitigate these issues, they tend to address individual aspects in isolation, neglecting broader questions about overall evaluation quality. Here, we introduce FLUID BENCHMARKING, a new evaluation approach...

On the Opportunities and Risks of Foundation Models

Rishi Bommasani, Drew A. Hudson, Ehsan A... | May 2nd, 2021 | journalArticle

AI is undergoing a paradigm shift with the rise of models (e.g., BERT, DALL-E, GPT-3) that are trained on broad data at scale and are adaptable to a wide range of downstream tasks. We call these models foundation models to underscore their critically central yet incomplete character. This report provides a thorough account of the opportunities and risks of foundation models, ranging from their capabilities (e.g., language, vision, robotics, reasoning, human interaction) and technical...

On the Opportunities and Risks of Foundation Models

Rishi Bommasani, Drew A. Hudson, Ehsan A... | Jul 12th, 2022 | preprint

AI is undergoing a paradigm shift with the rise of models (e.g., BERT, DALL-E, GPT-3) that are trained on broad data at scale and are adaptable to a wide range of downstream tasks. We call these models foundation models to underscore their critically central yet incomplete character. This report provides a thorough account of the opportunities and risks of foundation models, ranging from their capabilities (e.g., language, vision, robotics, reasoning, human interaction) and technical...
