Full Library – Evidence Library – Artificial Intelligence in Measurement and Education

Can AI-Generated Text be Reliably Detected?

Vinu Sankar Sadasivan, Aounon Kumar, Sri... | Feb 19th, 2024 | preprint

The unregulated use of LLMs can potentially lead to malicious consequences such as plagiarism, generating fake news, spamming, etc. Therefore, reliable detection of AI-generated text can be critical to ensure the responsible use of LLMs. Recent works attempt to tackle this problem either using certain model signatures present in the generated text outputs or by applying watermarking techniques that imprint specific patterns onto them. In this paper, we show that these detectors are not...

Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations

Peiyi Wang, Lei Li, Zhihong Shao | Feb 19th, 2024 | preprint

In this paper, we present an innovative process-oriented math process reward model called Math-Shepherd, which assigns a reward score to each step of math problem solutions. The training of Math-Shepherd is achieved using automatically constructed process-wise supervision data, breaking the bottleneck of heavy reliance on manual annotation in existing work. We explore the effectiveness of Math-Shepherd in two scenarios: 1) Verification: Math-Shepherd is utilized for...

Generative AI Can Harm Learning

Hamsa Bastani, Osbert Bastani, Alp Sungu... | Jul 1st, 2024 | preprint

Generative artificial intelligence (AI) is poised to revolutionize how humans work, and has already demonstrated promise in significantly improving human productivity. However, a key remaining question is how generative AI affects learning, namely, how humans acquire new skills as they perform tasks. This kind of skill learning is critical to long-term productivity gains, especially in domains where generative AI is fallible and human experts must check its outputs. We study the impact of...

Leveraging Large Language Models for NLG Evaluation: A Survey

Zhen Li, Xiaohan Xu, Tao Shen | Jul 1st, 2024 | journalArticle

In the rapidly evolving domain of Natural Language Generation (NLG) evaluation, introducing Large Language Models (LLMs) has opened new avenues for assessing generated content quality, e.g., coherence, creativity, and context relevance. This survey aims to provide a thorough overview of leveraging LLMs for NLG evaluation, a burgeoning area that lacks a systematic analysis. We propose a coherent taxonomy for organizing existing LLM-based evaluation metrics, offering a structured framework to...

Adapting to AI: how to understand, prepare for, and innovate in a changing landscape

The Chronicle of Higher Educ... | Jul 1st, 2024 | report

Can Artificial Intelligence Technology Promote the Improvement of Student Learning Outcomes? Meta Analysis Based on 50 Experimental and Quasi Experimental Studies

Lijuan Wang, Miaomiao Zhao | Jul 1st, 2024 | conferencePaper

Bridging the Novice-Expert Gap via Models of Decision-Making: A Case Study on Remediating Math Mistakes

Rose E. Wang, Qingyang Zhang, Carly Robi... | Jul 1st, 2024 | preprint

Scaling high-quality tutoring remains a major challenge in education. Due to growing demand, many platforms employ novice tutors who, unlike experienced educators, struggle to address student mistakes and thus fail to seize prime learning opportunities. Our work explores the potential of large language models (LLMs) to close the novice-expert knowledge gap in remediating math mistakes. We contribute Bridge, a method that uses cognitive task analysis to translate an expert's latent thought...

Hallucination is Inevitable: An Innate Limitation of Large Language Models

Ziwei Xu, Sanjay Jain, Mohan Kankanhalli... | Jul 1st, 2024 | journalArticle

Hallucination has been widely recognized to be a significant drawback for large language models (LLMs). There have been many works that attempt to reduce the extent of hallucination. These efforts have mostly been empirical so far, which cannot answer the fundamental question whether it can be completely eliminated. In this paper, we formalize the problem and show that it is impossible to eliminate hallucination in LLMs. Specifically, we define a formal world where hallucination is defined...

Meet LLM360: The First Fully Open-Source and Transparent Large Language Models (LLMs)

Sana Hassan | Dec 13th, 2023 | blogPost

Open-source Large Language Models (LLMs) such as LLaMA, Falcon, and Mistral offer a range of choices for AI professionals and scholars. Yet, the majority of these LLMs have only made available select components like the end-model weights or inference scripts, with technical documents often narrowing their focus to broader design aspects and basic metrics. This approach restricts advances in the field by reducing clarity in the training methodologies of LLMs, leading to repeated efforts by...

Can AI Be Too Good to Use?

Andy Fell | Dec 12th, 2023 | webpage

Much of the discussion around implementing artificial intelligence systems focuses on whether an AI application is “trustworthy”: Does it produce useful, reliable results, free of bias, while ensuring data privacy? But a new paper published Dec. 7 in Frontiers in Artificial Intelligence poses a different question: What if an AI is just too good?

Navigating the generative AI era: Introducing the AI assessment scale for ethical GenAI assessment

Mike Perkins, Leon Furze, Jasper Roe | Dec 12th, 2023 | webpage

Recent developments in Generative Artificial Intelligence (GenAI) have created a paradigm shift in multiple areas of society, and the use of these technologies is likely to become a defining feature of education in coming decades. GenAI offers transformative pedagogical opportunities, while simultaneously posing ethical and academic challenges. Against this backdrop, we outline a practical, simple, and sufficiently comprehensive tool to allow for the integration of GenAI tools into...

Can AI Provide Useful Holistic Essay Scoring?

Tamara Tate, Jacob Steiss, Drew Bailey | Dec 5th, 2023 | preprint

Researchers have sought for decades to automate holistic essay scoring. Over the years, these programs have improved significantly. However, accuracy requires significant amounts of training on human-scored texts—reducing the expediency and usefulness of such programs for routine uses by teachers across the nation on non-standardized prompts. This study analyzes the output of multiple versions of ChatGPT scoring of secondary student essays from three extant corpora and compares it to quality...

Beyond Statistical Similarity: Rethinking Metrics for Deep Generative Models in Engineering Design

Lyle Regenwetter, Akash Srivastava, Dan ... | Dec 1st, 2023 | journalArticle

Math Education with Large Language Models: Peril or Promise?

Harsh Kumar, David M. Rothschild, Daniel... | Nov 22nd, 2023 | preprint

The widespread availability of large language models (LLMs) has provoked both fear and excitement in the domain of education. On one hand, there is the concern that students will offload their coursework to LLMs, limiting what they themselves learn. On the other hand, there is the hope that LLMs might serve as scalable, personalized tutors. Here we conduct a large, pre-registered experiment involving 1200 participants to investigate how exposure to LLM-based explanations affects learning. In the...

ChaTA: Towards an Intelligent Question-Answer Teaching Assistant using Open-Source LLMs

Yann Hicke, Anmol Agarwal, Qianou Ma | Nov 13th, 2023 | preprint

Responding to the thousands of student questions on online QA platforms each semester has a considerable human cost, particularly in computing courses with rapidly growing enrollments. To address the challenges of scalable and intelligent question-answering (QA), we introduce an innovative solution that leverages open-source Large Language Models (LLMs) from the LLaMA-2 family to ensure data privacy. Our approach combines augmentation techniques such as retrieval augmented generation (RAG),...

A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions

Lei Huang, Weijiang Yu, Weitao Ma | Nov 9th, 2023 | preprint

The emergence of large language models (LLMs) has marked a significant breakthrough in natural language processing (NLP), leading to remarkable advancements in text understanding and generation. Nevertheless, alongside these strides, LLMs exhibit a critical tendency to produce hallucinations, resulting in content that is inconsistent with real-world facts or user inputs. This phenomenon poses substantial challenges to their practical deployment and raises concerns over the reliability of...

Sources of Hallucination by Large Language Models on Inference Tasks

Nick McKenna, Tianyi Li, Liang Cheng | Oct 22nd, 2023 | preprint

Large Language Models (LLMs) are claimed to be capable of Natural Language Inference (NLI), necessary for applied tasks like question answering and summarization. We present a series of behavioral studies on several LLM families (LLaMA, GPT-3.5, and PaLM) which probe their behavior using controlled experiments. We establish two biases originating from pretraining which predict much of their behavior, and show that these are major sources of hallucination in generative LLMs. First,...

Evaluating Evaluation Metrics: A Framework for Analyzing NLG Evaluation Metrics using Measurement Theory

Ziang Xiao, Susu Zhang, Vivian Lai | Oct 22nd, 2023 | preprint

We address a fundamental challenge in Natural Language Generation (NLG) model evaluation -- the design and evaluation of evaluation metrics. Recognizing the limitations of existing automatic metrics and noises from how current human evaluation was conducted, we propose MetricEval, a framework informed by measurement theory, the foundation of educational test design, for conceptualizing and evaluating the reliability and validity of NLG evaluation metrics. The framework formalizes the source...

Stanford University and Research Institutions Release Model Transparency Index

Igor Nowacki | Oct 19th, 2023 | blogPost

Stanford University and Research Institutions Release Model Transparency Index TS2 SPACE
