Results – Evidence Library – Artificial Intelligence in Measurement and Education

Beyond Statistical Similarity: Rethinking Metrics for Deep Generative Models in Engineering Design

Lyle Regenwetter, Akash Srivastava, Dan ... | Dec 14th, 2023 | journalArticle

ChaTA: Towards an Intelligent Question-Answer Teaching Assistant using Open-Source LLMs

Yann Hicke, Anmol Agarwal, Qianou Ma | Nov 13th, 2023 | preprint

Responding to the thousands of student questions on online QA platforms each semester has a considerable human cost, particularly in computing courses with rapidly growing enrollments. To address the challenges of scalable and intelligent question-answering (QA), we introduce an innovative solution that leverages open-source Large Language Models (LLMs) from the LLaMA-2 family to ensure data privacy. Our approach combines augmentation techniques such as retrieval augmented generation (RAG),...

A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions

Lei Huang, Weijiang Yu, Weitao Ma | Nov 9th, 2023 | preprint

The emergence of large language models (LLMs) has marked a significant breakthrough in natural language processing (NLP), leading to remarkable advancements in text understanding and generation. Nevertheless, alongside these strides, LLMs exhibit a critical tendency to produce hallucinations, resulting in content that is inconsistent with real-world facts or user inputs. This phenomenon poses substantial challenges to their practical deployment and raises concerns over the reliability of...

Sources of Hallucination by Large Language Models on Inference Tasks

Nick McKenna, Tianyi Li, Liang Cheng | Oct 22nd, 2023 | preprint

Large Language Models (LLMs) are claimed to be capable of Natural Language Inference (NLI), necessary for applied tasks like question answering and summarization. We present a series of behavioral studies on several LLM families (LLaMA, GPT-3.5, and PaLM) which probe their behavior using controlled experiments. We establish two biases originating from pretraining which predict much of their behavior, and show that these are major sources of hallucination in generative LLMs. First,...

Evaluating Evaluation Metrics: A Framework for Analyzing NLG Evaluation Metrics using Measurement Theory

Ziang Xiao, Susu Zhang, Vivian Lai | Oct 22nd, 2023 | preprint

We address a fundamental challenge in Natural Language Generation (NLG) model evaluation -- the design and evaluation of evaluation metrics. Recognizing the limitations of existing automatic metrics and noises from how current human evaluation was conducted, we propose MetricEval, a framework informed by measurement theory, the foundation of educational test design, for conceptualizing and evaluating the reliability and validity of NLG evaluation metrics. The framework formalizes the source...

The Internal State of an LLM Knows When It's Lying

Amos Azaria, Tom Mitchell | Oct 17th, 2023 | preprint

While Large Language Models (LLMs) have shown exceptional performance in various tasks, one of their most prominent drawbacks is generating inaccurate or false information with a confident tone. In this paper, we provide evidence that the LLM's internal state can be used to reveal the truthfulness of statements. This includes both statements provided to the LLM, and statements that the LLM itself generates. Our approach is to train a classifier that outputs the probability that a statement...

A review of the explainability and safety of conversational agents for mental health to identify avenues for improvement

Surjodeep Sarkar, Manas Gaur, Lujie Kare... | Oct 12th, 2023 | journalArticle

Virtual Mental Health Assistants (VMHAs) continuously evolve to support the overloaded global healthcare system, which receives approximately 60 million primary care visits and 6 million emergency room visits annually. These systems, developed by clinical psychologists, psychiatrists, and AI researchers, are designed to aid in Cognitive Behavioral Therapy (CBT). The main focus of VMHAs is to provide relevant information to mental health professionals (MHPs) and engage in meaningful...

eXplainable AI with GPT4 for story analysis and generation: A novel framework for diachronic sentiment analysis

Jon Chun, Katherine Elkins | Oct 11th, 2023 | journalArticle

SelfCheck: Using LLMs to Zero-Shot Check Their Own Step-by-Step Reasoning

Ning Miao, Yee Whye Teh, Tom Rainforth | Oct 5th, 2023 | preprint

The recent progress in large language models (LLMs), especially the invention of chain-of-thought prompting, has made it possible to automatically answer questions by stepwise reasoning. However, when faced with more complicated problems that require non-linear thinking, even the strongest LLMs make mistakes. To address this, we explore whether LLMs are able to recognize errors in their own step-by-step reasoning, without resorting to external resources. To this end, we propose SelfCheck, a...

Pushing Mixture of Experts to the Limit: Extremely Parameter Efficient MoE for Instruction Tuning

Ted Zadouri, Ahmet Üstün, Arash Ahmadian... | Sep 11th, 2023 | preprint

The Mixture of Experts (MoE) is a widely known neural architecture where an ensemble of specialized sub-models optimizes overall performance with a constant computational cost. However, conventional MoEs pose challenges at scale due to the need to store all experts in memory. In this paper, we push MoE to the limit. We propose extremely parameter-efficient MoE by uniquely combining MoE architecture with lightweight experts. Our MoE architecture outperforms standard parameter-efficient...

From Sparse to Dense: GPT-4 Summarization with Chain of Density Prompting

Griffin Adams, Alexander Fabbri, Faisal ... | Sep 8th, 2023 | preprint

Selecting the "right" amount of information to include in a summary is a difficult task. A good summary should be detailed and entity-centric without being overly dense and hard to follow. To better understand this tradeoff, we solicit increasingly dense GPT-4 summaries with what we refer to as a "Chain of Density" (CoD) prompt. Specifically, GPT-4 generates an initial entity-sparse summary before iteratively incorporating missing salient entities without increasing the length. Summaries...

Approximating Online Human Evaluation of Social Chatbots with Prompting

Ekaterina Svikhnushina, Pearl Pu | Aug 25th, 2023 | preprint

As conversational models become increasingly available to the general public, users are engaging with this technology in social interactions. Such unprecedented interaction experiences may pose considerable social and psychological risks to the users unless the technology is properly controlled. This highlights the need for scalable and robust evaluation metrics for conversational chatbots. Existing evaluation metrics aim to automate offline user evaluation and approximate human judgment of...

ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate

Chi-Min Chan, Weize Chen, Yusheng Su | Aug 14th, 2023 | preprint

Text evaluation has historically posed significant challenges, often demanding substantial labor and time cost. With the emergence of large language models (LLMs), researchers have explored LLMs' potential as alternatives for human evaluation. While these single-agent-based approaches show promise, experimental results suggest that further advancements are needed to bridge the gap between their current effectiveness and human-level evaluation quality. Recognizing that best practices of human...

The Power of Generative AI: A Review of Requirements, Models, Input–Output Formats, Evaluation Metrics, and Challenges

Ajay Bandi, Pydi Venkata Satya Ramesh Ad... | Jul 31st, 2023 | journalArticle

Generative artificial intelligence (AI) has emerged as a powerful technology with numerous applications in various domains. There is a need to identify the requirements and evaluation metrics for generative AI models designed for specific tasks. This research aims to investigate the fundamental aspects of generative AI systems, including their requirements, models, input–output formats, and evaluation metrics. The study addresses key research questions and presents...

Artificial intelligence and the future of evaluation education: Possibilities and prototypes

Zach Tilton, John M. LaVelle, Tian Ford,... | Jun 14th, 2023 | journalArticle

Advancements in Artificial Intelligence (AI) signal a paradigmatic shift with the potential for transforming many aspects of society, including evaluation education, with implications for subsequent evaluation practice. This article explores the potential implications of AI for evaluator and evaluation education. Specifically, the article discusses key issues in evaluation education including equitable language access to evaluation education, navigating program, social science, and...

QLoRA: Efficient Finetuning of Quantized LLMs

Tim Dettmers, Artidoro Pagnoni, Ari Holt... | May 23rd, 2023 | preprint

We present QLoRA, an efficient finetuning approach that reduces memory usage enough to finetune a 65B parameter model on a single 48GB GPU while preserving full 16-bit finetuning task performance. QLoRA backpropagates gradients through a frozen, 4-bit quantized pretrained language model into Low Rank Adapters (LoRA). Our best model family, which we name Guanaco, outperforms all previous openly released models on the Vicuna benchmark, reaching 99.3% of the performance level of ChatGPT while...

G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment

Yang Liu, Dan Iter, Yichong Xu | May 23rd, 2023 | preprint

The quality of texts generated by natural language generation (NLG) systems is hard to measure automatically. Conventional reference-based metrics, such as BLEU and ROUGE, have been shown to have relatively low correlation with human judgments, especially for tasks that require creativity and diversity. Recent studies suggest using large language models (LLMs) as reference-free metrics for NLG evaluation, which have the benefit of being applicable to new tasks that lack human references....

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Jason Wei, Xuezhi Wang, Dale Schuurmans,... | Jan 10th, 2023 | preprint

We explore how generating a chain of thought -- a series of intermediate reasoning steps -- significantly improves the ability of large language models to perform complex reasoning. In particular, we show how such reasoning abilities emerge naturally in sufficiently large language models via a simple method called chain of thought prompting, where a few chain of thought demonstrations are provided as exemplars in prompting. Experiments on three large language models show that chain of...

GPTScore: Evaluate as You Desire

Jinlan Fu, See-Kiong Ng, Zhengbao Jiang,... | Oct 14th, 2023 | journalArticle

Generative Artificial Intelligence (AI) has enabled the development of sophisticated models that are capable of producing high-caliber text, images, and other outputs through the utilization of large pre-trained models. Nevertheless, assessing the quality of the generation is an even more arduous task than the generation itself, and this issue has not been given adequate consideration recently. This paper proposes a novel evaluation framework, GPTScore, which utilizes the emergent abilities...

Applying Large Language Models and Chain-of-Thought for Automatic Scoring

Gyeong-Geon Lee, Ehsan Latif, Xuansheng ... | Oct 14th, 2023 | journalArticle

This study investigates the application of large language models (LLMs), specifically GPT-3.5 and GPT-4, with Chain-of-Thought (CoT) in the automatic scoring of student-written responses to science assessments. We focused on overcoming the challenges of accessibility, technical complexity, and lack of explainability that have previously limited the use of artificial intelligence-based automatic scoring tools among researchers and educators. With a testing dataset comprising six assessment...
