- Yi Zheng, Steven Nydick, Sijia Huang | Apr 12th, 2023 | Conference Paper
The recent surge of machine learning (ML) has impacted many disciplines, including educational and psychological measurement (hereafter shortened as measurement, “M”). The measurement literature has seen a rapid growth in studies that explore using ML methods to solve measurement problems. However, there exist gaps between the typical paradigm of ML and fundamental principles of measurement. The MxML project was created to explore how the measurement community might potentially redefine the...
- Ziang Xiao, Susu Zhang, Vivian Lai | Oct 22nd, 2023 | Preprint
We address a fundamental challenge in Natural Language Generation (NLG) model evaluation -- the design and evaluation of evaluation metrics. Recognizing the limitations of existing automatic metrics and noises from how current human evaluation was conducted, we propose MetricEval, a framework informed by measurement theory, the foundation of educational test design, for conceptualizing and evaluating the reliability and validity of NLG evaluation metrics. The framework formalizes the source...
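As an informal illustration of the two measurement-theoretic quantities the abstract names, the sketch below computes a test-retest style reliability estimate and a convergent-validity correlation for a hypothetical metric. The scores are placeholder values and this is not MetricEval's actual formalization.

```python
# Illustration only: placeholder scores for five hypothetical NLG systems,
# not MetricEval's formal model of measurement error.
from scipy.stats import pearsonr, spearmanr

metric_run_1 = [0.71, 0.42, 0.88, 0.35, 0.60]   # metric scores, first measurement
metric_run_2 = [0.69, 0.47, 0.85, 0.31, 0.64]   # same systems, repeated measurement
human_scores = [4.1, 2.8, 4.6, 2.2, 3.7]        # mean human ratings (1-5 scale)

reliability, _ = pearsonr(metric_run_1, metric_run_2)   # consistency of the metric with itself
validity, _ = spearmanr(metric_run_1, human_scores)     # agreement with human judgments
print(f"reliability={reliability:.2f}, validity={validity:.2f}")
```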
- Rose E. Wang, Qingyang Zhang, Carly Robi... | Dec 27th, 2024 | Preprint
Scaling high-quality tutoring remains a major challenge in education. Due to growing demand, many platforms employ novice tutors who, unlike experienced educators, struggle to address student mistakes and thus fail to seize prime learning opportunities. Our work explores the potential of large language models (LLMs) to close the novice-expert knowledge gap in remediating math mistakes. We contribute Bridge, a method that uses cognitive task analysis to translate an expert's latent thought...
- Tianyi Zhang, Varsha Kishore, Felix Wu | Feb 24th, 2020 | Preprint
We propose BERTScore, an automatic evaluation metric for text generation. Analogously to common metrics, BERTScore computes a similarity score for each token in the candidate sentence with each token in the reference sentence. However, instead of exact matches, we compute token similarity using contextual embeddings. We evaluate using the outputs of 363 machine translation and image captioning systems. BERTScore correlates better with human judgments and provides stronger model selection...
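Since the abstract describes the scoring mechanism, a minimal sketch of BERTScore-style greedy matching follows. The embeddings are random stand-ins for contextual token embeddings from a BERT-like encoder; the released bert-score package remains the reference implementation.

```python
# Greedy-matching F1 over L2-normalized token embeddings (illustration only).
import numpy as np

def greedy_match_f1(cand_emb: np.ndarray, ref_emb: np.ndarray) -> float:
    """cand_emb: (m, d), ref_emb: (n, d) L2-normalized token embeddings."""
    sim = cand_emb @ ref_emb.T              # pairwise cosine similarities
    recall = sim.max(axis=0).mean()         # each reference token -> best candidate match
    precision = sim.max(axis=1).mean()      # each candidate token -> best reference match
    return 2 * precision * recall / (precision + recall)

rng = np.random.default_rng(0)
def norm(x): return x / np.linalg.norm(x, axis=1, keepdims=True)
cand, ref = norm(rng.normal(size=(7, 768))), norm(rng.normal(size=(9, 768)))  # random stand-ins
print(round(greedy_match_f1(cand, ref), 4))
```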
- Chi-Min Chan, Weize Chen, Yusheng Su | Aug 14th, 2023 | Preprint
Text evaluation has historically posed significant challenges, often demanding substantial labor and time cost. With the emergence of large language models (LLMs), researchers have explored LLMs' potential as alternatives for human evaluation. While these single-agent-based approaches show promise, experimental results suggest that further advancements are needed to bridge the gap between their current effectiveness and human-level evaluation quality. Recognizing that best practices of human...
- Joshua Wilson, Fan Zhang, Corey Palermo,... | Apr 1st, 2024 | Journal Article
This study examined middle school students' perceptions of an automated writing evaluation (AWE) system, MI Write. We summarize students' perceptions of MI Write's usability, usefulness, and desirability both quantitatively and qualitatively. We then estimate hierarchical entry regression models that account for district context, classroom climate, demographic factors (i.e., gender, special education status, limited English proficiency status, socioeconomic status, grade), students'...
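The "hierarchical entry regression" mentioned in the abstract can be sketched as blockwise OLS in which predictor blocks are entered in stages and the change in R² is examined at each step. The variable names and synthetic data below are hypothetical, not the study's.

```python
# Sketch of blockwise (hierarchical entry) regression with synthetic data and
# hypothetical variable names -- not the study's actual data or model.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "usability": rng.normal(3.5, 0.8, n),            # perception score (outcome)
    "district": rng.choice(["A", "B", "C"], n),      # block 1: district context
    "classroom_climate": rng.normal(0.0, 1.0, n),    # block 2: classroom climate
    "gender": rng.choice(["F", "M"], n),             # block 3: demographics
    "ses": rng.normal(0.0, 1.0, n),
    "grade": rng.choice([6, 7, 8], n),
})

blocks = [
    "C(district)",
    "C(district) + classroom_climate",
    "C(district) + classroom_climate + C(gender) + ses + C(grade)",
]
prev_r2 = 0.0
for i, rhs in enumerate(blocks, start=1):
    fit = smf.ols(f"usability ~ {rhs}", data=df).fit()
    print(f"block {i}: R2 = {fit.rsquared:.3f} (delta = {fit.rsquared - prev_r2:.3f})")
    prev_r2 = fit.rsquared
```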
- Isabel O. Gallegos, Ryan A. Rossi, Joe B... | Dec 27th, 2023 | Preprint
Rapid advancements of large language models (LLMs) have enabled the processing, understanding, and generation of human-like text, with increasing integration into systems that touch our social sphere. Despite this success, these models can learn, perpetuate, and amplify harmful social biases. In this paper, we present a comprehensive survey of bias evaluation and mitigation techniques for LLMs. We first consolidate, formalize, and expand notions of social bias and fairness in natural...
- Hugh Zhang, Jeff Da, Dean Lee | May 3rd, 2024 | Preprint
Large language models (LLMs) have achieved impressive success on many benchmarks for mathematical reasoning. However, there is growing concern that some of this performance actually reflects dataset contamination, where data closely resembling benchmark questions leaks into the training data, instead of true reasoning ability. To investigate this claim rigorously, we commission Grade School Math 1000 (GSM1k). GSM1k is designed to mirror the style and complexity of the established GSM8k...
- Long Ouyang, Jeff Wu, Xu Jiang | Mar 4th, 2022 | Preprint
Making language models bigger does not inherently make them better at following a user's intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users. In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback. Starting with a set of labeler-written prompts and prompts submitted through...
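Fine-tuning with human feedback of the kind described here typically involves a reward model trained on pairwise human preference comparisons. The sketch below shows that pairwise objective with random placeholder reward values; it is not the paper's code.

```python
# Pairwise reward-modeling objective used in RLHF-style pipelines: the reward
# for the human-preferred response should exceed the reward for the rejected
# one. Rewards here are random placeholders for reward-model outputs.
import torch
import torch.nn.functional as F

reward_chosen = torch.randn(8, requires_grad=True)    # r(x, y_preferred) for a batch of 8 pairs
reward_rejected = torch.randn(8, requires_grad=True)  # r(x, y_rejected)

loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()  # -log sigmoid(r_w - r_l)
loss.backward()                                               # gradients would update the reward model
print(float(loss))
```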
- Abhimanyu Dubey, Abhinav Jauhri, Abhinav... | Aug 15th, 2024 | Preprint
Modern artificial intelligence (AI) systems are powered by foundation models. This paper presents a new set of foundation models, called Llama 3. It is a herd of language models that natively support multilinguality, coding, reasoning, and tool usage. Our largest model is a dense Transformer with 405B parameters and a context window of up to 128K tokens. This paper presents an extensive empirical evaluation of Llama 3. We find that Llama 3 delivers comparable quality to leading language...
- Rohan Anil, Andrew M. Dai, Orhan Firat | May 17th, 2023 | Preprint
We introduce PaLM 2, a new state-of-the-art language model that has better multilingual and reasoning capabilities and is more compute-efficient than its predecessor PaLM. PaLM 2 is a Transformer-based model trained using a mixture of objectives. Through extensive evaluations on English and multilingual language, and reasoning tasks, we demonstrate that PaLM 2 has significantly improved quality on downstream tasks across different model sizes, while simultaneously exhibiting faster and more...
- Hugo Touvron, Louis Martin, Kevin Stone,... | Jul 19th, 2023 | Preprint
In this work, we develop and release Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. Our fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases. Our models outperform open-source chat models on most benchmarks we tested, and based on our human evaluations for helpfulness and safety, may be a suitable substitute for closed-source models. We provide a detailed description of our...