-
Xiner Liu, Andrés Zambrano, Ryan Baker | Mar 5th, 2025 | Journal Article
This study explores the potential of the large language model GPT-4 as an automated tool for qualitative data analysis by educational researchers, exploring which techniques are most successful for different types of constructs. Specifically, we assess three different prompt engineering strategies (Zero-shot, Few-shot, and Few-shot with contextual information) as well as the use of embeddings. We do so in the context of qualitatively coding three distinct educational datasets: Algebra I...
-
Isabel O. Gallegos, Ryan A. Rossi, Joe B... | Dec 14th, 2023 | Preprint
Rapid advancements of large language models (LLMs) have enabled the processing, understanding, and generation of human-like text, with increasing integration into systems that touch our social sphere. Despite this success, these models can learn, perpetuate, and amplify harmful social biases. In this paper, we present a comprehensive survey of bias evaluation and mitigation techniques for LLMs. We first consolidate, formalize, and expand notions of social bias and fairness in natural...
-
Hugh Zhang, Jeff Da, Dean Lee | May 3rd, 2024 | Preprint
Large language models (LLMs) have achieved impressive success on many benchmarks for mathematical reasoning. However, there is growing concern that some of this performance actually reflects dataset contamination, where data closely resembling benchmark questions leaks into the training data, instead of true reasoning ability. To investigate this claim rigorously, we commission Grade School Math 1000 (GSM1k). GSM1k is designed to mirror the style and complexity of the established GSM8k...
-
Jacob Doughty, Zipiao Wan, Anishka Bompe... | Jan 29th, 2024 | Conference Paper
-
Iddo Drori, Sarah Zhang, Reece Shuttlewo... | Aug 2nd, 2022 | Journal Article
We demonstrate that a neural network pretrained on text and fine-tuned on code solves mathematics course problems, explains solutions, and generates questions at a human level. We automatically synthesize programs using few-shot learning and OpenAI’s Codex transformer and execute them to solve course problems at 81% automatic accuracy. We curate a dataset of questions from Massachusetts Institute of Technology (MIT)’s largest mathematics courses (Single Variable and Multivariable Calculus,...
-
Ying Xu, Dakuo Wang, Mo Yu | Dec 14th, 2022 | Journal Article
Question answering (QA) is a fundamental means to facilitate assessment and training of narrative comprehension skills for both machines and young children, yet there is a scarcity of high-quality QA datasets carefully designed to serve this purpose. In particular, existing datasets rarely distinguish fine-grained reading skills, such as the understanding of varying narrative elements. Drawing on reading education research, we introduce FairytaleQA, a dataset focusing on narrative...
-
Long Ouyang, Jeff Wu, Xu Jiang | Mar 4th, 2022 | Preprint
Making language models bigger does not inherently make them better at following a user's intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users. In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback. Starting with a set of labeler-written prompts and prompts submitted through...
-
Rishi Bommasani, Drew A. Hudson, Ehsan A... | Dec 14th, 2021 | Journal Article
AI is undergoing a paradigm shift with the rise of models (e.g., BERT, DALL-E, GPT-3) that are trained on broad data at scale and are adaptable to a wide range of downstream tasks. We call these models foundation models to underscore their critically central yet incomplete character. This report provides a thorough account of the opportunities and risks of foundation models, ranging from their capabilities (e.g., language, vision, robotics, reasoning, human interaction) and technical...
-
Rishi Bommasani, Drew A. Hudson, Ehsan A... | Jul 12th, 2022 | Preprint
AI is undergoing a paradigm shift with the rise of models (e.g., BERT, DALL-E, GPT-3) that are trained on broad data at scale and are adaptable to a wide range of downstream tasks. We call these models foundation models to underscore their critically central yet incomplete character. This report provides a thorough account of the opportunities and risks of foundation models, ranging from their capabilities (e.g., language, vision, robotics, reasoning, human interaction) and technical...
-
Abhimanyu Dubey, Abhinav Jauhri, Abhinav... | Aug 15th, 2024 | Preprint
Modern artificial intelligence (AI) systems are powered by foundation models. This paper presents a new set of foundation models, called Llama 3. It is a herd of language models that natively support multilinguality, coding, reasoning, and tool usage. Our largest model is a dense Transformer with 405B parameters and a context window of up to 128K tokens. This paper presents an extensive empirical evaluation of Llama 3. We find that Llama 3 delivers comparable quality to leading language...
-
Rohan Anil, Andrew M. Dai, Orhan Firat | May 17th, 2023 | Preprint
We introduce PaLM 2, a new state-of-the-art language model that has better multilingual and reasoning capabilities and is more compute-efficient than its predecessor PaLM. PaLM 2 is a Transformer-based model trained using a mixture of objectives. Through extensive evaluations on English and multilingual language, and reasoning tasks, we demonstrate that PaLM 2 has significantly improved quality on downstream tasks across different model sizes, while simultaneously exhibiting faster and more...
-
Hugo Touvron, Louis Martin, Kevin Stone,... | Jul 19th, 2023 | Preprint
In this work, we develop and release Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. Our fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases. Our models outperform open-source chat models on most benchmarks we tested, and based on our human evaluations for helpfulness and safety, may be a suitable substitute for closed-source models. We provide a detailed description of our...