62 resources

  • Abhimanyu Dubey, Abhinav Jauhri, Abhinav...
    |
    Aug 15th, 2024
    |
    preprint

    Modern artificial intelligence (AI) systems are powered by foundation models. This paper presents a new set of foundation models, called Llama 3. It is a herd of language models that natively support multilinguality, coding, reasoning, and tool usage. Our largest model is a dense Transformer with 405B parameters and a context window of up to 128K tokens. This paper presents an extensive empirical evaluation of Llama 3. We find that Llama 3 delivers comparable quality to leading language...

  • Bhashithe Abeysinghe, Ruhan Circi
    |
    Jun 5th, 2024
    |
    preprint

    Chatbots have been an interesting application of natural language generation since its inception. With novel transformer-based Generative AI methods, building chatbots has become trivial. Chatbots which are targeted at specific domains such as medicine, psychology, and general information retrieval are implemented rapidly. This, however, should not distract from the need to evaluate the chatbot responses. Especially because the natural language generation community does not entirely agree...

  • Juan D. Pinto, Luc Paquette
    |
    May 22nd, 2024
    |
    preprint

    The challenge of creating interpretable models has been taken up by two main research communities: ML researchers primarily focused on lower-level explainability methods that suit the needs of engineers, and HCI researchers who have more heavily emphasized user-centered approaches often based on participatory design methods. This paper reviews how these communities have evaluated interpretability, identifying overlaps and semantic misalignments. We propose moving towards a unified framework...

  • Nicolò Cosimo Albanese
    |
    Apr 20th, 2024
    |
    conferencePaper

    Ensuring fidelity to source documents is crucial for the responsible use of Large Language Models (LLMs) in Retrieval Augmented Generation (RAG) systems. We propose a lightweight method for real-time hallucination detection, with potential to be deployed as a model-agnostic microservice to bolster reliability. Using in-context learning, our approach evaluates response factuality at the sentence level without annotated data, promoting transparency and user trust. Compared to other...

  • Jon Saad-Falcon, Omar Khattab, Christoph...
    |
    Mar 31st, 2024
    |
    preprint

    Evaluating retrieval-augmented generation (RAG) systems traditionally relies on hand annotations for input queries, passages to retrieve, and responses to generate. We introduce ARES, an Automated RAG Evaluation System, for evaluating RAG systems along the dimensions of context relevance, answer faithfulness, and answer relevance. By creating its own synthetic training data, ARES finetunes lightweight LM judges to assess the quality of individual RAG components. To mitigate potential...

  • Zhen Li, Xiaohan Xu, Tao Shen
    |
    Mar 14th, 2024
    |
    journalArticle

    In the rapidly evolving domain of Natural Language Generation (NLG) evaluation, introducing Large Language Models (LLMs) has opened new avenues for assessing generated content quality, e.g., coherence, creativity, and context relevance. This survey aims to provide a thorough overview of leveraging LLMs for NLG evaluation, a burgeoning area that lacks a systematic analysis. We propose a coherent taxonomy for organizing existing LLM-based evaluation metrics, offering a structured framework to...

  • Lyle Regenwetter, Akash Srivastava, Dan ...
    |
    Dec 14th, 2023
    |
    journalArticle
  • Yann Hicke, Anmol Agarwal, Qianou Ma
    |
    Nov 13th, 2023
    |
    preprint

    Responding to the thousands of student questions on online QA platforms each semester has a considerable human cost, particularly in computing courses with rapidly growing enrollments. To address the challenges of scalable and intelligent question-answering (QA), we introduce an innovative solution that leverages open-source Large Language Models (LLMs) from the LLaMA-2 family to ensure data privacy. Our approach combines augmentation techniques such as retrieval augmented generation (RAG),...

  • Lei Huang, Weijiang Yu, Weitao Ma
    |
    Nov 9th, 2023
    |
    preprint

    The emergence of large language models (LLMs) has marked a significant breakthrough in natural language processing (NLP), leading to remarkable advancements in text understanding and generation. Nevertheless, alongside these strides, LLMs exhibit a critical tendency to produce hallucinations, resulting in content that is inconsistent with real-world facts or user inputs. This phenomenon poses substantial challenges to their practical deployment and raises concerns over the reliability of...

  • Nick McKenna, Tianyi Li, Liang Cheng
    |
    Oct 22nd, 2023
    |
    preprint

    Large Language Models (LLMs) are claimed to be capable of Natural Language Inference (NLI), necessary for applied tasks like question answering and summarization. We present a series of behavioral studies on several LLM families (LLaMA, GPT-3.5, and PaLM) which probe their behavior using controlled experiments. We establish two biases originating from pretraining which predict much of their behavior, and show that these are major sources of hallucination in generative LLMs. First,...

  • Ziang Xiao, Susu Zhang, Vivian Lai
    |
    Oct 22nd, 2023
    |
    preprint

    We address a fundamental challenge in Natural Language Generation (NLG) model evaluation -- the design and evaluation of evaluation metrics. Recognizing the limitations of existing automatic metrics and noises from how current human evaluation was conducted, we propose MetricEval, a framework informed by measurement theory, the foundation of educational test design, for conceptualizing and evaluating the reliability and validity of NLG evaluation metrics. The framework formalizes the source...

  • Amos Azaria, Tom Mitchell
    |
    Oct 17th, 2023
    |
    preprint

    While Large Language Models (LLMs) have shown exceptional performance in various tasks, one of their most prominent drawbacks is generating inaccurate or false information with a confident tone. In this paper, we provide evidence that the LLM's internal state can be used to reveal the truthfulness of statements. This includes both statements provided to the LLM, and statements that the LLM itself generates. Our approach is to train a classifier that outputs the probability that a statement...

  • Surjodeep Sarkar, Manas Gaur, Lujie Kare...
    |
    Oct 12th, 2023
    |
    journalArticle

    Virtual Mental Health Assistants (VMHAs) continuously evolve to support the overloaded global healthcare system, which receives approximately 60 million primary care visits and 6 million emergency room visits annually. These systems, developed by clinical psychologists, psychiatrists, and AI researchers, are designed to aid in Cognitive Behavioral Therapy (CBT). The main focus of VMHAs is to provide relevant information to mental health professionals (MHPs) and engage in meaningful...

  • Jon Chun, Katherine Elkins
    |
    Oct 11th, 2023
    |
    journalArticle
  • Ning Miao, Yee Whye Teh, Tom Rainforth
    |
    Oct 5th, 2023
    |
    preprint

    The recent progress in large language models (LLMs), especially the invention of chain-of-thought prompting, has made it possible to automatically answer questions by stepwise reasoning. However, when faced with more complicated problems that require non-linear thinking, even the strongest LLMs make mistakes. To address this, we explore whether LLMs are able to recognize errors in their own step-by-step reasoning, without resorting to external resources. To this end, we propose SelfCheck, a...

  • Ted Zadouri, Ahmet Üstün, Arash Ahmadian...
    |
    Sep 11th, 2023
    |
    preprint

    The Mixture of Experts (MoE) is a widely known neural architecture where an ensemble of specialized sub-models optimizes overall performance with a constant computational cost. However, conventional MoEs pose challenges at scale due to the need to store all experts in memory. In this paper, we push MoE to the limit. We propose extremely parameter-efficient MoE by uniquely combining MoE architecture with lightweight experts. Our MoE architecture outperforms standard parameter-efficient...

  • Griffin Adams, Alexander Fabbri, Faisal ...
    |
    Sep 8th, 2023
    |
    preprint

    Selecting the "right" amount of information to include in a summary is a difficult task. A good summary should be detailed and entity-centric without being overly dense and hard to follow. To better understand this tradeoff, we solicit increasingly dense GPT-4 summaries with what we refer to as a "Chain of Density" (CoD) prompt. Specifically, GPT-4 generates an initial entity-sparse summary before iteratively incorporating missing salient entities without increasing the length. Summaries...

  • Ekaterina Svikhnushina, Pearl Pu
    |
    Aug 25th, 2023
    |
    preprint

    As conversational models become increasingly available to the general public, users are engaging with this technology in social interactions. Such unprecedented interaction experiences may pose considerable social and psychological risks to the users unless the technology is properly controlled. This highlights the need for scalable and robust evaluation metrics for conversational chatbots. Existing evaluation metrics aim to automate offline user evaluation and approximate human judgment of...

  • Chi-Min Chan, Weize Chen, Yusheng Su
    |
    Aug 14th, 2023
    |
    preprint

    Text evaluation has historically posed significant challenges, often demanding substantial labor and time cost. With the emergence of large language models (LLMs), researchers have explored LLMs' potential as alternatives for human evaluation. While these single-agent-based approaches show promise, experimental results suggest that further advancements are needed to bridge the gap between their current effectiveness and human-level evaluation quality. Recognizing that best practices of human...

  • Ajay Bandi, Pydi Venkata Satya Ramesh Ad...
    |
    Jul 31st, 2023
    |
    journalArticle

    Generative artificial intelligence (AI) has emerged as a powerful technology with numerous applications in various domains. There is a need to identify the requirements and evaluation metrics for generative AI models designed for specific tasks. This research aims to investigate the fundamental aspects of generative AI systems, including their requirements, models, input–output formats, and evaluation metrics. The study addresses key research questions and presents...

Last update from database: 02/03/2025, 19:15 (UTC)