6 resources
-
Abhimanyu Dubey, Abhinav Jauhri, Abhinav... | Aug 15th, 2024 | preprint
Modern artificial intelligence (AI) systems are powered by foundation models. This paper presents a new set of foundation models, called Llama 3. It is a herd of language models that natively support multilinguality, coding, reasoning, and tool usage. Our largest model is a dense Transformer with 405B parameters and a context window of up to 128K tokens. This paper presents an extensive empirical evaluation of Llama 3. We find that Llama 3 delivers comparable quality to leading language...
-
Bhashithe Abeysinghe, Ruhan Circi | Jun 5th, 2024 | preprint
Chatbots have been an interesting application of natural language generation since its inception. With novel transformer-based Generative AI methods, building chatbots has become trivial. Chatbots targeted at specific domains such as medicine, psychology, and general information retrieval are being implemented rapidly. This, however, should not distract from the need to evaluate the chatbot responses. Especially because the natural language generation community does not entirely agree...
-
Juan D. Pinto, Luc Paquette | May 22nd, 2024 | preprint
The challenge of creating interpretable models has been taken up by two main research communities: ML researchers primarily focused on lower-level explainability methods that suit the needs of engineers, and HCI researchers who have more heavily emphasized user-centered approaches often based on participatory design methods. This paper reviews how these communities have evaluated interpretability, identifying overlaps and semantic misalignments. We propose moving towards a unified framework...
-
Nicolò Cosimo Albanese | Apr 20th, 2024 | conference paper
Ensuring fidelity to source documents is crucial for the responsible use of Large Language Models (LLMs) in Retrieval Augmented Generation (RAG) systems. We propose a lightweight method for real-time hallucination detection, with potential to be deployed as a model-agnostic microservice to bolster reliability. Using in-context learning, our approach evaluates response factuality at the sentence level without annotated data, promoting transparency and user trust. Compared to other...
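A minimal sketch of how such sentence-level, in-context faithfulness checking might look, assuming a generic `llm` text-completion callable; the prompt wording, sentence splitting, and YES/NO parsing are illustrative assumptions rather than the paper's exact method:

```python
# Sketch: flag each response sentence as supported or not by the retrieved context.
# Model-agnostic: `llm` is any function that maps a prompt string to a completion.
import re
from typing import Callable, List, Tuple

PROMPT = (
    "Context:\n{context}\n\n"
    "Sentence: {sentence}\n\n"
    "Is the sentence fully supported by the context? Answer YES or NO."
)

def check_faithfulness(
    context: str,
    response: str,
    llm: Callable[[str], str],
) -> List[Tuple[str, bool]]:
    """Return (sentence, is_supported) pairs for every sentence in the response."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", response) if s.strip()]
    results = []
    for sentence in sentences:
        answer = llm(PROMPT.format(context=context, sentence=sentence))
        results.append((sentence, answer.strip().upper().startswith("YES")))
    return results
```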
-
Jon Saad-Falcon, Omar Khattab, Christoph... | Mar 31st, 2024 | preprint
Evaluating retrieval-augmented generation (RAG) systems traditionally relies on hand annotations for input queries, passages to retrieve, and responses to generate. We introduce ARES, an Automated RAG Evaluation System, for evaluating RAG systems along the dimensions of context relevance, answer faithfulness, and answer relevance. By creating its own synthetic training data, ARES finetunes lightweight LM judges to assess the quality of individual RAG components. To mitigate potential...
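A minimal sketch of the judge-based scoring loop described above, assuming hypothetical `Judge` callables for context relevance, answer faithfulness, and answer relevance; the interface and aggregation are illustrative assumptions, and the paper's additional calibration step is omitted:

```python
# Sketch: apply lightweight judges to (query, passage, answer) triples from a RAG
# system and report the acceptance rate per evaluation dimension.
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List

Judge = Callable[[str, str, str], bool]  # (query, passage, answer) -> pass / fail


@dataclass
class RagExample:
    query: str
    passage: str
    answer: str


def score_rag_system(
    examples: Iterable[RagExample],
    judges: Dict[str, Judge],  # e.g. {"context_relevance": ..., "answer_faithfulness": ...}
) -> Dict[str, float]:
    """Return the fraction of examples each judge accepts."""
    items: List[RagExample] = list(examples)
    return {
        name: sum(judge(e.query, e.passage, e.answer) for e in items) / len(items)
        for name, judge in judges.items()
    }
```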
-
Zhen Li, Xiaohan Xu, Tao Shen | Apr 28th, 2024 | journal article
In the rapidly evolving domain of Natural Language Generation (NLG) evaluation, introducing Large Language Models (LLMs) has opened new avenues for assessing generated content quality, e.g., coherence, creativity, and context relevance. This survey aims to provide a thorough overview of leveraging LLMs for NLG evaluation, a burgeoning area that lacks a systematic analysis. We propose a coherent taxonomy for organizing existing LLM-based evaluation metrics, offering a structured framework to...