5 resources
-
Yang Liu, Dan Iter, Yichong Xu|May 23rd, 2023|preprint
The quality of texts generated by natural language generation (NLG) systems is hard to measure automatically. Conventional reference-based metrics, such as BLEU and ROUGE, have been shown to have relatively low correlation with human judgments, especially for tasks that require creativity and diversity. Recent studies suggest using large language models (LLMs) as reference-free metrics for NLG evaluation, which have the benefit of being applicable to new tasks that lack human references....
-
Jiaan Wang, Yunlong Liang, Fandong Meng,...|Apr 4th, 2023|journalArticle
Recently, the emergence of ChatGPT has attracted wide attention from the computational linguistics community. Many prior studies have shown that ChatGPT achieves remarkable performance on various NLP tasks in terms of automatic evaluation metrics. However, the ability of ChatGPT to serve as an evaluation metric is still underexplored. Considering assessing the quality of natural language generation (NLG) models is an arduous task and NLG metrics notoriously show their poor correlation with...
-
Jason Wei, Xuezhi Wang, Dale Schuurmans,...|Jan 10th, 2023|preprint
We explore how generating a chain of thought -- a series of intermediate reasoning steps -- significantly improves the ability of large language models to perform complex reasoning. In particular, we show how such reasoning abilities emerge naturally in sufficiently large language models via a simple method called chain of thought prompting, where a few chain of thought demonstrations are provided as exemplars in prompting. Experiments on three large language models show that chain of...
-
Lei Huang, Weijiang Yu, Weitao Ma|Nov 9th, 2023|preprint
The emergence of large language models (LLMs) has marked a significant breakthrough in natural language processing (NLP), leading to remarkable advancements in text understanding and generation. Nevertheless, alongside these strides, LLMs exhibit a critical tendency to produce hallucinations, resulting in content that is inconsistent with real-world facts or user inputs. This phenomenon poses substantial challenges to their practical deployment and raises concerns over the reliability of...
-
Abhimanyu Dubey, Abhinav Jauhri, Abhinav...|Aug 15th, 2024|preprint
Modern artificial intelligence (AI) systems are powered by foundation models. This paper presents a new set of foundation models, called Llama 3. It is a herd of language models that natively support multilinguality, coding, reasoning, and tool usage. Our largest model is a dense Transformer with 405B parameters and a context window of up to 128K tokens. This paper presents an extensive empirical evaluation of Llama 3. We find that Llama 3 delivers comparable quality to leading language...