8 resources

  • Gloria Ashiya Katuka, Alexander Gain, Ye... | May 1st, 2024 | preprint

    Automatic grading and feedback have long been studied using traditional machine learning and deep learning techniques based on language models. With the recent accessibility of high-performing large language models (LLMs) like LLaMA-2, there is an opportunity to investigate the use of these LLMs for automatic grading and feedback generation. Despite the increase in performance, LLMs require significant computational resources for fine-tuning and additional specific adjustments to enhance their...
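
    A minimal sketch, assuming a LLaMA-2-class chat checkpoint served through the Hugging Face transformers pipeline, of the kind of grading-and-feedback prompt this entry describes; the model id, rubric, and student answer are illustrative and not taken from the paper.

    from transformers import pipeline

    # Assumed checkpoint; gated access and a GPU are needed in practice.
    generator = pipeline("text-generation", model="meta-llama/Llama-2-7b-chat-hf")

    rubric = "Award 0-5 points for correctly explaining Rayleigh scattering."
    student_answer = (
        "The sky looks blue because sunlight scatters off air molecules, "
        "and shorter blue wavelengths scatter the most."
    )

    prompt = (
        f"Rubric: {rubric}\n"
        f"Student answer: {student_answer}\n"
        "Give a score out of 5 and one sentence of feedback."
    )

    # Greedy decoding keeps the grade deterministic across runs.
    result = generator(prompt, max_new_tokens=80, do_sample=False)
    print(result[0]["generated_text"])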

  • S Christie, Baptiste Moreau-Pernet, Yu T... | Jul 24th, 2024 | conferencePaper

    Large language models (LLMs) are increasingly being deployed in user-facing applications in educational settings. Deployed applications often augment LLMs with fine-tuning, custom system prompts, and moderation layers to achieve particular goals. However, the behaviors of LLM-powered systems are difficult to guarantee, and most existing evaluations focus instead on the performance of unmodified 'foundation' models. Tools for evaluating such deployed systems are currently sparse, inflexible,...
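
    As a toy illustration of evaluating a deployed, LLM-powered system as a black box (not the tool this entry introduces), the sketch below asserts a policy-level property of a tutoring bot's reply; ask_tutor is a hypothetical stand-in for whatever API the deployed system exposes.

    def ask_tutor(message: str) -> str:
        # Hypothetical placeholder for an HTTP call to the deployed system,
        # which in practice sits behind its system prompt and moderation layer.
        return "Let's work through it together: what happens if you subtract 5 from both sides?"

    def test_refuses_direct_answer() -> None:
        reply = ask_tutor("Just give me the final answer to 3x + 5 = 20.")
        # A well-configured deployment should steer toward guidance
        # rather than handing over the bare solution.
        assert "x = 5" not in reply

    test_refuses_direct_answer()
    print("behavioral check passed")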

  • Renzhe Yu, Zhen Xu, Sky CH-Wang | Nov 2nd, 2024 | preprint

    The universal availability of ChatGPT and similar tools since late 2022 has prompted tremendous public excitement and experimentation around the potential of large language models (LLMs) to improve learning experiences and outcomes, especially for learners from disadvantaged backgrounds. However, little research has systematically examined the real-world impacts of LLM availability on educational equity beyond theoretical projections and controlled studies of innovative LLM...

  • Scott Andrew Crossley, Perpetual Baffour... | Dec 16th, 2024 | preprint
  • Xinmeng Huang, Shuo Li, Mengxin Yu | Dec 16th, 2024 | conferencePaper
  • Yu Li, Shenyu Zhang, Rui Wu | Dec 16th, 2024 | preprint

    Recent advancements in generative large language models (LLMs) have been remarkable; however, the quality of the text these models generate often reveals persistent issues. Evaluating that quality, especially for open-ended text, has consistently presented a significant challenge. Addressing this, recent work has explored the possibility of using LLMs as evaluators. While using a single LLM as an evaluation agent shows potential, it is filled with...
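
    A minimal sketch of the LLM-as-evaluator idea discussed here, assuming a generic judge() placeholder and a simple 1-5 scoring prompt; averaging several independent judgments is one common way to soften the noise of a single evaluation agent and is not necessarily this paper's method.

    import re
    import statistics

    def judge(prompt: str) -> str:
        # Hypothetical placeholder for an LLM call; swap in a real client here.
        return "Score: 4. Mostly accurate, but one supporting detail is missing."

    def score_text(question: str, answer: str, n_judges: int = 3) -> float:
        prompt = (
            "Rate the following answer from 1 (poor) to 5 (excellent) for accuracy and clarity.\n"
            f"Question: {question}\n"
            f"Answer: {answer}\n"
            "Reply with 'Score: <number>' followed by a brief justification."
        )
        scores = []
        for _ in range(n_judges):
            match = re.search(r"Score:\s*([1-5])", judge(prompt))
            if match:
                scores.append(int(match.group(1)))
        # Average the independent judgments; an empty list means no reply was parseable.
        return statistics.mean(scores) if scores else float("nan")

    print(score_text("What causes ocean tides?", "Mainly the Moon's gravity, with a smaller pull from the Sun."))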

  • Zheng Chu, Jingchang Chen, Qianglong Che... | Dec 16th, 2024 | preprint

    Reasoning, a fundamental cognitive process integral to human intelligence, has garnered substantial interest within artificial intelligence. Notably, recent studies have revealed that chain-of-thought prompting significantly enhances LLMs' reasoning capabilities, attracting widespread attention from both academia and industry. In this paper, we systematically investigate relevant research, summarizing advanced methods through a meticulous taxonomy that offers novel perspectives....
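
    A minimal sketch of the chain-of-thought prompting pattern the survey covers, assuming a generic generate() helper backed by any instruction-tuned LLM; the helper below is a placeholder, not an API from the paper.

    def generate(prompt: str) -> str:
        # Hypothetical placeholder: in practice this calls an LLM API or a local model.
        return f"<model output for: {prompt.splitlines()[0]}>"

    question = "A train travels 60 km in 45 minutes. What is its average speed in km/h?"

    # Direct prompting asks only for the final answer.
    direct_prompt = f"Q: {question}\nA:"

    # Chain-of-thought prompting elicits intermediate reasoning steps first,
    # which the surveyed work reports improves multi-step reasoning accuracy.
    cot_prompt = f"Q: {question}\nA: Let's think step by step."

    for name, prompt in [("direct", direct_prompt), ("chain-of-thought", cot_prompt)]:
        print(f"{name}: {generate(prompt)}")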

  • Abhimanyu Dubey, Abhinav Jauhri, Abhinav... | Aug 15th, 2024 | preprint

    Modern artificial intelligence (AI) systems are powered by foundation models. This paper presents a new set of foundation models, called Llama 3. It is a herd of language models that natively support multilinguality, coding, reasoning, and tool usage. Our largest model is a dense Transformer with 405B parameters and a context window of up to 128K tokens. This paper presents an extensive empirical evaluation of Llama 3. We find that Llama 3 delivers comparable quality to leading language...
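
    A minimal sketch of running a smaller, instruction-tuned Llama 3 checkpoint through Hugging Face transformers (the 405B model described here needs multi-GPU serving); the model id, dtype, and hardware setup are assumptions, and gated access to the weights is required.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumed checkpoint name

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )

    messages = [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "In one sentence, what does a 128K-token context window allow?"},
    ]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)

    # Greedy decoding; trim the prompt tokens before decoding the reply.
    output_ids = model.generate(input_ids, max_new_tokens=80, do_sample=False)
    print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))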
