Full Library – Evidence Library – Artificial Intelligence in Measurement and Education

Some military experts are wary of generative AI

Ryan Heath | May 1st, 2024 | webpage

The Pentagon is hitting the brakes on the new technology even as business is charging forward.

PARIKSHA: A Scalable, Democratic, Transparent Evaluation Platform for Assessing Indic Large Language Models

Ishaan Watts, Varun Gumma, Aditya Yadava... | May 1st, 2024 | journalArticle

Evaluation of multilingual Large Language Models (LLMs) is challenging due to a variety of factors – the lack of benchmarks with sufficient linguistic diversity, contamination of popular benchmarks into LLM pre-training data and the lack of local, cultural nuances in translated benchmarks. Hence, it is difficult to do extensive evaluation of LLMs in the multilingual […]

The Neglected 15%: Positive Effects of Hybrid Human-AI Tutoring Among Students with Disabilities

Danielle R. Thomas, Erin Gatz, Shivang G... | Apr 29th, 2024 | preprint

Incorporating human tutoring with AI holds promise for supporting diverse math learners. In the U.S., approximately 15% of students receive special education services, with limited previous research within AIED on the impact of AI-assisted learning among students with disabilities. Previous work combining human tutors and AI suggests that students with lower prior knowledge, such as lacking basic skills, exhibit greater learning gains compared to their more knowledgeable peers. Building upon...

The Ethics of Advanced AI Assistants

Iason Gabriel, Arianna Manzini, Geoff Ke... | Apr 28th, 2024 | preprint

This paper focuses on the opportunities and the ethical and societal risks posed by advanced AI assistants. We define advanced AI assistants as artificial agents with natural language interfaces, whose function is to plan and execute sequences of actions on behalf of a user, across one or more domains, in line with the user's expectations. The paper starts by considering the technology itself, providing an overview of AI assistants, their technical foundations and potential range of...

AutoTutor meets Large Language Models: A Language Model Tutor with Rich Pedagogy and Guardrails

Sankalan Pal Chowdhury, Vilém Zouhar, Mr... | Apr 25th, 2024 | preprint

Large Language Models (LLMs) have found several use cases in education, ranging from automatic question generation to essay evaluation. In this paper, we explore the potential of using Large Language Models (LLMs) to author Intelligent Tutoring Systems. A common pitfall of LLMs is their straying from desired pedagogical strategies such as leaking the answer to the student, and in general, providing no guarantees. We posit that while LLMs with certain guardrails can take the place of subject...

In-Context Learning for Scalable and Online Hallucination Detection in RAGS

Nicolò Cosimo Albanese | Apr 20th, 2024 | conferencePaper

Ensuring fidelity to source documents is crucial for the responsible use of Large Language Models (LLMs) in Retrieval Augmented Generation (RAG) systems. We propose a lightweight method for real-time hallucination detection, with potential to be deployed as a model-agnostic microservice to bolster reliability. Using in-context learning, our approach evaluates response factuality at the sentence level without annotated data, promoting transparency and user trust. Compared to other...

Combining machine translation and automated scoring in international large-scale assessments

Ji Yoon Jung, Lillian Tyack, Matthias vo... | Apr 8th, 2024 | journalArticle

Artificial intelligence (AI) is rapidly changing communication and technology-driven content creation and is also being used more frequently in education. Despite these advancements, AI-powered automated scoring in international large-scale assessments (ILSAs) remains largely unexplored due to the scoring challenges associated with processing large amounts of multilingual responses. However, due to their low-stakes nature, ILSAs are an ideal ground for innovations and exploring new methodologies.

How Tech Giants Cut Corners to Harvest Data for A.I.

Cade Metz, Cecilia Kang, Sheera Frenkel,... | Apr 6th, 2024 | newspaperArticle

OpenAI, Google and Meta ignored corporate policies, altered their own rules and discussed skirting copyright law as they sought online information to train their newest artificial intelligence systems.

Designing Child-Centric AI Learning Environments: Insights from LLM-Enhanced Creative Project-Based Learning

Siyu Zha, Yuehan Qiao, Qingyu Hu | Apr 5th, 2024 | preprint

Project-based learning (PBL) is an instructional method that is very helpful in nurturing students' creativity, but it requires significant time and energy from both students and teachers. Large language models (LLMs) have been proven to assist in creative tasks, yet much controversy exists regarding their role in fostering creativity. This paper explores the potential of LLMs in PBL settings, with a special focus on fostering creativity. We began with an exploratory study involving 12...

Artificial Intelligence and Curriculum Implementation in Public Secondary Schools of Federal Capital Territory, Abuja, Nigeria

Ahmed Mohammed, Umar Faiza Bashir, Abuba... | Apr 3rd, 2024 | journalArticle

The study assessed the impact of artificial intelligence on curriculum implementation in public secondary schools in the Federal Capital Territory (FCT), Abuja, Nigeria. The research design used for the study is a descriptive survey. The population of the study comprises all the teachers in public secondary schools in the FCT. The sample for the study is 320 respondents. The researcher formulated a questionnaire titled Artificial Intelligence on Curriculum Implementation Questionnaire (AICIQ). The...

Automated Assessment of Encouragement and Warmth in Classrooms Leveraging Multimodal Emotional Features and ChatGPT

Ruikun Hou, Tim Fütterer, Babette Bühler... | Apr 1st, 2024 | preprint

Classroom observation protocols standardize the assessment of teaching effectiveness and facilitate comprehension of classroom interactions. Whereas these protocols offer teachers specific feedback on their teaching practices, the manual coding by human raters is resource-intensive and often unreliable. This has sparked interest in developing AI-driven, cost-effective methods for automating such holistic coding. Our work explores a multimodal approach to automatically estimating...

Predictors of middle school students’ perceptions of automated writing evaluation

Joshua Wilson, Fan Zhang, Corey Palermo,... | Apr 1st, 2024 | journalArticle

This study examined middle school students' perceptions of an automated writing evaluation (AWE) system, MI Write. We summarize students' perceptions of MI Write's usability, usefulness, and desirability both quantitatively and qualitatively. We then estimate hierarchical entry regression models that account for district context, classroom climate, demographic factors (i.e., gender, special education status, limited English proficiency status, socioeconomic status, grade), students'...

Contextual evaluation of LLM’s performance on primary education science learning contents in the Yoruba language

Olanrewaju Lawal, Anthony Soronnadi, Olu... | Apr 1st, 2024 | conferencePaper

In the rapidly evolving era of artificial intelligence, Large Language Models (LLMs) like ChatGPT-3.5, Llama, and PaLM 2 play a pivotal role in reshaping education. Trained on diverse language data with a predominant focus on English, these models exhibit remarkable proficiency in comprehending and generating intricate human language constructs, revolutionizing educational applications. This potential has prompted exploration into personalized and enriched educational experiences,...

ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems

Jon Saad-Falcon, Omar Khattab, Christoph... | Mar 31st, 2024 | preprint

Evaluating retrieval-augmented generation (RAG) systems traditionally relies on hand annotations for input queries, passages to retrieve, and responses to generate. We introduce ARES, an Automated RAG Evaluation System, for evaluating RAG systems along the dimensions of context relevance, answer faithfulness, and answer relevance. By creating its own synthetic training data, ARES finetunes lightweight LM judges to assess the quality of individual RAG components. To mitigate potential...

CodeBenchGen: Creating Scalable Execution-based Code Generation Benchmarks

Yiqing Xie, Alex Xie, Divyanshu Sheth | Mar 31st, 2024 | preprint

To facilitate evaluation of code generation systems across diverse scenarios, we present CodeBenchGen, a framework to create scalable execution-based benchmarks that only requires light guidance from humans. Specifically, we leverage a large language model (LLM) to convert an arbitrary piece of code into an evaluation example, including test cases for execution-based evaluation. We illustrate the usefulness of our framework by creating a dataset, Exec-CSN, which includes 1,931 examples...

Researchers at Stanford University Introduce 'pyvene': An Open-Source Python Library that Supports Intervention-Based Research on Machine Learning Models

Muhammad Athar Ganaie | Mar 16th, 2024 | blogPost

Understanding and manipulating neural models is essential in the evolving field of AI. This necessity stems from various applications, from refining models for enhanced robustness to unraveling their decision-making processes for greater interpretability. Amidst this backdrop, the Stanford University research team has introduced 'pyvene,' a groundbreaking open-source Python library that facilitates intricate interventions on PyTorch models. pyvene is ingeniously designed to overcome the...

Leak, Cheat, Repeat: Data Contamination and Evaluation Malpractices in Closed-Source LLMs

Simone Balloccu, Patrícia Schmidtová, Ma... | Feb 22nd, 2024 | preprint

Natural Language Processing (NLP) research is increasingly focusing on the use of Large Language Models (LLMs), with some of the most popular ones being either fully or partially closed-source. The lack of access to model details, especially regarding training data, has repeatedly raised concerns about data contamination among researchers. Several attempts have been made to address this issue, but they are limited to anecdotal evidence and trial and error. Additionally, they overlook the...

AI in education is a public problem

Ben Williamson | Feb 22nd, 2024 | blogPost

Over the past year or so, a narrative that AI will inevitably transform education has become widespread. You can find it in the pronouncements of investors, tech ind…

Can AI-Generated Text be Reliably Detected?

Vinu Sankar Sadasivan, Aounon Kumar, Sri... | Feb 19th, 2024 | preprint

The unregulated use of LLMs can potentially lead to malicious consequences such as plagiarism, generating fake news, spamming, etc. Therefore, reliable detection of AI-generated text can be critical to ensure the responsible use of LLMs. Recent works attempt to tackle this problem either using certain model signatures present in the generated text outputs or by applying watermarking techniques that imprint specific patterns onto them. In this paper, we show that these detectors are not...
