59 resources
- Gyeong-Geon Lee, Ehsan Latif, Xuansheng ... | Jan 22nd, 2023 | journalArticle
This study investigates the application of large language models (LLMs), specifically GPT-3.5 and GPT-4, with Chain-of-Thought (CoT) in the automatic scoring of student-written responses to science assessments. We focused on overcoming the challenges of accessibility, technical complexity, and lack of explainability that have previously limited the use of artificial intelligence-based automatic scoring tools among researchers and educators. With a testing dataset comprising six assessment...
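The mechanics are simple to sketch: the model is asked to reason through the rubric before committing to a score. A minimal illustration with the OpenAI chat API follows; the rubric wording, score range, and model name are placeholder assumptions, not the paper's actual setup.

```python
# Hypothetical sketch of CoT-prompted automatic scoring; the rubric text,
# 0-3 scale, and model name are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def score_response(question: str, rubric: str, response: str) -> str:
    prompt = (
        f"Question: {question}\n"
        f"Rubric: {rubric}\n"
        f"Student response: {response}\n\n"
        "Reason step by step about how well the response meets each rubric "
        "level, then give a final score from 0 to 3 on the last line as "
        "'Score: <n>'."
    )
    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return completion.choices[0].message.content
```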
- Gyeong-Geon Lee, Ehsan Latif, Xuansheng ... | Jun 22nd, 2024 | journalArticle
- Yiqing Xie, Alex Xie, Divyanshu Sheth | Mar 31st, 2024 | preprint
To facilitate evaluation of code generation systems across diverse scenarios, we present CodeBenchGen, a framework to create scalable execution-based benchmarks that only requires light guidance from humans. Specifically, we leverage a large language model (LLM) to convert an arbitrary piece of code into an evaluation example, including test cases for execution-based evaluation. We illustrate the usefulness of our framework by creating a dataset, Exec-CSN, which includes 1,931 examples...
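Execution-based evaluation of the kind this framework produces boils down to running a candidate solution against its test cases in a sandboxed process. A generic harness might look roughly like this; it is a sketch, not the CodeBenchGen implementation.

```python
# Generic execution-based evaluation harness: run a candidate solution
# against generated test cases and report pass/fail.
import subprocess
import sys
import tempfile

def passes_tests(solution_code: str, test_code: str, timeout: float = 10.0) -> bool:
    program = solution_code + "\n\n" + test_code
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=timeout)
        return result.returncode == 0  # assert-based tests: nonzero means failure
    except subprocess.TimeoutExpired:
        return False
```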
- Yang Liu, Dan Iter, Yichong Xu | Jan 22nd, 2023 | preprint
The quality of texts generated by natural language generation (NLG) systems is hard to measure automatically. Conventional reference-based metrics, such as BLEU and ROUGE, have been shown to have relatively low correlation with human judgments, especially for tasks that require creativity and diversity. Recent studies suggest using large language models (LLMs) as reference-free metrics for NLG evaluation, which have the benefit of being applicable to new tasks that lack human references....
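A reference-free LLM metric of this kind can be sketched as a single rating prompt; the criterion wording and the 1-5 scale below are illustrative assumptions rather than the paper's exact scheme.

```python
# Sketch of a reference-free LLM metric: rate generated text on a fixed
# scale with no human reference. Criterion and scale are assumptions.
from openai import OpenAI

client = OpenAI()

def llm_rate(source: str, generated: str, criterion: str = "coherence") -> str:
    prompt = (
        f"Source text:\n{source}\n\nGenerated text:\n{generated}\n\n"
        f"Rate the generated text's {criterion} from 1 (worst) to 5 (best). "
        "Answer with the number only."
    )
    out = client.chat.completions.create(
        model="gpt-4", messages=[{"role": "user", "content": prompt}]
    )
    return out.choices[0].message.content.strip()
```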
- Yan Zhuang, Qi Liu, Zhenya Huang | Jun 28th, 2022 | journalArticle
Computerized Adaptive Testing (CAT) refers to an efficient and personalized test mode in online education, aiming to accurately measure student proficiency level on the required subject/domain. The key component of CAT is the "adaptive" question selection algorithm, which automatically selects the best-suited question for a student based on his/her current estimated proficiency, reducing test length. Existing algorithms rely on some manually designed and pre-fixed informativeness/uncertainty...
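The "manually designed and pre-fixed informativeness" criterion the abstract refers to is classically Fisher information under an IRT model; a minimal sketch for the 2PL case, with invented item parameters:

```python
# Classic CAT item selection: pick the unanswered item with maximum Fisher
# information I(theta) = a^2 * P * (1 - P) at the current ability estimate.
# Item parameters (a, b) below are made up for illustration.
import math

def p_correct(theta: float, a: float, b: float) -> float:
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def fisher_information(theta: float, a: float, b: float) -> float:
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

items = {"q1": (1.2, -0.5), "q2": (0.8, 0.0), "q3": (1.5, 1.0)}  # (a, b)
theta_hat = 0.3
best = max(items, key=lambda q: fisher_information(theta_hat, *items[q]))
print(best)  # the most informative item at theta_hat
```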
- Xuansheng Wu, Padmaja Pravin Saraf, Gyeo... | Feb 21st, 2025 | preprint
Large language models (LLMs) have demonstrated strong potential in performing automatic scoring for constructed response assessments. While constructed responses graded by humans are usually based on given grading rubrics, the methods by which LLMs assign scores remain largely unclear. It is also uncertain how closely AI's scoring process mirrors that of humans or if it adheres to the same grading criteria. To address this gap, this paper uncovers the grading rubrics that LLMs used to score...
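The observable half of this analysis can be approximated by asking the scorer to state the criteria it applied and then comparing them with the human rubric; the sketch below uses a toy word-overlap comparison, which is an assumption, not the paper's method.

```python
# Sketch: elicit the criteria an LLM says it applied while scoring, then
# compare them to the human rubric with a crude Jaccard overlap (a toy
# proxy, not the paper's analysis).
from openai import OpenAI

client = OpenAI()

def elicit_rubric(question: str, response: str) -> str:
    prompt = (
        f"Question: {question}\nStudent response: {response}\n\n"
        "Score the response from 0 to 3, then list the grading criteria "
        "you actually applied, one per line."
    )
    out = client.chat.completions.create(
        model="gpt-4", messages=[{"role": "user", "content": prompt}]
    )
    return out.choices[0].message.content

def rubric_overlap(llm_rubric: str, human_rubric: str) -> float:
    a, b = set(llm_rubric.lower().split()), set(human_rubric.lower().split())
    return len(a & b) / max(len(a | b), 1)
```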
- Jiangang Hao, Wenju Cui, Patrick C. Kyll... | Jan 22nd, 2025 | conferencePaper
- Ou Lydia Liu, Chris Brew, John Blackmore... | Mar 6th, 2014 | journalArticle
Content‐based automated scoring has been applied in a variety of science domains. However, many prior applications involved simplified scoring rubrics without considering rubrics representing multiple levels of understanding. This study tested a concept‐based scoring tool for content‐based scoring, c‐rater™, for four science items with rubrics aiming to differentiate among multiple levels of understanding. The items showed moderate to good agreement with human scores. The findings suggest...
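Agreement between automated and human scores in this literature is commonly reported as quadratic weighted kappa, which scikit-learn computes directly (the scores below are invented):

```python
# Quadratic weighted kappa between human and machine scores; the two score
# vectors here are fabricated solely to show the call.
from sklearn.metrics import cohen_kappa_score

human = [0, 1, 2, 2, 3, 1, 0, 2]
machine = [0, 1, 2, 1, 3, 1, 1, 2]
qwk = cohen_kappa_score(human, machine, weights="quadratic")
print(f"QWK = {qwk:.2f}")
```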
- T. Solorio, M. Sherman, Y. Liu | Oct 22nd, 2010 | journalArticle
In this work we study how features typically used in natural language processing tasks, together with measures from syntactic complexity, can be adapted to the problem of developing language profiles of bilingual children. Our experiments show that these features can provide high discriminative value for predicting language dominance from story retells in a Spanish–English bilingual population of children. Moreover, some of our proposed features are even more powerful than measures commonly...
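The general recipe, turning a retell into a small vector of complexity features and training a classifier on it, can be sketched as follows; the two toy features, data, and labels are illustrative, not the paper's feature set.

```python
# Sketch: map a story retell to simple complexity proxies and fit a
# classifier for language dominance. Features and data are toy assumptions.
from sklearn.linear_model import LogisticRegression

def features(text: str) -> list[float]:
    sentences = [s for s in text.split(".") if s.strip()]
    words = text.split()
    mean_sentence_length = len(words) / max(len(sentences), 1)
    type_token_ratio = len(set(w.lower() for w in words)) / max(len(words), 1)
    return [mean_sentence_length, type_token_ratio]

X = [features(t) for t in ["The dog ran. It was fast.", "Dog run fast very."]]
y = [1, 0]  # 1 = English-dominant, 0 = Spanish-dominant (toy labels)
clf = LogisticRegression().fit(X, y)
```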
- Nigel Fernandez, Aritra Ghosh, Naiming L... | Jan 22nd, 2022 | preprint
Automated scoring of open-ended student responses has the potential to significantly reduce human grader effort. Recent advances in automated scoring often leverage textual representations based on pre-trained language models such as BERT and GPT as input to scoring models. Most existing approaches train a separate model for each item/question, which is suitable for scenarios such as essay scoring where items can be quite different from one another. However, these approaches have two...
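The shared-model alternative the abstract motivates amounts to encoding (item, response) pairs with one pre-trained encoder feeding a single scoring head; a rough sketch, in which the model choice and [CLS] pooling are assumptions:

```python
# Sketch of one scoring model shared across items: encode the (item,
# response) pair with BERT and regress a score from the [CLS] vector.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
score_head = torch.nn.Linear(encoder.config.hidden_size, 1)

def predict_score(item_text: str, response_text: str) -> torch.Tensor:
    inputs = tokenizer(item_text, response_text,
                       return_tensors="pt", truncation=True)
    hidden = encoder(**inputs).last_hidden_state[:, 0]  # [CLS] pooling
    return score_head(hidden)  # head is trained jointly over all items
```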
- Chi-Min Chan, Weize Chen, Yusheng Su | Aug 14th, 2023 | preprint
Text evaluation has historically posed significant challenges, often demanding substantial labor and time cost. With the emergence of large language models (LLMs), researchers have explored LLMs' potential as alternatives for human evaluation. While these single-agent-based approaches show promise, experimental results suggest that further advancements are needed to bridge the gap between their current effectiveness and human-level evaluation quality. Recognizing that best practices of human...
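A multi-agent evaluation round in this spirit can be sketched as persona agents critiquing the same text while seeing earlier remarks; the personas and prompt wording below are invented for illustration.

```python
# Sketch of a multi-agent "referee panel": each persona critiques the text
# after reading the previous panelists' remarks. Personas are made up.
from openai import OpenAI

client = OpenAI()
personas = ["strict grammarian", "domain expert", "general reader"]

def debate_evaluate(text: str) -> list[str]:
    transcript: list[str] = []
    for persona in personas:
        prompt = (
            f"You are a {persona} on an evaluation panel.\n"
            f"Text under review:\n{text}\n\n"
            f"Previous panel remarks:\n{chr(10).join(transcript) or '(none)'}\n\n"
            "Give a short critique and a 1-5 quality score."
        )
        reply = client.chat.completions.create(
            model="gpt-4", messages=[{"role": "user", "content": prompt}]
        ).choices[0].message.content
        transcript.append(f"{persona}: {reply}")
    return transcript
```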
- Xiner Liu, Andrés Zambrano, Ryan Baker | Mar 5th, 2025 | journalArticle
This study explores the potential of the large language model GPT-4 as an automated tool for qualitative data analysis by educational researchers, exploring which techniques are most successful for different types of constructs. Specifically, we assess three different prompt engineering strategies (Zero-shot, Few-shot, and Few-shot with contextual information) as well as the use of embeddings. We do so in the context of qualitatively coding three distinct educational datasets: Algebra I...
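The embeddings strategy the abstract mentions can be sketched as nearest-neighbor lookup over embedded labeled excerpts; the codes, excerpts, and embedding model name below are placeholder assumptions.

```python
# Sketch of embeddings-based qualitative coding: embed labeled excerpts
# once, then code a new excerpt by cosine-similarity nearest neighbor.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

labeled = {"I guessed and moved on.": "gaming",          # toy examples
           "I reread the hint twice.": "help-seeking"}
emb = embed(list(labeled))

def code_excerpt(excerpt: str) -> str:
    q = embed([excerpt])[0]
    sims = emb @ q / (np.linalg.norm(emb, axis=1) * np.linalg.norm(q))
    return list(labeled.values())[int(np.argmax(sims))]
```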
- Steven Moore, Eamon Costello, Huy A. Ngu... | Jan 22nd, 2024 | bookSection
- Lei Huang, Weijiang Yu, Weitao Ma | Jan 24th, 2025 | preprint
The emergence of large language models (LLMs) has marked a significant breakthrough in natural language processing (NLP), fueling a paradigm shift in information acquisition. Nevertheless, LLMs are prone to hallucination, generating plausible yet nonfactual content. This phenomenon raises significant concerns over the reliability of LLMs in real-world information retrieval (IR) systems and has attracted intensive research to detect and mitigate such hallucinations. Given the open-ended...
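One common family of detection methods is sampling-based consistency checking: regenerate several answers and flag the claim when the samples disagree. The sketch below is a generic illustration with a toy overlap proxy, not the survey's taxonomy.

```python
# Generic sketch of sampling-based hallucination detection: if repeated
# samples rarely agree with the answer, flag it. The 0.3 threshold and
# word-overlap proxy are toy choices.
from openai import OpenAI

client = OpenAI()

def seems_hallucinated(question: str, answer: str, n: int = 5) -> bool:
    samples = [
        client.chat.completions.create(
            model="gpt-4", temperature=1.0,
            messages=[{"role": "user", "content": question}],
        ).choices[0].message.content
        for _ in range(n)
    ]
    answer_words = set(answer.lower().split())
    overlaps = [len(answer_words & set(s.lower().split())) / max(len(answer_words), 1)
                for s in samples]
    return sum(overlaps) / n < 0.3  # low agreement -> likely hallucination
```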
- Steven Moore, John Stamper, Richard Tong... | Jul 7th, 2023 | conferencePaper
- Andrew M. Olney, Steven Moore, John Stam... | Jul 7th, 2023 | conferencePaper
- Matyáš Boháček, Steven Moore, John Stamp... | Jul 7th, 2023 | conferencePaper
- Shashank Sonkar, Richard G. Baraniuk, St... | Jul 7th, 2023 | conferencePaper
- Md Rayhan Kabir, Fuhua Lin, Steven Moore... | Jul 7th, 2023 | conferencePaper