59 resources
-
Wenchao Li, Haitao Liu | Jun 3rd, 2024 | journal article
Recent advancements in artificial intelligence (AI) have led to an increased use of large language models (LLMs) for language assessment tasks such as automated essay scoring (AES), automated listening tests, and automated oral proficiency assessments. The application of LLMs for AES in the context of non-native Japanese, however, remains limited. This study explores the potential of LLM-based AES by comparing the efficiency of different models, i.e. two conventional machine training...
-
Liu | Aug 22nd, 2024 | conference paper
-
Weizhe Yuan, Graham Neubig, Pengfei Liu,... | Jan 22nd, 2021 | journal article
A wide variety of NLP applications, such as machine translation, summarization, and dialog, involve text generation. One major challenge for these applications is how to evaluate whether such generated texts are actually fluent, accurate, or effective. In this work, we conceptualize the evaluation of generated text as a text generation problem, modeled using pre-trained sequence-to-sequence models. The general idea is that models trained to convert the generated text to/from a reference...
-
Kate Nolan, Youngsoon Kang, Ran Liu | Jan 22nd, 2025 | presentation
-
Yunting Liu, Shreya Bhandari, Zachary A.... | Feb 24th, 2025 | journal article
Effective educational measurement relies heavily on the curation of well‐designed item pools. However, item calibration is time consuming and costly, requiring a sufficient number of respondents to estimate the psychometric properties of items. In this study, we explore the potential of six different large language models (LLMs; GPT‐3.5, GPT‐4, Llama 2, Llama 3, Gemini‐Pro and Cohere Command R Plus) to generate responses with psychometric properties comparable to those of human respondents....
-
Ming Zhong, Yang Liu, Da Yin | Jan 22nd, 2022 | preprint
Multi-dimensional evaluation is the dominant paradigm for human evaluation in Natural Language Generation (NLG), i.e., evaluating the generated text from multiple explainable dimensions, such as coherence and fluency. However, automatic evaluation in NLG is still dominated by similarity-based metrics, and we lack a reliable framework for a more comprehensive evaluation of advanced models. In this paper, we propose a unified multi-dimensional evaluator UniEval for NLG. We re-frame NLG...
-
Matthew S. Johnson, Xiang Liu, Daniel F.... | Sep 22nd, 2022 | journal article
-
Shashank Sonkar, Naiming Liu, Debshila M... | Jan 22nd, 2023 | conference paper
-
Yue Huang, Corey Palermo, Ruitao Liu | Aug 27th, 2025 | journal article
-
Zheng Chu, Jingchang Chen, Qianglong Che... | Jan 22nd, 2024 | preprint
Reasoning, a fundamental cognitive process integral to human intelligence, has garnered substantial interest within artificial intelligence. Notably, recent studies have revealed that chain-of-thought prompting significantly enhances LLMs' reasoning capabilities, which has attracted widespread attention from both academia and industry. In this paper, we systematically investigate relevant research, summarizing advanced methods through a meticulous taxonomy that offers novel perspectives....
-
Jinlan Fu, See-Kiong Ng, Zhengbao Jiang,... | Jan 22nd, 2023 | journal article
Generative Artificial Intelligence (AI) has enabled the development of sophisticated models that are capable of producing high-caliber text, images, and other outputs through the utilization of large pre-trained models. Nevertheless, assessing the quality of the generation is an even more arduous task than the generation itself, and this issue has not been given adequate consideration recently. This paper proposes a novel evaluation framework, GPTScore, which utilizes the emergent abilities...
-
Cong Wang, Xiufeng Liu, Lei Wang | Apr 22nd, 2021 | journal article
-
Cong Wang, Xiufeng Liu, Lei Wang | Sep 9th, 2020 | journal article
-
Gyeong-Geon Lee, Ehsan Latif, Xuansheng ... | Jan 22nd, 2023 | journal article
This study investigates the application of large language models (LLMs), specifically GPT-3.5 and GPT-4, with Chain-of-Thought (CoT) in the automatic scoring of student-written responses to science assessments. We focused on overcoming the challenges of accessibility, technical complexity, and lack of explainability that have previously limited the use of artificial intelligence-based automatic scoring tools among researchers and educators. With a testing dataset comprising six assessment...
-
Annenberg Institute at Brown... | Jun 3rd, 2023 | report
Providing consistent, individualized feedback to teachers is essential for improving instruction but can be prohibitively resource-intensive in most educational contexts. We develop M-Powering Teachers, an automated tool based on natural language processing to give teachers feedback on their uptake of student contributions, a high-leverage dialogic teaching practice that makes students feel heard. We conduct a randomized controlled trial in an online computer science course (n=1,136...
-
Gyeong-Geon Lee, Ehsan Latif, Xuansheng ... | Jun 22nd, 2024 | preprint
This study investigates the application of large language models (LLMs), specifically GPT-3.5 and GPT-4, with Chain-of-Thought (CoT) in the automatic scoring of student-written responses to science assessments. We focused on overcoming the challenges of accessibility, technical complexity, and lack of explainability that have previously limited the use of artificial intelligence-based automatic scoring tools among researchers and educators. With a testing dataset comprising six assessment...