702 resources
- Cynthia Lee, Kelvin C.K. Wong, William K... | journal article | Feb 27th, 2009
- Klaus Zechner, Derrick Higgins, Xiaoming... | journal article | Oct 1st, 2007
- Anat Ben-Simon, Randy Elliot Bennett | journal article | Oct 27th, 2007
  This study evaluated a “substantively driven” method for scoring NAEP writing assessments automatically. The study used variations of an existing commercial program, e-rater®, to compare the performance of three approaches to automated essay scoring: a brute-empirical approach in which variables are selected and weighted solely according to statistical criteria, a hybrid approach in which a fixed set of variables more closely tied to the characteristics of good writing was used but the...
- Joseph P. Turian, Luke Shen, I. Dan Mela... | conference paper | Jan 1st, 2006
  Evaluation of MT evaluation measures is limited by inconsistent human judgment data. Nonetheless, machine translation can be evaluated using the well-known measures precision, recall, and their average, the F-measure. The unigram-based F-measure has significantly higher correlation with human judgments than recently proposed alternatives. More importantly, this standard measure has an intuitive graphical interpretation, which can facilitate insight into how MT systems might be improved. The...
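  As a point of reference, here is a minimal sketch of the unigram-based F-measure the abstract describes, not the authors' implementation: clipped unigram matches against a single reference give precision and recall, combined as the harmonic mean conventionally called the F-measure. The function name and whitespace tokenization are illustrative assumptions.

  ```python
  from collections import Counter

  def unigram_f_measure(candidate: str, reference: str) -> float:
      """Unigram F-measure of a candidate translation against one reference.

      Sketch only: each candidate unigram is matched at most as many
      times as it occurs in the reference (clipping), and precision
      and recall are combined as the standard harmonic mean.
      """
      cand = Counter(candidate.lower().split())
      ref = Counter(reference.lower().split())
      matches = sum((cand & ref).values())  # clipped unigram matches
      if matches == 0:
          return 0.0
      precision = matches / sum(cand.values())
      recall = matches / sum(ref.values())
      return 2 * precision * recall / (precision + recall)
  ```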
- Alon Lavie, Kenji Sagae, Shyamsundar Jay... | book section | Oct 27th, 2004
- Chin-Yew Lin, Franz Josef Och | conference paper | Oct 27th, 2004
- Paul Deane, Kathleen M. Sheehan | conference paper | Apr 27th, 2003
- George Doddington | conference paper | Oct 27th, 2002
- Lawrence M. Rudner, T. Liang | conference paper | Oct 27th, 2002
- Kishore Papineni, Salim Roukos, Todd War... | conference paper | Oct 27th, 2001
- Jill Burstein, Martin Chodorow | conference paper | Oct 27th, 1999
- Jill Burstein, Martin Chodorow | conference paper | Oct 27th, 1999
  The e-rater system™ is an operational automated essay scoring system, developed at Educational Testing Service (ETS). The average agreement between human readers, and between independent human readers and e-rater, is approximately 92%. There is much interest in the larger writing community in examining the system's performance on nonnative speaker essays. This paper focuses on results of a study that show e-rater's performance on Test of Written English (TWE) essay responses written by...
- (no author or date listed) | journal article
- Satanjeev Banerjee, Alon Lavie | journal article
  We describe METEOR, an automatic metric for machine translation evaluation that is based on a generalized concept of unigram matching between the machine produced translation and human-produced reference translations. Unigrams can be matched based on their surface forms, stemmed forms, and meanings; furthermore, METEOR can be easily extended to include more advanced matching strategies. Once all generalized unigram matches between the two strings have been found, METEOR computes a score...
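  To make the matching idea concrete, here is a minimal sketch of the surface-form stage only: the stemmed and meaning-based matching stages the abstract mentions, and METEOR's fragmentation penalty, are omitted. Whitespace tokenization is an assumption; the 9:1 recall weighting is the one reported in the original METEOR paper.

  ```python
  from collections import Counter

  def meteor_surface_score(candidate: str, reference: str) -> float:
      """Surface-form unigram stage of a METEOR-style score (sketch).

      Exact-form matches yield unigram precision and recall, which are
      combined as a recall-weighted harmonic mean. Full METEOR adds
      stem/synonym matching stages and a fragmentation penalty.
      """
      cand = candidate.lower().split()
      ref = reference.lower().split()
      matches = sum((Counter(cand) & Counter(ref)).values())
      if matches == 0:
          return 0.0
      precision = matches / len(cand)
      recall = matches / len(ref)
      # Recall-weighted harmonic mean (9:1 weighting toward recall).
      return 10 * precision * recall / (recall + 9 * precision)
  ```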
- Jill Burstein | journal article
- Lei Chen, Klaus Zechner, Su-Youn Yoon | report
- demzsky | document
- J. H. Fife | journal article
  This report provides an introduction to the m-rater™ engine, ETS’s automated scoring engine for computer-delivered constructed-response items when the response is a number, an equation (or mathematical expression), or a graph. This introduction is intended to acquaint the reader with the types of items that m-rater can score, the requirements for authoring these items onscreen, the methods m-rater uses to score these items, and the features these items must possess to be reliably scored....
- Harsh Kumar, David M. Rothschild, Daniel... | journal article
  The widespread availability of large language models (LLMs) has provoked both fear and excitement in the domain of education. On one hand, there is the concern that students will offload their coursework to LLMs, limiting what they themselves learn. On the other hand, there is the hope that LLMs might serve as scalable, personalized tutors. Here we conduct a large, pre-registered experiment involving 1200 participants to investigate how exposure to LLM-based explanations affects learning. In...
- Susan Lottridge, Amy Burkhardt, Christop... | journal article
  Every year, millions of middle-school students write argumentative essays that are evaluated against a scoring rubric. However, the scores they receive don’t necessarily offer clear guidance on how to improve their essay or what they’ve done well. With advancements in natural language processing technology, we now have the capability to provide more detailed feedback. At this juncture, we’ve developed an artificial intelligence-supported editing tool to assist students in revising their...