ITP | Are Your AI Scores Good Enough?

December 3, 2021, Noon-1:30 pm Central Time

259 Educational Sciences

Daniel McCaffrey

Educational Testing Service

Use of computers to score performances from standardized evaluations, such as using artificial intelligence (AI) and natural language processing (NLP) to rate written or spoken responses to standardized test items, is growing rapidly in popularity. These methods now produce operational scores in many testing contexts. For example, tens of millions of responses from elementary and secondary students are scored using computer-based automated scoring, and states such as Ohio (Ohio Department of Education, 2018) are moving toward having all student responses from their elementary and secondary school testing programs scored by such methods. Moreover, each year millions of responses from high-stakes tests such as the GRE®, TOEFL®, the Duolingo English Test, the Pearson Test of English, and the Pearson Test of English Academic are also scored by automated methods.

Use of AI scoring for assessments invariably raises questions about whether the scores can support the claims made for the items and the tests, and about the fairness of the scores. Typically, evaluation of scores involves statistical analyses of the agreement between AI scores and human ratings of the same constructed responses, or of the accuracy of the AI scores as predictors of the human ratings. In this talk I will discuss an alternative framework for evaluating AI scores that focuses on building evidence to support claims about the scores. I will discuss how to use statistical analysis of AI scores within this framework, along with methods for assessing the fairness of scores.
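The abstract mentions statistical analyses of agreement between AI scores and human ratings. One agreement statistic commonly used in automated-scoring evaluation (not necessarily the one the talk covers) is quadratic weighted kappa, which credits near-misses on an ordinal score scale more than large disagreements. A minimal sketch, with invented example scores on a hypothetical 1–4 scale:

```python
def quadratic_weighted_kappa(human, ai, min_score, max_score):
    """Chance-corrected agreement between two raters on an ordinal scale."""
    k = max_score - min_score + 1
    n = len(human)
    # Observed joint distribution of (human, AI) score pairs.
    obs = [[0.0] * k for _ in range(k)]
    for h, a in zip(human, ai):
        obs[h - min_score][a - min_score] += 1
    # Marginal totals give the chance-expected joint counts.
    h_marg = [sum(row) for row in obs]
    a_marg = [sum(col) for col in zip(*obs)]
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            w = ((i - j) ** 2) / ((k - 1) ** 2)  # quadratic disagreement weight
            num += w * obs[i][j]
            den += w * h_marg[i] * a_marg[j] / n
    return 1.0 - num / den

# Hypothetical human ratings and AI scores for ten responses.
human = [1, 2, 3, 4, 3, 2, 1, 4, 3, 2]
ai    = [1, 2, 3, 3, 3, 2, 2, 4, 3, 2]
print(round(quadratic_weighted_kappa(human, ai, 1, 4), 3))  # → 0.882
```

A value near 1 indicates agreement well above chance; the talk's point is that such a statistic alone does not establish that the scores support a test's claims.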