Truth in testing
Measuring Up: What Educational Testing Really Tells Us by Daniel Koretz. Cambridge, MA: Harvard University Press, 2008, 353 pp.
Paul W. Holland
It has been said that “few wish to be assessed, fewer still wish to assess others, but everyone wants to see the scores.” Throughout the world, tests are both extolled and disparaged, but they are very unlikely to go away because they provide information that few other sources can, at least at the same price. Tests were administered more than 5,000 years ago for civil service positions in China, and educational testing has been prevalent across the globe since the beginning of the 20th century. And although objections to testing are almost as common as the tests themselves, there seems to be no end to the ways that tests, for better or worse, affect our lives.
In Measuring Up, Daniel Koretz of Harvard’s Graduate School of Education gives a sustained and insightful explanation of testing practices ranging from sensible to senseless. Neither an attack on nor a defense of tests, the book is a balanced, accurate, and jargon-free discussion of how to understand the major issues that arise in educational testing.
This book grew out of courses Koretz teaches to graduate students who do not have strong mathematics backgrounds but who do need to know about testing as an instrument of public policy. He has succeeded admirably in producing a volume that will help such students and others who want to see through the rhetoric and posturing that currently surround testing. Moreover, the book has a wealth of useful information for more technically trained readers who may know more about the formulas of psychometrics than about the realities of how testing is used in practice.
The century-old science of testing has a conceptual core that has matured and developed along with the mathematical structures needed to implement these concepts for the many types of educational tests. This book focuses on these key concepts (not the math) and how they should be used to guide informed decisions about the use and misuse of educational tests.
Koretz’s solution to the problem of technical jargon is his effective use of familiar nontesting examples to clarify testing concepts and ideas. For example, political polls use the results from a small sample of likely voters to predict actual voting outcomes and have become ubiquitous. Most readers will have some feeling about what a poll can do and what is meant by its margin of error. Likewise, a test is a sample of information collected from a test taker that, although incomplete, can be reasonably representative of a larger domain of knowledge. Using the test/poll analogy in a variety of ways, Koretz gives a clear account of how to interpret the reliability and uncertainty that are properly attached to test scores, as well as an explanation of the limitations of the test/poll analogy. Most important, Koretz emphasizes that just as a poll has value only to the extent that it represents the totality of relevant voters, performance on a test has value only to the extent that it accurately represents a test taker’s knowledge of a larger domain.
Koretz reminds us that many important and timely principles of testing were summarized by E. F. Lindquist in 1951, but these principles have been forgotten by many current advocates of accountability testing. Lindquist regarded the goals of education as diverse and noted that only some of them were amenable to standardized testing. He also realized that although the standardization of tests is important for the clarity of what is being measured, standardization also limits what can be measured. For this reason, Lindquist warned that test results should be used in conjunction with other, often less-standardized information about students in order to make good educational decisions.
Koretz provides a very clear discussion of the pros and cons of norm-referenced tests, criterion-referenced tests, minimum competency tests, measurement-driven instruction, performance assessment, performance standards, and most of the types of assessments and their rationales that are now part of the testing landscape. He complements this with a history of U.S. testing that shows the progression from a system in which tests had very low stakes for students, schools, and teachers to an elaborate regime in which test results have high stakes for many players.
The chapter on what influences test scores brilliantly skewers the simplistic answers—“It’s the schools!” “It’s the curriculum!” “It’s the teachers!”—by explaining how student-related characteristics such as parental education, parental aspirations and support for their children’s education, and student health, motivation, ethnic background, and cultural factors together have at least as much influence on scores. Koretz also provides an enlightening account of what social and economic status means and how it indirectly influences test scores, as well as a useful summary of the ways in which scientists make sound causal conclusions when faced with nonexperimental data (which is what simple comparisons of scores provide), with an emphasis on excluding plausible rival hypotheses. Koretz points out that although we often ask for expert advice regarding our children’s health or even our car problems and are willing to accept complex answers, we are often satisfied with simplistic claims when it comes to educational decisions: “Isn’t the best school the one with the best scores?” I propose that reading this chapter be made a requirement for education pundits.
The book’s discussion of international comparisons such as the Trends in International Mathematics and Science Study and the Programme for International Student Assessment debunks the belief that socially homogeneous countries have less variation in test scores across students than do socially heterogeneous countries. The spread of scores is about the same in many countries that differ significantly in demographic diversity and average test scores. This fact is ignored by those who want to see all students achieve high scores. The admirable goal of universal success, which is implicit in the No Child Left Behind requirements, is simply not realistic.
In his discussion of testing students with special needs, Koretz unapologetically uses the politically incorrect term “limited English-proficient” because it is more descriptive of the problem that testing must address than the current euphemism of “English language learners.” As a former special education teacher, he strongly supports efforts to include students with special needs in the general curriculum, but he is less sure that our current knowledge gives us good guidance on how to include all such students in statewide or national educational assessments. He uses his own experience of being “limited Hebrew-proficient” as a young adult in Israel to personalize the issues that surround the problems of testing students in languages other than their primary language.
Koretz cites some of his own research in his chapter on the little-discussed problem of inflated test scores. He describes a few careful studies showing that high stakes often do lead to inflated scores; but more important, Koretz points out that this phenomenon is merely an example of Campbell’s Law: The more a quantitative social indicator (an average test score) is used for social decision making (teacher salaries or school closings), the more likely it is to be corrupted and to distort the very social processes it is intended to measure. In fact, it would be unusual if high-stakes testing did not tend to corrupt scores. In keeping with his use of noneducational examples to clarify his points, Koretz gives other examples of Campbell’s Law, such as the tactics used to distort airline “on-time” rates and postal service delivery times.
According to Koretz, the common (and often staunchly defended) way to undermine the value of a test is “teaching to the test,” which results in performance on the test no longer being representative of a larger domain of knowledge. This practice also tends to reallocate teaching time from subjects not on the test to those that are included. He recounts his own experience of the state of denial displayed by many high-stakes testing advocates when they use the excuse that “the perfect should not be the enemy of the good.” Koretz argues that dramatic increases in state test scores over time are often due to score inflation, especially when the test results have important consequences for teachers, schools, principals, and districts. Score inflation gives the illusion of progress while cheating students who deserve better and more effective schooling.
I found only two faults with Koretz’s book, one trivial, one substantive. Trivially, he clearly does not know how the National Assessment of Educational Progress “pantyhose” charts earned their nickname. The designation comes from their similarity in appearance to the sizing charts on pantyhose packages, not an obscure connection to runs in stockings. Substantively, in his discussion of how performance standards are set, he misses an opportunity to show just how counterintuitive performance standards are. Koretz mentions only those standard-setting methods in which majority rule is used to determine the “cut points” between designations such as basic and proficient. The counterintuitive fact is that a cut point should be set at a performance level where disagreement is maximal; that is, where the number of people arguing for each of the two adjacent designations is largest.
In the final chapter, Koretz gives thoughtful and reliable advice on the sensible uses of tests. Start by reminding yourself that scores describe some of what students can do, but they don’t describe all they can do and, most important, scores don’t explain why they can or cannot do it. His admonition that “decision makers should determine what goals are most important for a test and then accept the fact that the result will cost them in terms of other goals” encodes much of what this book is about. Is the goal to improve instruction in a particular topic or to find the poor teachers and improve them? Will a test created for one purpose do well for another? Probably not. A test created to provide useful information about groups of students is probably not very good at giving good information about any single student. Testing advocates who do not understand these ideas need to read Koretz’s book, hopefully before doing anything (else) silly.
Paul W. Holland ([email protected]) is Frederic M. Lord Chair in Measurement and Statistics Emeritus at the Educational Testing Service.