Testing, Assessment, and Excellence by John Merrow

an excerpt from Choosing Excellence: "Good Enough" Schools Are Not Good Enough (Scarecrow Press, 2001)

John Merrow has been reporting on education since 1974. He is the executive producer and host of The Merrow Report on both PBS and NPR, and president of Learning Matters, Inc., the non-profit organization that produces The Merrow Report's television, radio, and Web programs.

Excellent schools are accountable for their "products," the students who pass through their classrooms, but what exactly do we mean by the term accountable? This chapter explores the meaning of accountability, the growth of the standards movement, and an accompanying rise in what is called "high-stakes testing." To be forthright, I believe that high-stakes testing, in its current manifestation, is a serious threat to excellence and national standards. Unchecked, it will choke the life out of many excellent schools and drive gifted teachers out of classrooms. Unchecked, it will lead to debased and unnecessarily low standards.

High-stakes tests have serious consequences for those taking them, and sometimes for the careers of their teachers and administrators. A good example is the high school graduation test that students must pass in order to get a diploma. By the turn of the century, 28 states either already had or planned to have such tests.

A more rational approach is broad-based assessment, which involves multiple measures of what a student has learned. Assessment relies on teacher-made tests, teacher evaluations, student demonstrations, etc., all over an extended period of time, instead of one score on a single, largely machine-scored test (even if it includes a writing test). Unfortunately, the supporters of high-stakes testing have more faith in machines than they do in teachers.

The mad rush to embrace high-stakes testing says to me that we are now reaping what years of superficial indifference have sown. That is, for years educators have not held themselves accountable, so now business leaders and politicians are creating systems to hold schools accountable. As I will explain, the move to create standards is out of synch, and we're now testing with a vengeance, before the system has had time to get ready. ...

Behind the Standards

The push for standards began in force back in 1989, when President George Bush called the nation's governors together for the first-ever National Education Summit, held on the campus of the University of Virginia. Out of that largely theatrical meeting came a set of national education goals, some of which were actually written in the White House basement months later, some of which had been decided upon beforehand.

Goals begot standards, and here the White House has taken a back seat. The prime mover behind standards has been IBM's Louis V. Gerstner Jr., the prominent businessman who has been an education reformer for more than 20 years. Gerstner was the principal organizer behind the National Education Summit meetings in 1996 and 1999, meetings that involved nearly every state governor, America's business leaders, and President Clinton. Gerstner is aware of the growing backlash against high-stakes testing, but he's not backing off. "We can't slow down, because we hurt everybody when we slow down."

I asked him about fears that some children are being put at a disadvantage when new, tougher standards are suddenly imposed. He was visibly annoyed. "The argument that we shouldn't put standards in because some children are going to be hurt because they're not going to pass a test is fallacious. Children are being hurt today because they're passing the tests, and the tests are not recognizing that they cannot do what they need to do."

Reformer Ted Sizer is concerned not only about the speed with which the standards movement is moving but also about the driving force behind it, American business. "Business has put a lot of time into this, but in ways that have been simplistic," Sizer says. "But why should we be surprised, because if I were to presume that I could move in and run IBM, I'd get most of it wrong, because I don't know enough." The process, Sizer says, has been arrogant and costly, and the notion that there is one best curriculum, to be decided upon by a small group, is dangerous. "Do we want a small group of people deciding what is supposed to go into our children's heads? I don't think so."

Gerstner laughed when I told him about Sizer's concerns. "Forty-nine states are setting standards, and Iowa is doing it in every local community. So we'll have 60 or 70 or 80 institutions in our society creating these standards, and every one of them is different." Gerstner says that in every state parents, educators, business leaders, and others are involved, including experts on standards in other countries.

Voting for standards is a lot easier than actually creating them. That task involves two types of standards: content standards and performance standards. Some person or group must decide on content: what, for example, eleventh-graders should master in English. Let's say the group agrees that eleventh-graders must be able to present a complex argument persuasively and must be familiar with drama, poetry, and fiction. Let's go further and say that they also agree that eleventh-graders should read and be able to understand a Shakespearean play. Assuming that they've gotten that far (and that's a big assumption, considering the cultural climate we live in), they're only halfway home. Now it's time to decide what levels of performance are "satisfactory," "outstanding," and "unsatisfactory." How much of that play does the eleventh-grader have to grasp to meet the new standard, and what does a satisfactory essay look like? These questions are neither trivial nor easy to answer.

We now enter the arbitrary process of standard-setting. Just what standards are established depends on who is asked. Each expert will have an idea of what is acceptable, outstanding, and insufficient. Are these ideas to be given arbitrary numerical weights and then averaged? Somehow a number is arrived at, and that number immediately takes on magical qualities -- it is what a student must achieve to pass, or to be promoted to the next grade, or to graduate.

The next steps are not trivial or automatic either. The curriculum must be adjusted, and then teachers have to be brought up to speed. Teachers who've grown accustomed to teaching certain materials in set ways may have to, in effect, start over. This may be beneficial for all concerned in the long run, but it will not happen overnight.

Multiply that scenario by the number of grade levels, the number of subjects, and the number of teachers (3,000,000), and you begin to understand the swamp that educators, politicians, business leaders, and others have waded into. And you also may begin to understand why testing has gotten ahead of developing and then implementing standards in many places.

Gerstner is clear about what needs to be done, even as he acknowledges that there will be casualties. "We need to make very significant investments now, to protect the people that will get hurt, because we're imposing a new system of high standards in an environment where there weren't any standards, and some children are going to get caught in the middle. How do we help them? Massive after school and summer training programs. We need to fund those. We need to train teachers to develop ways to bring these students up quickly." ...

Today we are rushing headlong in search of the "Holy Grail" of rising test scores. What seems to be happening is that the high-stakes testing movement has picked up momentum and gotten well ahead of the slower process of developing and implementing standards. Most policymakers are not as sophisticated as Gerstner, and many unfortunate decisions are being made as pressure for "accountability" overwhelms common sense. It's a whole lot easier to give a test than to do the hard work of retraining teachers and preparing students. ...

What's a Passing Score Anyway?

When decisions are made on the basis of a single test, teacher judgment is tossed out the window, along with a student's past performance. ... One Chicago parent told me tearfully about her son having to go to summer school. "He's a B student, but he missed the cutoff score by half a point on the test, because he was so nervous," she said.

Setting cutoff scores ("cut scores" in the language of testing) is an inexact science at best. That number which seems so firm and final may in fact be wholly arbitrary and subjective. Why is a 65 passing, and a 64.5 failing? Who made that decision, and on what basis? To George Madaus of Boston College, these situations are "obscene." "We're just kidding ourselves," he told me. "The technology is nowhere near being so precise that accurate decisions can be made on the basis of one or two points, one way or the other."

This is not just test-bashing. As Bob Sexton of Kentucky's Prichard Committee (responsible for monitoring that state's reforms) says, "Test bashing with no alternative will likely lead to weaker not stronger public schools and give those who are opposed to improvement (such as ideologues or resistant educators) exactly what they want -- the status quo."

Having educational standards -- as opposed to not having them -- makes sense, of course, and most of the public seems to be enthusiastically behind the drive to create meaningful standards and curriculum that is aligned with those standards. But an unofficial "coalition" of frustrated business leaders, misguided politicians, short-sighted citizens, and ideologues is pushing us headlong toward the dangerous practice of making decisions based on single scores on tests that those taking them have not had the opportunity to prepare for.

In too many schools (like my daughter's), students and their teachers are not given a choice: It's pass the high-stakes achievement test or suffer the consequences. I believe that the trend toward high-stakes testing, and the related mind-numbing drill-drill-drill that often accompanies it, is behind the growth in private schools and home schooling.

Defenders of high-stakes tests argue that they are fair because students have multiple opportunities to pass them. Often this is true, but it doesn't change matters one iota. ... What we need are multiple measures, not multiple opportunities. ...

Behind the Muddle

The idea that student performance on standardized, norm-referenced, machine-scored tests is the primary indicator of school quality, and the principal measure of accountability, has been with us for about 40 years. It shows few signs of going away. We've grown accustomed to international, national, state, and local comparisons based on test scores, and we rarely look into the "why" of a number. Some reformers talk bravely about using other indicators of quality, such as attendance and dropout rates, college attendance, and teacher turnover, but at the end of the day test scores seem to push everything else aside.

American students are tested far more than their counterparts in other industrialized nations. Our elementary and secondary school students took more than 140 million standardized, machine-scored, multiple-choice tests in 1998, and 42 states mandate standardized testing. Eighth-graders are tested most often, with third- and fourth-graders just behind. Poor children face more of these tests than middle-class kids, in part because federal programs mandate testing. Monty Neill, director of an anti-testing group called FairTest, told me about one city's excesses. "At one time the city of Newark was testing the kids monthly. Every kid was tested virtually once a month."

The cruel irony is that more testing actually produces more reliable, and therefore more valuable, information. That's because there can be so much variability in individual test results, meaning your child's score may vary by large margins from one day to the next. Stanford University statistician David Rogosa has calculated that if the average fourth-grader were to take the widely used Stanford Nine (sometimes called the SAT 9) twice, he would have a 43 percent chance of having scores that are more than 10 percentile points apart. That is, he could be in the 75th percentile on one day and in the 60th the next. "If you could give the test a lot of times and take the average score, that would be approaching a gold-standard measurement," said Rogosa, who recently published an accuracy guide to the Stanford Nine. "In testing, because it's so expensive, we only get one shot."

According to the same story, the Orange County School District has informed parents that a student who ranks at the 50th percentile in reading actually could belong somewhere from the 40th to the 60th percentile, for example. Orange County spent at least $28 million last year to run the Stanford Nine test.
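Rogosa's point about averaging can be sketched with a little arithmetic. What follows is a rough illustration, not his model: it assumes each sitting yields the student's true score plus independent, normally distributed error, and it uses a hypothetical error of 9 points, picked only because under this simple model it puts the two-sitting disagreement near the 43 percent quoted above. The function name and every constant are assumptions made for the example.

```python
from math import sqrt
from statistics import NormalDist


def chance_scores_differ(sem: float, gap: float, sittings: int = 1) -> float:
    """Probability that two results differ by more than `gap` points.

    Each result is the average of `sittings` test sittings, and each sitting
    is modeled as the true score plus independent normal error with standard
    deviation `sem` (a simplifying assumption, not a property of any real test).
    """
    # Averaging n sittings shrinks the error of one result by sqrt(n).
    error_of_result = sem / sqrt(sittings)
    # The difference of two independent results has sqrt(2) times that error.
    error_of_difference = error_of_result * sqrt(2)
    difference = NormalDist(mu=0.0, sigma=error_of_difference)
    # Two-sided tail: the gap can open in either direction.
    return 2.0 * (1.0 - difference.cdf(gap))


ASSUMED_SEM = 9.0  # hypothetical measurement error, in points

for n in (1, 4, 16):
    p = chance_scores_differ(ASSUMED_SEM, gap=10.0, sittings=n)
    print(f"average of {n:2d} sitting(s): {p:.0%} chance of a 10-point gap")
```

Under these assumptions the chance of a 10-point swing falls quickly as more sittings are averaged, which is the sense in which repeated measurement approaches a "gold standard" and a single sitting does not.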

Richard Rothstein in the New York Times compared high-stakes testing to evaluating a baseball player's season with his performance in a randomly chosen week of the season, instead of his total performance. We don't do that in athletics for the most part, so why are we willing to treat our kids that way?

A single number spit out by a machine is powerful and seductive (even if some small portion of that test involves writing, which is graded by humans, not machines). What's more, that number is easy to understand. The fact that it is inevitably misleading does not seem to count for very much. ...

But the fundamental problem is that many schools and school districts use standardized test results more for accountability than for understanding or diagnosis. I'm not blaming educators for this situation, because they're only following orders.

H. D. Hoover of the University of Iowa defends testing but agrees we've gone overboard. He places the blame squarely on politicians. "They want quick fixes, and they like tests because they're cheap. They mandate external tests because to the public it looks like they're doing something about education when all they're doing is actually a very inexpensive 'quick fix.'"

Hayes Mizell, the thoughtful director of the Program for Student Achievement at the Edna McConnell Clark Foundation, has a different view. He says that educators have come to expect others outside the schools to hold them accountable, instead of taking the initiative and holding themselves accountable. "They obsess over their students' performance on the state test, rather than over what their students really know and can do." He argues that educators ought to find and present school-based evidence, rather than obsessing over state tests and allowing those standardized, multiple-choice, machine-scored instruments to be the ultimate yardstick. He concludes, "Perhaps it is unrealistic to think that public education can do better, but I worry that if educators are focused more on their accountability to the state or school district than on their accountability to their students, their internal professionalism will wither."

Testing is not evil, of course. A primary purpose of school is academic learning, and we must know whether, and how much, students are learning. Well-made tests are an excellent way to measure learning and diagnose weaknesses. Excellent teachers create good tests, grade them carefully, and get them back to their students in a matter of days. Parents searching for excellence would be wise to ask the better students to describe the kinds of tests they take and the lag time between taking the test and getting it back.

Machine-scored, multiple-choice tests are rarely the best descriptive tool, and, as noted earlier, they're usually not intended as such. George Madaus of the National Board on Educational Testing and Public Policy at Boston College sums up the situation this way. "There are only three ways to test people. I can have you select an answer from a list -- that's a multiple-choice. Second, I can ask you to produce an answer in essay form. Third, I can ask you to do something -- fix a carburetor, or do a dive off a diving board, whatever -- and I can rate you on it."

In our adult lives, most of us take the third kind of test, that is, we're evaluated on performance. Some schools, notably those inspired by the work of Theodore Sizer's Coalition of Essential Schools, require students to demonstrate their mastery, by standing up in front of a group of adults or their own peers to "exhibit" what they have learned.

That's a far better way of evaluating, describing, and diagnosing, but it's also time-consuming and expensive, which means that it's unlikely to ever be more important than machine-scored, multiple-choice tests.

I am not criticizing standardized tests, because standardization is the key to fairness. When a test is standardized, it simply means that everyone has to take it under the same conditions. That is, you and I have to answer the same questions, in the same amount of time. As H. D. Hoover of Iowa notes, "It would not be fair to make comparisons if one student has three days to complete the test, and another has only ten minutes. Or if one student has the test questions read to him, while the other does not." Properly used, standardized tests are a source of useful information that helps teachers do a better job.

Tests, whether standardized or teacher-made, must also be both valid and reliable. Both adjectives are technical terms in testing. Valid tests measure what they are supposed to. For example, actually performing a series of dives would be a valid test of one's diving ability, while writing an essay about how these dives are performed or taking a multiple-choice test about diving would not. A test is reliable if it can be trusted (relied upon) to produce the same score, or nearly so, when it is given to the same group or individual again.

The argument is against multiple-choice tests and their impact on the curriculum. George Madaus supports testing, but he's well aware of the weaknesses of multiple-choice questions. "The adults who write the questions sometimes lose sight of the way kids will read those questions. There's a standardized test question that shows a cactus in a pot, a rose in a pot, and a cabbage, and the question is which needs the least amount of water. To the item-writer 'cactus' was the right answer, but some kids pick the cabbage. And the reason they gave was that the cabbage had been picked and so it didn't need water anymore. That's a perfectly good answer, but the machine had been set to score it as wrong." Students who get the "right" answer have demonstrated, perhaps, that they think like an adult -- or like the test-maker. The kid who thinks differently, or whose frame of reference is different, is marked down, and perhaps eliminated from the competition.

As Sizer notes, the real world is not "a series of set, pre-digested answers" but a set of questions. "Take the issue of cloning, an issue so difficult that very few teachers want to talk about it, or know how to talk about it. Cloning raises all sorts of difficult questions. What are the right answers? That can't be put on a multiple choice test."

"At the worst," Sizer adds, "these standardized tests provoke a kind of drilling mentality. It's a game. And so students learn the game. What they learn is to hire people to teach you how to figure the test out. Not the substance, but the test."

And that teaches cynicism. "The lesson learned is 'to get a high score, this is what you have to do. If you want to get ahead in life, jiggle the system.' And that's anti-intellectual and pernicious."

High-Stakes Testing

High-stakes tests now begin as early as first and second grades. Where high-stakes tests are being imposed by states, they have thrown many teachers and students into a state of anxiety. This is counter-productive, says E. D. Hirsch Jr. "They start prepping for tests, cramming for tests, and teaching for tests, and none of these things are educationally productive. My hope is that's a transitional phase." Hirsch believes in testing but not in studying for the test. "If the tests are good," he says, "the way to prepare for them is to have a good education. High test scores are a by-product of a good education."

I've seen firsthand what an obsession with tests and test scores does to real learning. In the early '90s I spent three years at Woodward High School in Cincinnati, watching (and videotaping) as some teachers there attempted to adopt the philosophy and practices of the reform known as "The Coalition of Essential Schools." The ideas are easy to grasp: (1) "Less is more," meaning that students will dig deeply into a small number of topics instead of taking a broad survey approach; (2) for the most part teachers will not lecture but will guide and encourage learning; (3) teachers will work together across disciplines; and (4) students will be evaluated on the basis of their collected work (portfolios) and public demonstration of their knowledge.

For nearly two years the reform effort proceeded like most reforms, two steps forward and a step-and-a-half back, as students grudgingly learned that they couldn't merely regurgitate whatever the teacher said and expect to pass.

In response to growing public dissatisfaction with school outcomes, the state of Ohio had instituted its own exam, a high-stakes test that students had to pass to graduate. The test did not set the bar very high, and students were given at least eight chances to pass, beginning in tenth grade. Nevertheless, as test day drew near, the reform simply stopped in its tracks. No longer were teachers encouraging students to ask questions, to dig, to work together on projects, and to stand in front of their classes demonstrating their knowledge. Instead, class time was given over to drill. "How many branches of government are there?" "Which branch institutes new laws?" "What is the role of the Executive Branch?" If I'd listened long enough, I'm sure I would have heard some teacher ask, "How many wives did Henry the Eighth have?"

The kids got the message: all this fancy talk about portfolios and demonstrations is just so much gas. Coalition teachers grew dispirited, and those Woodward High faculty members who preferred the old ways grew stronger in their resolve to keep on doing things in the same old ways. ...

Some Sensible Approaches

The last word on accountability belongs to others. We need a clearer understanding of accountability and we need more measures of school outcomes that are not simply test-based. For example, apart from test scores, shouldn't we also seek to know how many young people finish school and graduate? How many continue their education after high school? How many students are put into special education classes and never leave them? What are the attendance rates for both teachers and students? The list of possible and knowable outcome measures goes on and on, but instead we seem to be willing to settle for that one simple test score.

Walt Haney of Boston College reminds us that accountability refers not only to consequences but also to conduct, by which he means what actually happens inside schools between children and teachers. These transactions are tougher to measure; they don't lend themselves to easy numbers, but surely they count.

Emphasizing testing may eventually drive parents away from public schools, particularly from high-quality schools. Monty Neill of FairTest says he's already hearing that from parents in Massachusetts, where his organization is located. "This is not a major factor yet, but it could become one, as parents who think they have quality schools (and often do) recognize the damage that is done when schooling revolves around the search for higher test scores."

I may get in trouble here, because I want to suggest what constitutes excellence in testing and assessment. First of all, an excellent testing policy is transparent, that is, it is open for inspection by all who are interested and it is presented in clear English. It is understandable and defensible. It is connected to the curriculum and the goals of the school.

Excellent teachers have such policies. They explain in advance to students just what is expected, how they will be assessed, and why.

Excellent schools do not -- repeat, do not -- attempt to evaluate, promote, or hold back students on the basis of a single test, particularly a machine-scored, multiple-choice exam. That is, they reject high-stakes testing insofar as it is possible.

Teacher-made tests, constructed by excellent teachers, remain the best means of assessing student progress and weakness. The best teacher I ever had routinely tested his students by having us write short essays. He called them "2-8-2s" because we were given a topic and two minutes to think about it, then exactly eight minutes to write, followed by two minutes to make changes and corrections. He made the rules very clear: A major error (such as a sentence fragment) meant a grade of "zero." We could expect a 2-8-2 at least two or three times a week, but at the end of the semester he would discard our 10 lowest grades. We would get the papers back the next day! Often we wrote on some aspect of the play or poem we were studying, but he was just as likely to give us an obscure quote and instruct us to reflect on it.

These were excellent tests, aligned with his curriculum and the goals of his class. His policy was transparent. Without using any of today's jargon, he made clear to us what his standards (of content and performance) were, he allowed no lag time between the test and its return, and he used the results to diagnose and correct our weaknesses. That is excellence at work, and it is what all schools should strive for.

Holding schools or (especially) students accountable almost solely on the basis of student scores on machine-scored tests establishes a "whips and chains" system. When we do that, we're using tests as a weapon, nothing more.

Questions Worth Asking

  • What is the district's policy on high-stakes testing? If no policy, why not? How many machine-scored, multiple-choice tests will my child take each year? Are these "high-stakes" tests, and if so what is at stake?
  • Who mandates these tests (state, county, the National Assessment of Educational Progress)?
  • Do the results have an impact on specific children, or is it the school that is being measured and rated?
  • If individual students are not being evaluated, has any consideration been given to testing only a sample of students? After all, political pollsters question only a small sample of voters and predict results with uncanny accuracy. Why not take the same approach with educational testing that measures a school's or a system's health?
  • How much time is devoted to test preparation and practice?
  • How long does it take for the teachers to get the results, and are they returned in usable form?
  • How are the results used? Does the school share the results with students, parents, and the community?
  • Are the data carefully analyzed (disaggregated) to make sure that all students are learning? (A few high scores by outstanding students can give a misleading picture about the overall health of the system.)
  • Are students doing better over time? That is, do the data indicate that the longer a student goes to school, the more he or she learns (or the opposite)? Do teachers use a variety of assessments, including portfolios and exhibitions?
  • How significant are machine-scored tests as a part of a student's semester and final grades?
  • How much money is the district spending on outside testing? Does this dollar amount include the cost of the time that teachers and administrators spend on the tests and test preparation?
  • What are the district, the school, and individual teachers not doing because of these tests?
  • How does any particular test influence the curriculum? Has anyone in authority explored the influences of tests on the curriculum? If not, why not? Do teachers rely on their own tests for the most part, or do they use instruments created by others?
  • Is the student-teacher ratio sufficiently low to allow teachers time to create their own tests, grade them thoroughly, and discuss the results with students?
  • Regarding testing, how much can one child's score vary?
  • What are the academic standards for each grade? (Some parents will be astounded that some kids were reading and writing in kindergarten.)
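
The sampling question in the list above can be made concrete with the same arithmetic pollsters use. This is only a sketch under stated assumptions: a hypothetical district of 10,000 students, a simple random sample, and a single yes-or-no outcome (the share of students meeting a standard). The function name and the numbers are illustrative and do not describe any real district.

```python
from math import sqrt


def margin_of_error(sample_size: int, population: int,
                    proportion: float = 0.5, z: float = 1.96) -> float:
    """Approximate 95% margin of error for a proportion from a simple random sample."""
    # Standard error of a sample proportion; 0.5 is the worst case.
    standard_error = sqrt(proportion * (1.0 - proportion) / sample_size)
    # Finite population correction: sampling a large share of a small
    # population leaves less uncertainty.
    correction = sqrt((population - sample_size) / (population - 1))
    return z * standard_error * correction


DISTRICT_STUDENTS = 10_000  # hypothetical enrollment

for tested in (100, 400, 1_600):
    moe = margin_of_error(tested, DISTRICT_STUDENTS)
    print(f"test {tested:5d} students: pass rate known to within +/- {moe:.1%}")
```

Under these assumptions, testing a few hundred students already pins down a district-wide rate to within a few percentage points. That says something about the health of a school or system, not about any individual child.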

Excerpted from Choosing Excellence: "Good Enough" Schools Are Not Good Enough (Scarecrow Press, 2001) by John Merrow. © 2001 by John Merrow. All rights reserved. Reprinted by permission of the author.

See the Merrow Report website for information on how to order Choosing Excellence.

some photographs ©2002 getty images all rights reserved
web site copyright WGBH educational foundation
