Interviews - George Madaus | Testing Our Schools | FRONTLINE


George Madaus is a professor of education and public policy and a senior fellow with The National Board on Educational Testing and Public Policy at Boston College. He has analyzed the testing industry for more than 30 years. Madaus tells FRONTLINE that tests can and should be used to hold schools accountable, but not students, and that it is "bad practice" to judge a student's performance on the basis of test scores alone. This interview was conducted by correspondent John Merrow on May 24, 2001.

The thing that people talk about now in terms of measurement is this idea of a standard error of measurement. It sounds as if what they're saying is that every test, every score, has some kind of error built into it.

... I always looked on testing as a technology. It fits any definition you want of a technology. It has underlying algorithms; it has paper and pencil answer sheets in scoring things. But it's a fallible technology, and like all technologies, there are places where it can break down.

Some of the error might be built into the test. For example, an item that is mis-keyed can get on the test. That's one source of error. The test can be too long. That might be another source of error. It can be a hot day, and the air conditioning breaks down. That can be another source of error. For an individual kid, the kid had a fight with the parents that morning or is sick, is coming down with the flu. That's another source of error.

All of these kinds of errors affect what the kid does on the test. ... And all of that contributes to the person's performance, not adequately representing what the kid knows or is able to do. And there are ways to estimate error.

I think back on the elections. Whenever there's a campaign, they'll release the poll and they'll say, "Madaus is ahead of Merrow, 46 percent, 42 percent." And then they'll say, "And the error is ..."

... I like to think of it also in terms of medical tests. I asked my cardiologist one time, "What's the error associated with my cholesterol?" She said, "About 20 percent." ... In medical tests, they know what those errors are. What they do there [in medicine] that we don't do in education, they then go out and get other measures and they put all this information together. And then clinical judgment enters in and a treatment is either given or may be counterindicated.

In education, we tend to take that score and ... act on it, and not necessarily get other indices in these high-stakes testing programs.

And that's not the right thing to do?

No, I don't think it is. ... [Say] you get a 220 [on a state test]. I know right away that that can be, 67 times out of 100, a 215 or a 225. ... It isn't this precision [measurement] that people think it is. There's a range in which your true score falls. A true score is if I could test you over and over and over again and I could estimate what your true performance level is. OK? It's a construct. It's an imaginary thing. ...
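As a rough illustration of the 67-in-100 figure: if measurement error is assumed to be normally distributed, the chance that it falls within one standard error on either side of the observed score is about 68 percent. The sketch below rests on that assumption and is not the MCAS scoring procedure. The function name `coverage` is hypothetical, and the standard error of 5 points is only what the 215-to-225 range implies.

```python
import math

def coverage(k: float) -> float:
    """Probability that a normally distributed error lies within +/- k standard errors."""
    return math.erf(k / math.sqrt(2))

# Assumed for illustration: observed score 220, standard error 5
# (the value implied by the 215-to-225 range in the interview).
observed, sem = 220, 5
low, high = observed - sem, observed + sem

print(low, high)                # 215 225
print(round(coverage(1.0), 2))  # 0.68
```

Under the same assumption, widening the band to two standard errors (210 to 230) would cover roughly 95 percent of cases.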

[But in Massachusetts, for example, the state says 220 is the cut score on its test, the MCAS. Students must score 220 on the MCAS in order to graduate.] So if I got a 219, could that as easily have been a 221?

Oh, sure. And a 221 could as easily have been a 218. ...

We've been following one student, an 11th grader named Madeline Valera. She scored a 216 on the math MCAS test, so she didn't pass. ... How firm a number is that for Madeline?

Depending on what the standard error is, if the standard error is four, let's say, it could be 212. [If] it's six, then it's 222 to 210. That's the error band. ...

And so?

If, in fact, it was 222, she should have passed.
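The arithmetic behind that error band can be sketched in a few lines. This is only an illustration: the standard errors of 4 and 6 are the hypothetical values used in the answer above, the band is taken as one standard error on either side of the observed score, and the helper names `error_band` and `cut_within_band` are invented for this example.

```python
def error_band(observed, sem):
    """Return (low, high): the observed score plus or minus one standard error."""
    return observed - sem, observed + sem

def cut_within_band(observed, sem, cut):
    """True if the cut score lies inside the one-standard-error band."""
    low, high = error_band(observed, sem)
    return low <= cut <= high

CUT_SCORE = 220   # passing score cited in the interview
OBSERVED = 216    # the student's reported math score

for sem in (4, 6):  # hypothetical standard errors from the answer above
    print(sem, error_band(OBSERVED, sem), cut_within_band(OBSERVED, sem, CUT_SCORE))
# 4 (212, 220) True
# 6 (210, 222) True
```

With either value, the band around 216 reaches the 220 cut score, which is why a single reported score cannot settle which side of the line the true score falls on.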

So Massachusetts says, "Madeline, take the test again."

Yes. And Massachusetts could have done what your doctor would do and say, "Let's see if we can figure out another way you can demonstrate these competencies." You don't have to do it for everybody. But we do it for people like Madeline around the cut score. ...

Now, I understand that the last time the [MCAS] was given, there were about 3,000 students, 10th graders, who were on that line. They were below 220, but just barely below it.

That could very well be. ...

What is your argument against using that score [to prevent someone from graduating from high school]?

The technology isn't up to that. I mean, one of the things that people don't see is the arcane underpinning of all this. There's something called the "three parameter item response theory" -- algorithms that they use to arrive at these scores. These involve assumptions, these involve rounding, these involve all kinds of things. ...


Maybe if I had rounded a different way or I had put in a different assumption into my program, I might have gotten a slightly different result -- not a dramatically different result, but a slightly different result. And just to take a number and say, "This is your score. This is it," we know that it isn't the true score.

The people who make these policies know that when they say, "OK, 220 is passing and 219 is failing" -- they know these scores are squishy?

Sure.

Why do they do it?

Well, again, you're into a series of issues. One of them is political. You don't want to seem to be fiddling around with the cut scores. These cut scores somehow, as I said, get reified. That's one reason. The other reason is that you're going to get a lot of backlash when people really understand how this classification system works, and they don't want to deal with that. And the other reason is that they, again, leave out one of the most important informational things we have about these kids, and that's teacher judgments.

They don't trust teachers?

No, that's why we have a lot of these state testing programs. They simply don't trust teachers. Interestingly, a recent poll shows that teachers are one of the most trusted groups and professional groups in American society.

There's a long history of kids being pushed along, [falsely promoted]. ... Maybe there's a good reason for not trusting teachers?

Certainly I think there should be testing programs. I mean, I think testing gives really valuable information on how schools and systems are doing, particularly certain systems that traditionally serve certain populations of kids, ESL kids, special-ed kids, minority kids, poor kids. Test information can throw a lot of light on that.

One of the reasons [a cut score is] used is [it's] a stick that you can beat people with. But it's the kid you're making the decision on. We can get very adequate information on how a school is doing, or how a system is doing, or how a state is doing without saying, "If you don't pass this test, you're not going to get a diploma, or you're not going to go to the next grade." You don't have to do that in my opinion.

But I think that these tests can and should be used to judge and hold the system and schools accountable.

Hold the grownups' feet to the fire?

That's right, yes. ... When all this started, there were going to be not just performance standards; there were going to be "opportunity to learn" standards. ...

[What are opportunity to learn standards?]

Originally, that was that there would be standards for textbooks, classrooms, teacher training, teacher experience, funding levels -- a whole series of things like that. But this was back when it was pie-in-the-sky talk of a national test. And at the time, people raised the issue that it isn't fair to impose a punitive testing system without first making sure kids had the opportunity to learn whatever the test was going to teach.

So if you're going to test in chemistry, then every kid ought to have--

Certainly have access to a lab and to a qualified chemistry teacher, and things like that. ... [But] start right down in the kindergarten, a kid [should have] a good meal before the kid comes to school.

Politically, [the opportunity to learn standards] went down in flames, because a lot of the governors didn't want to get into that whole area at the time. ... And those disappeared. So there isn't a level playing field for a lot of these schools and kids.

And also, the one-size-fits-all [test] bothers me a lot. ... I want to be a bricklayer, so I go to a vo-tech school [where] I'm learning bricklaying. Why should I take the same science test as someone who is going to go to Harvard?

Why shouldn't you?

Because I have different life's choices, and I have different interests, and I have different ability.

Isn't that an awfully elitist argument?

As I said, a lot of this is philosophical and ideological. ...

These are basic [skills tests].

They're not. The MCAS is not a basic skills test. I have no trouble with a basic skills test. And in some states, it is a basic skills test [but] MCAS is not a basic skills test. ...

The National Goals Panel issued a report on eighth-grade science and where the United States fell relative to other countries in the world. I think there were 45 countries. Now, the headlines [said], "United States does mediocre at best." If you break it out by states, five states, including Massachusetts, scored as high or higher than every country except Singapore. OK? Now, on the eighth-grade MCAS science test in 1999, 73 percent of the kids either failed it or needed improvement. There's a disconnect here. So that science test is not a basic skill science test.

Should all kids be able to read, write and compute? Yes. But do all kids need to read the same thing and have the same level of math and sciences? No, I don't think so. If that's elitist, then I guess I'm an elitist. ...

Are you saying the MCAS, the Massachusetts test, is a bad test?

No, I didn't say that. In fact, it is state of the art. I mean, all these tests have limitations. It has limitations, the same limitations as any test, even one built by very qualified, very conscientious professionals like those building that test at that company, Harcourt. ... These guys and women know what they're doing. Nonetheless, the technology has inherent limitations to it.

Your objection is to the use of the score?

Yes, it's the use of the score, right. To give it the precision that they want to give it is crazy. ... It's bad practice. I think that they need to get, again, second opinions. They need to get clinical judgment of teachers. They need to get other measures, and then come up with a decision about Johnny, if they want to do that. ...

You said it's a good idea to have these tests, and to use them to hold schools and the adults in them accountable. But if the tests have no meaning to the kids--

... A lot of people think that if we don't put the pressure on the kid, then nothing is going to happen. And that's an argument with a moral dimension. Well, why should you punish the kid in order to whip the system into line?

There was a piece in The New York Times about three weeks ago on the foot-and-mouth disease issue in England. And the article was about whether you vaccinate these cows or not, and sheep. And one of the arguments is if you vaccinate, then you can't tell whether the beast has hoof and mouth after that. I mean, it masks it. And one of the leading veterinarians in England said, "We are developing tests that will give us a very accurate measure of the health of the herd, but doesn't give us accurate measure on an individual beast."

And in an analogous way, that's the same issue we have here on educational testing. And it isn't just pass-fail; it's the difference between proficient and advanced, between needs improvement and proficient. There's error all along that scale.

I think you should get the numbers [from the tests]. I think you can use the numbers. But I think you need to use them with other kinds of information about the thing you're trying to make decisions about. ...

We have a system set up which holds kids accountable. [We say], "Pass this [test] if you want to graduate." You're saying it hurts kids. ... But normally, somebody's benefitting somewhere. Who's benefitting from this system that we're in the middle of?

Well, politicians are. Test companies are. Teachers in schools are. And the teachers in schools with the standards-based reform, they look at the standards, and it helps them better understand what they should be teaching. It gives them good information about the level of performance expected. All that's to the good.

The other thing we ought to look at is, how do other Western industrialized countries [test]? I don't know of any that test the way we do below age 16 and 18. ...

The argument here is that we need these tests [because] they enable us to have a meritocracy, or identify the best and the brightest who do well on these tests, and will get opportunities they might not otherwise have had.

In this country? We know who isn't doing well. We don't need these tests to tell us who is having a hard time and who's in trouble in school. ... You can ask any classroom teacher and they can tell you ... who the kids are that are having trouble in math, reading -- you name the subject. We know that certain populations are poorly served, that there are schools that aren't doing a good job for these kids. We know that. But now we've added this test as a quasi-documentary of those problems that we've known, for years, exist. And you're not going to test your way out of those problems.

What do you mean by that?

Well, just giving a test and getting the results back, that's not going to necessarily change things for the better. You've got to do other things. You've got to have opportunity to learn standards. You've got to have better funding. ... You don't have level playing fields. ... We're not going to solve the problem by pulling up the tree and looking at the roots every year and then planting it back again. ...

The commissioner of education in Massachusetts, David Driscoll, defends the use of the high-stakes MCAS test because, as he says and the law prescribes, kids have multiple opportunities. If the first time they don't make it, they have four more chances. And the last two are targeted to give them even a better shot at it.

But that's not what the professional standards, the [American Educational Research Association] standards, or the National Research Council, calls for. They call not only for multiple opportunities, but for multiple measures of the same construct. Not just repeating the same test four or five times. ...

Commissioner Driscoll is absolutely right. They do have these opportunities. We also know from the Texas data that a lot of kids don't stay around to exercise those opportunities -- they leave school. And that's a problem. There is a cost that we can only estimate. ... Even if they didn't get discouraged, it may be that, for some kids, they can't demonstrate what it is you want them to demonstrate on that mode of testing. In another mode of testing, they might very well be able to show you what it is you're looking for.

And so we need to try to get other indicators of what it is we are truly interested in, in addition to the multiple opportunities.

So, multiple opportunities and multiple [measures]? ... Multiple measures like what?

The MCAS isn't the only fourth-grade math test around. There are an awful lot of them around. Or for kids on the borderline, you can go in and get direct measures. ... [Y]ou might need to go in and give a kid a book and say, "Please read for me." Or you might, again, start to get teachers back into the process. ... They have a ton of information on kids. ...

The test companies, essentially, put a warning on these things. [They] say, "Don't use these for high stakes." But states do. How do you explain this?

It's a political question, and education has become a political issue. One of the things that legislators or governors can do is they can impose tests. And they don't have to worry about what goes on in classrooms; they don't have to get into the messy details. They get numbers out that are quantifiable. So it's very attractive, and it's cheap, relatively speaking.

Lots of money being spent.

... [G]iven the overall education budget, testing is a very small part. It's getting bigger, but it's a relatively small part. It's not nearly as expensive as equalizing funding, putting money into in-service training of teachers -- a whole series of things that would cost a lot more money.

As I say, you get quantitative results that can go in newspapers, and you can appear to be addressing the problem. And that's why I say you're not going to test your way out of the problem. ...

So the test companies, when they say, "Don't use this for high-stakes decisions," is that just being disingenuous? They're taking the money.

Sure, they're taking the money. ... This is the other thing that people don't fully understand. [In] other industrialized countries, testing is not a big business run by publishing companies. It's run by departments of education or by examination boards. ... [Here] it's a commercial enterprise.

Accountable to?

Whoever the person paying the bill is. ...

The tests themselves, I wonder how good they are. ... The other day a young student, a 10th-grader, found a mistake on the math tests. Someone else revealed that James Madison was identified as John Madison. Is that an area of concern?

Oh, sure. That happens. ... That's what I mean about the limitations of the technology. These kinds of items are going to slip through. And it can become a serious issue, in some cases. ...

You make it sound as if these policies really do hurt a lot of kids.

They can, yes.

Is this mean-spirited people at work?

No, no, not at all. I think these people have what they think are the best interests of the kids at heart. It's just, as I said, we have our ideological/philosophical differences here about what is good, what should be done and what shouldn't be done. What I'm saying is that the technology that you're using to do these things has inherent limitations that we don't fully take into account. ...

Young Pete Peterson, the 10th grader who found [one of the errors on the MCAS], ... does that suggest that maybe there are other mistakes? Other bad questions?

Maybe not as blatant as that, but there may be other what we call ambiguous items, where A is the answer that they want, but B isn't that bad. ... And that ambiguity, again, masks the true ability level of the kid. The kid may have read something into the [question]. Remember, adults are the ones who write these items. Kids are the ones who answer them.

My favorite example of that is the famous cactus question. There was a third- or fourth-grade test. And they said, "Which of the following needs the least amount of water?" [There was a picture of a] cactus, they had a geranium [in a pot], and then they had a cabbage. And, of course, the adults wanted [the kids to answer] cactus. A number of kids picked the cabbage. And when asked why, they said because the cabbage was picked, it doesn't need water anymore. Perfectly, perfectly sensible choice for those kids, but they got marked wrong.

There's a lot we don't know about how young kids approach these items, what they read into these items. ...

You're talking about the bad questions and ambiguity. But the technocrats would say, "Well, yes, but we're getting better."

... In terms of the testing technology, we are still back in the Model-T era. It's basically the same technology that we've had. ... I think eight, 10 years down the road, when we find new ways to use the computer technology and we meld the two together -- and by that, I don't mean we just throw multiple choice items into a machine and have the kid do that -- I mean simulations. For example, one of the exciting things that we're working on is with doctors and medical simulations. It's a test ... but it has all kinds of potential for ... training. ...

Pilots take flight simulators. ... So I think that we're going to find more and more that K-12 testing eventually is going to go down that road. But right now, the technology is a Model-T technology. And we're very good at that Model-T technology, but it's not a Porsche.

The simulations that medical schools, the Army, other places use ... those are actually teaching. ... Whereas the [multiple-choice] tests we give now are ... not necessarily teaching.

No, they're not. And the other thing is the misinformation that the purpose of a test like the MCAS is for diagnosis. Again, let's do a medical analogy. You come in and you take a test in May, and I don't give you the results back until November. And then I don't even give them to the same doctor; I give them to another doctor. That's not diagnosis. If you want diagnostic information, you've got to give stuff with very fast turnaround and very fast feedback. And that's not what these tests do. These tests classify people -- that's what they do. That's not diagnosis. ...

There's a proposal [by] the president of the United States for expanded testing. Is this a good idea?

I don't think it is. I think that we have enough testing now. We know who the kids are that need help. We know the kids that aren't doing well. [Putting] another layer of testing on top of all that we have is, if nothing else, going to take away from instructional time. ...

Some say [this movement] goes way back to "A Nation at Risk" in 1983. ... Give me a sense of context. ...

Well, I can go back to the 15th century in Italy, where the schoolmaster's salary depended on how kids did on a viva voce examination on the curriculum, which at that time was pretty much rhetoric. And up until the 19th century, you had payment by results in Australia, Jamaica, Great Britain, Ireland -- almost everywhere where the Brits went, except Scotland. ...

You had a performance contract here in the United States in the 1970s. You had minimum competency movement in the 1970s. This is not a new thing.

And in every single case, we know what the effects of those are. We know that teachers teach to the exam. ... Now, people say [that's fine], if you can have tests worth teaching for. Well, no test should replace a curriculum. ...

So this is an old [issue]?

... And it's predictable. We know that scores are low in the first few years of a testing program, and then they gradually go up as people catch on. Like in any public policy thing, you can corrupt the social indicator. Whether it's ambulance response time, on-time flight for airlines, arrest rates -- these indicators are corruptible and tests are corruptible. And you can have these scores go up, and not have the underlying learning that you seek to improve go up. There are wonderful examples of this going back a long, long time. ...

And I predict that with what's going on now, we'll start to implode. ... You're going to start to see it happen in states when large numbers of suburban kids don't do well. ...

There have been protests -- parents keeping their kids home in well-to-do communities out in California, Scarsdale, N.Y., and other places. Is that going to keep happening?

Yes. I think so.

There are also reports of teachers saying, "I don't want to teach in this [environment]," and leaving.

Well, that's one of the unintended consequences that needs to be documented. That may be an urban myth, but we need to know. ... You can have a very good goal [in] mind, but the unintended consequences of those goals very often come back and bite you.
