But many people might have [said] that the tests sample the most important
parts.
Not necessarily the most important parts. For example, national achievement
tests, standardized tests, are going to measure those things that are most
common within a particular grade level. So it's going to measure the most
common things that are taught to fourth graders and fifth graders in certain
areas. It may be very, very important to teach students how to actually set up
an experiment. That's a major goal of science. ... I can't measure that very
well with a multiple-choice type test. Most of the national achievement tests
that are being used are of multiple-choice formats, so they narrowly have to go
in and measure those things that are best suited to a multiple-choice format.
And what's best suited for a multiple-choice format?
Well, it's going to depend. A lot of people have an argument that
multiple-choice tests only measure basic skills, that we cannot measure the
student's ability to evaluate. When you think of a taxonomy of behaviors,
that's not true. You can measure evaluation; higher-order skills can also be
measured with multiple choice. But I can't measure per se the student's ability
to produce something. I can measure underlying skills that may be associated
with that, but not their ability to actually produce a product. ...
Why do we use multiple choice so much?
... The best [argument for using] multiple choice would be [that] it's an
efficient method to measure certain skills, certain knowledge. There are other
ways of measuring different skills, different knowledge. So you could use a
performance assessment to pick up other types of skills [and] combine the
information with the information you gather from a multiple-choice [test]. The reason
we tend to go first to a multiple choice is practical: I can measure a lot of
things very efficiently, very quickly, and fairly cheap. I can't do that with a
performance assessment.
So what important thing can you not measure in the multiple-choice format?
Writing. Direct samples of writing. ... When I look at young children, there
are other types of behaviors that are related to learning that I can't measure
with a paper-pencil type instrument.
Such as?
Well, staying on task. Whether or not a student actually asks questions.
Whether they use background information appropriately. Whether they're part of
a conversation. Those are all very important things to know about ... young
learners. But I can't pick it up with a paper-pencil test. So you want multiple
pieces of information from multiple sources. ...
At the beginning, you said that the national achievement tests ask what's in
common. Is what's in common universal?
Oh, no. Absolutely not. You're going to have some pieces of the curriculum that
are pretty standard at particular grade levels. There are certain things that
fourth graders will learn in mathematics. When I think about social studies,
the first thing most students learn in social studies is the notion of family.
Then you move from family to neighborhood. You move from neighborhood, broader,
to city. Those pieces are common. But when I think about the different ways
that a textbook might present that, or curriculum material might present that,
that's going to vary. When you look at the national tests, what they've tried
to do is to find those pieces that are most common, recognizing it's not going
to be a perfect match for any given school district.
How much do they miss?
Depends on the school. Any state that's looking at a particular achievement
test [to] use has a responsibility. They need to start with their
curriculum, because those educators in that system have decided [that] these
are the most important things for [their] students to learn. ... [T]hey should
compare [the curriculum] to what's being measured by each of the possible
competing national achievement tests. And you pick the best match.
But if you just pick a test off the shelf, what's the risk you're taking?
You're taking a fairly big risk of mismatch, without looking at the test. I
would think you're being very irresponsible.
How many choices do you have?
There are [five] major achievement tests. There are lots of different ways to
create customized tests, depending upon which vendor you go to, that might
better measure your local curriculum. ...
Doesn't sound like a whole lot of choices.
Probably not. When I look at choice within educational testing, you're looking
at a market. Like anything else, [it's] a private market. Probably 10 years ago
when tests were used slightly differently, there wasn't a heavy emphasis on a
single test result. Those ... tests, used in combination with other sources of
information, probably did a very, very good job of providing external
information to the schools.
If I'm now in a situation where I'm saying, "I've got to choose a test that's
going to be used to make very high-stakes decisions either about the school or
the student," sure, you want to find the best possible match. And [five] tests
are going to leave you wanting. ... I think you're going to see new players
coming into the market of developing standardized tests. Right now, with Bush's
proposal, with the state mandates on accountability, there's a very, very high
demand for tests. And the major publishers in the market are not going to be
able to handle the demand. It's too much.
... So when I pick up my newspaper at home and I look at the chart that
ranks all the schools according to test scores and I see a school at the
bottom, I think, "Gee, that must be a bad school." What do you say to
me?
I say you could have ranked those schools without test results and gotten the
same thing. I would tell you that ranking of schools is really a very unfair
comparison of schools.
So I'm not right to think that?
No, you're not necessarily right. You may not be wrong. If I look at the
factors that impact achievement, it is not as if all schools start off equal.
No question. Most of the communities in the U.S., you have some degree of
segregation, at least by socioeconomic standards. So you're going to have your
very affluent schools, you're going to have more of your middle-class schools,
and this is based upon neighborhood. And you're going to have lower-income
schools. We know there's a very high correlation between income and
achievement.
So if I think about just the median income level of the families in the school,
I know some schools at the bottom are starting with some of the weakest
students. And when I look at schools starting with weak students, in order to
help those students get up to this notion of one common standard, it's going
to take more than [it does for] a school that's starting at the top. ...
[I]f you have a kindergarten child that enters school that doesn't know their
alphabet, they can't count [to] 10, you're going to first have to teach that
student how to do that. ... [You] have to establish some pretty basic skills
that aren't in place, and in other students, they are there. And at the
end of kindergarten, [if] I want to get this student up to where maybe they
have beginning reading skills in place, it's going to take more time. ...
And at the end of that first grade, if that student learns the alphabet,
learns some reading skills but still scores poorly on the test, is that a
reflection of poor teaching?
No, not necessarily. ... If you look at the growth of the student, from where
they started to where they finish at the end of the school year, that could be
an outstanding school.
Do the tests tell you that information?
No piece of information tells you that in isolation. None. So [whether] you're
looking at teacher grades or test scores ... We need to take all the
pieces. I need some idea of what the kid is bringing to me when they come in
the door, so I have to have some early informal assessment or formal assessment
of where the student is. ...
When I think about what a school should do or what makes a school good or bad,
it's far more than how well they teach a few sets of skills. When we think
about what we've [asked] schools and teachers to do, it's the developmental
growth of these little five-year-olds and six-year-olds that come in the door,
all the way to being productive citizens, able to go to college or able to go
on to work. And if I think about, what does that mean, it truly has to be more
than just achievement in a few basic areas.
One, I want them to develop an appreciation and value of societal norms. Being
a good citizen. Caring about learning. Contributing to your community. I can't
measure that with a test. When I think about what makes a good school, I want
to think about, do teachers have the commitment to help a student who may not
be the traditional student, who needs a little bit more? ... A test score
doesn't capture that. ...
So if I were to try to say, "OK, all of that sounds wonderful. And I want the
school to do it," but all that I can easily measure is your achievement over
some basic skills, I've cheated the schools. ...
[S]ay that student scores in the 5th percentile [of all the students
who take a test]. How fair is that to the teacher who's being rated according
to where the kids stand?
My opinion [is that] it would never be fair to rate a teacher whether the kid's
at the 5th percentile or the 95th percentile. My rating on that teacher should
be based upon more information than an achievement test score [can provide].
So I don't
think it's fair. ...
And yet [students' scores are used to measure teacher performance]. We make
this huge leap of faith.
We sure do. And it's because we're comfortable with this objective test
scoring. It's safe. We think we know what it means. And we think it can stand
in as a proxy for the value of the whole, which is wrong. The inference is just
wrong. Possibly, it's our fault as educators because, in the past, we haven't
wanted to be accountable. We've avoided this notion of testing or accountability,
or what do I do with my day. University professors have it, too. We've given
the public a sense we're hiding.
So I think a lot of it is lack of communication, an unwillingness to accept
when the public says they're not happy with us. They want more. They want
schools to do something different. A lot of times, we run in fear, rather than
trying to sit down and communicate: "This is what we do. We don't have a good
way to show you that we're doing it. What would be better?" ...
What is a "norm-referenced" test?
... [If] I want a national norm reference test, I'm going to get a
representative sample of students in the nation and I'm going to give them this
one test that represents common curriculum. ... The scores that are going to be
yielded are scores that give me information of how children are doing relative
to each other. So relative to third graders, how [does this particular third
grader] look? ...
[Percentiles, that's one way to score a norm-referenced test. That is, if
you're a student who scored in the 5th percentile, you scored better than 5
percent of the other students who took the test. And 95 percent of the students
who took the test outperformed you]. ... What does a norm-referenced percentile
score tell you about whether or not the child has met standards for learning?
If you were in a situation [where] you were standards-based -- and your main
focus was, has the student learned the standard? -- you wouldn't use a
percentile rank, or a percentile. ...
A lot of educators in the public, we've gotten comfortable with the [percentile
scores and the] comparison to others. [We can say], "I don't have a firm way to
defend any given standard, but I do know that compared to other third graders,
you're performing well."...
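[Just to make that contrast concrete: a percentile rank is purely a position
within the norm group. The sketch below uses invented scores (not from any real
norming sample) and one common convention for computing the rank; it says
nothing about whether a standard has been met.]

```python
# Minimal sketch with invented scores: a percentile rank only locates a
# student relative to the other students in the norm group.
def percentile_rank(score, norm_group):
    """Percent of the norm group scoring strictly below the given score
    (one common convention; others count half of any tied scores)."""
    below = sum(1 for s in norm_group if s < score)
    return 100.0 * below / len(norm_group)

# Hypothetical norming sample of raw scores on a 40-item test.
norm_group = [12, 15, 18, 20, 21, 22, 23, 25, 26, 28, 29, 31, 33, 34, 36, 38]

print(percentile_rank(22, norm_group))   # relative standing only
# Nothing in this number says whether 22 correct answers meets a
# learning standard -- that requires a separately set criterion.
```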
[So it's not appropriate to use norm-referenced tests that are scored using
percentiles. But what if a norm-referenced test is scored differently? Is it
the right test to use for measuring whether a student has met the learning
standards?]
It may or may not be. ... The best test would be a test that's built
specifically to measure the curriculum of your school, a test that has enough
items measuring any given objective that you can make a trustworthy decision
that the child has either achieved the standard or not. That, in all
probability, would lean more towards what we think of as "criterion reference,"
as opposed to norm reference, because the emphasis in norm reference is, how do
you perform relative to someone else? ...
[With criterion-referenced tests], it's not how your third grader does
relative to other third graders. It's, "Does your third grader know what we've
deemed to be enough? Do they know enough to be called excellent, or basic, or
needs improvement?" But how do you decide?
Real tough. With the criterion reference ... you can define a set of material
that students should learn. So within reading -- which has got to be the
hardest one in the world to even think about setting a standard -- I have a set
of objectives and content and types of material students should be able to read
in fifth grade. Now, at some point, I have to decide how much is enough. And in
order for me to say that you are a good reader, let alone an excellent reader,
someone has to come in and set a [score], and decide in order for you to earn
this label "Good," you have to answer X number of questions in the reading
section correctly. The standard is set subjectively. There is no other way to
set a standard. ... Somebody or some groups of bodies -- educators,
politicians, businessmen -- have to decide how much is enough. ...
Another way that you can actually set a standard is to take groups of students
that you have labeled as excellent, good, poor, what have you, and look at
their actual performance on that test and find the breaks in the distribution
and set the cuts there. That's one way you can do it.
Or if you want to raise the bar, which is what we're doing now, it's not so
much what students actually can do, but it's what we believe they should do.
And I can set a standard based upon my beliefs of what students should do.
That's what NAEP does.
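[One way to picture the "find the breaks" approach mentioned above -- often
called a contrasting-groups method -- is sketched below with invented scores.
The group labels and the midpoint rule are illustrative assumptions, not a
prescribed procedure.]

```python
# Illustrative sketch (invented data): set cut scores between groups of
# students who have already been judged poor, good, or excellent readers.
from statistics import median

scores_by_label = {
    "poor":      [8, 10, 12, 13, 15],
    "good":      [18, 20, 22, 23, 25],
    "excellent": [30, 32, 33, 35, 38],
}

labels = ["poor", "good", "excellent"]
cuts = {}
for lower, upper in zip(labels, labels[1:]):
    # One simple rule: put the cut halfway between the medians of adjacent
    # groups. Other rules exist; the point is only that the cut is derived
    # from where the labeled groups' score distributions break apart.
    cuts[f"{lower}/{upper}"] = (median(scores_by_label[lower]) +
                                median(scores_by_label[upper])) / 2

print(cuts)   # e.g., {'poor/good': 17.0, 'good/excellent': 27.5}
```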
At the state level, expectations of what students should be able to do ...
How often are [those expectations] pie in the sky versus realistic?
They're most often, in my opinion, pie in the sky. The reality ... come[s] down
to the decision, how many students can you afford to fail? How many students
can you afford to remediate? Because if they don't reach the standard, what do
you do with them? Do we just throw away generations of students? So when you're
setting this standard ... you have to look at the outcome too. ...
Well, if they ask that question, isn't there some incentive to [say], "I
don't want to fail a lot, it's going to make me unpopular?" ...
There are certain states where projected failure rates right now in their new
standards are 50 percent. And we have commissioners and politicians who are
saying, "That's OK. I'm going to live with that 50 percent now, until people
understand it's a new day and we have to get the students up to speed." I don't
know how you live with failing 50 percent of your students. I don't know how
you really can say that 50 percent of your students are not performing to a
standard that's been subjectively set. ...
States may set the bar high. Does everyone have an equal chance of getting
over the bar today?
No, absolutely not. ... But what do you do with the bar [in states with high
failing rates]? The only way you're going to change that is to drop the bar.
You could say, "We'll get a qualified teacher in every classroom."
Oh, absolutely. I wish we would say that. I wish we would say we're going to
make sure all students have a qualified teacher, [that] all students have the
resources necessary to learn, all students have access to a computer. School's
open, you can go to school, it's a safe zone. ... But all students don't have
those opportunities. ...
[With norm-referenced tests], we're comparing one third grader to another
third grader. You have 40 questions roughly, and 40 minutes in which to create
[statistical variation between the students -- the "bell curve."] How do you do
that? ...
Before I pick those 40 questions, I have a curriculum. And let's say I write
120 to 150 questions. I write far more than what I'm going to use. I know that
I have to have questions to challenge the brightest students, I have to have
questions to challenge those students in the middle, and I have to have easy
questions to challenge those students on the bottom. ...
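[A rough sketch of the selection step being described: from a larger pool of
tried-out questions, keep a mix whose difficulty spreads across the ability
range. The p-values (the proportion of tryout students answering each item
correctly) and the cutoffs are invented for illustration.]

```python
# Illustrative sketch (invented p-values): choose items from a tryout pool
# so that the final norm-referenced test spans easy, medium, and hard.
item_pool = {            # item id -> proportion correct in the tryout sample
    "q01": 0.92, "q02": 0.85, "q03": 0.78, "q04": 0.70, "q05": 0.63,
    "q06": 0.55, "q07": 0.48, "q08": 0.41, "q09": 0.33, "q10": 0.24,
}

def pick_spread(pool, n_easy, n_medium, n_hard):
    easy   = [i for i, p in pool.items() if p >= 0.75]
    medium = [i for i, p in pool.items() if 0.40 <= p < 0.75]
    hard   = [i for i, p in pool.items() if p < 0.40]
    # Real selection also weighs content coverage and other item statistics;
    # this only shows the idea of deliberately spreading difficulty.
    return easy[:n_easy] + medium[:n_medium] + hard[:n_hard]

print(pick_spread(item_pool, n_easy=2, n_medium=3, n_hard=2))
```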
Someone said to me, "On norm-reference tests, national achievement tests, if
a teacher does a stellar job teaching content, it's very likely that that won't
appear on the test, because too many students will get that question right."
Say a whole school system focuses on punctuation, so that [the students all]
know their punctuation. Can you ask a question about that?
Absolutely.
And what if they all got it right? ... Would you throw out that
question?
No. ... When I'm building [a] test of basic skills, I'm going to draw a
representative sample of schools and classrooms from across the country. School
district X in the Northwest, teacher Jones may in fact have done an excellent
job of teaching that particular objective of punctuation. But the rest of the
country may not be doing that. So that when I look at the average difficulty
level, it's not going to reflect that everyone's doing the same thing. ...
How much material on a norm-reference test is at grade level?
That's tough. The majority would be what you would think of as grade level. But
if you go into a fourth-grade classroom, all students are not performing at one
point. Some fourth graders are working on material a little bit above, some
fourth graders are working on material a little bit below. Typically, if you
want to think about a test, some of the easier questions for fourth grade may
in fact be material that the average or above average third grader can do. So
that would be off level for fourth grade. But it would challenge those least
able fourth-grade students. ...
So what is the difference if you were, say, writing a norm-reference test
for fourth grade [math] and a criterion-reference test of fourth-grade math?
Would you take a different approach to the questions you write in some way? How
would the test writer approach it differently?
I wouldn't take a different approach to the question I write. But I would write
more questions first, to cover a particular objective, and the manner in which
I selected the questions to include on the test would be different.
How would it be different?
... I wouldn't have to worry about having a set of items that challenges the
very bottom, or the very top, or the majority in the middle. My focus should
purely be on covering the material. And if all students can get it right, so be
it. [But] if I have absolutely no variability, if every student got every
single question right, I don't have any information. So what I would do on the
criterion-reference test is really try to have questions that spread students
out from those who are good, versus not good, or on level, proficient, not
proficient.
... All the public really wants to know is, "Has my kid learned third-grade
material?" ...
What about those third graders who are already beyond third grade? They're
still in a third-grade classroom. A good teacher has to be able to structure
instruction in such a way that she can challenge the range of ability. If she
really has a third grader who's very, very bright, who learns very, very
quickly, she has to have a tool that allows her to challenge that student, to
push them on. We don't want her to just stop and [say], "You're done for the
year." You want to keep them learning. The most common way to do that is to
reach to the grade level above. ...
So the test then would help the teacher do a better job in that sense?
That's why we originally thought we were building these things, and not [for]
accountability. Absolutely. ... I build tests to provide instructional
information to help a student learn.
... If you built it to help the teacher, how has it been twisted? ...
It has been viewed by policymakers as a cheap, efficient, external tool that I
can mandate onto a school or a system, get a magical number, to see how well
you're doing. And I can hold you, as a district, accountable. When the public
became dissatisfied with what students knew, what they were learning, what our
schools were doing, everyone in a good faith effort reached for a fix. No one
really knows how to fix public education. But one very easy way of at least
indicating a problem, or strength, is through a number. It's objective. ...
There are a lot of people who look at the test scores and they say, "Oh, my
child got a 75." ... Is a 75 always a 75?
Absolutely not. That score of 75 contains some amount of error. What I want to
do is try to make sure it doesn't contain a lot of error. You happen to take
the test at 10 a.m. Well, maybe you would have done differently at 12 p.m.
...
What is the standard measurement error in the history portion [of the
eighth-grade ITBS]? ...
I can't give you a standard error off the top of my head. ... It's going to be
based on a different test and a different score scale and what have you. For
example, hypothetically, let's assume the standard error for scores near the
center (because [standard error] depends upon where the score is) might be five
score points. So if you have a score of 75 ... I know that 75 is not an
accurate pinpoint estimate. There's some amount of error in there. And what I'm
saying with this notion of a standard error [is that] ... if I could measure your
true ability with no error, your score could be captured within the range
of 70 to 80. ... So it's the notion of providing a range within which I believe
your true ability lies.
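[The hypothetical arithmetic above, worked out: with an observed score of 75
and a standard error of 5, the score is better reported as a band of roughly
70 to 80 -- one standard error either side. The "2 out of 3" gloss below
assumes the usual normal-error model.]

```python
# Worked version of the hypothetical above: observed score 75, standard
# error of measurement 5 (near the middle of the score scale).
observed = 75
sem = 5

low, high = observed - sem, observed + sem
print(f"Reported score {observed}; likely range about {low} to {high}")
# Under the usual assumptions, roughly 2 out of 3 retests would fall inside
# this one-SEM band; widening to +/- 2 SEM (65 to 85) raises that to ~95%.
```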
Parents and policymakers. How much do they know about that [range when
they] look at tests?
Very little. They believe the score you get is accurate and perfect. They have
very, very little sense of understanding of error, which is a problem. ... [I]f
I'm one point below a cut score, and I have a standard error of measurement
that's plus or minus three points, it's very likely that my true ability is
above the cut. And if you have absolutely no knowledge of what a standard error
means, you're going to assume you have perfect measurement and fail the
student. It's a problem.
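[His pass/fail scenario, made concrete with assumed numbers -- a cut score of
60, an observed score one point below it, and a standard error of plus or
minus three: the error band straddles the cut, so treating the single score as
exact can fail a student whose true ability is above the line.]

```python
# Hypothetical numbers for the scenario above: cut score 60, observed
# score 59 (one point below), standard error of measurement 3.
cut, observed, sem = 60, 59, 3

band = (observed - sem, observed + sem)          # (56, 62)
print(f"Score band {band}; cut score {cut}")
if band[0] < cut <= band[1]:
    print("The cut falls inside the error band: a hard pass/fail call "
          "treats the score as exact when it is not.")
```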
Why do you think people are so ignorant about tests?
Think about just the label of people who [devise] tests. The professional label
is a psychometrician. ... It sounds nice, magical. You don't have a clue what
it means. It's a field that very few people know about. ... And it's a
field where we haven't done a good job of educating the public.
We build all sorts of interpretive material to support our tests. They cost a
lot of money. So rather than going out and doing a better job ... of educating
people -- this is what you should use the tests for, this is the type of
information or decision it best supports -- I think we have to blame ourselves
for not finding a way to communicate.
Part of the problem could be that, as a parent, I can't see the test.
That hasn't always been the case. There were times when the teachers could see
the tests. The parents could see the tests. The tests are now a mystery.
They're a mystery because we have high stakes attached with the outcome. We're
holding folks accountable. You put high stakes on it [and] the pressures to cheat,
the pressures to gain an advantage are so high that we're now just putting the
test behind this shield. So I agree -- part of the problem is, you don't have a
clue what we're testing.
So now will this high stakes direction that we're going in shroud tests
further, do you think?
I'm sure, to some degree, yes. And to others, no. If you actually see movement
on the federal level to produce a national test, it's going to be under a deep
layer of secrecy until it's released, until the students take it. The minute
they take it and it's over, release the test. Let everyone know: This is what
was out there. This is federal money.
There are only two states now, as far as I understand it, who actually
release complete tests. Why don't more states do that?
The cost of building these tests is enormous. The amount of time that's
committed is enormous. If you look at something like the ACT or the SAT, where
you're building one test on one level ... you're only measuring four or five
areas. And you're charging students $25, $30 each to take it. With that source
of income, I can replace the test completely every year.
If you look at an achievement battery that's kindergarten through the eighth
grade, there are only two forms. They're developed every seven years, on
average. And you have 11 to 13 different content areas. The average student in
the country may be paying, I don't know what it is now, $6, $7 per kid. It
could be $10, $12. You can't afford to release and start over. So from a
commercial standpoint, the tests that are being used in the current movement
are too expensive to redo. The resources aren't there.
Or we're just not willing. They may not be too expensive. It may be, we're
just too cheap.
Oh, I don't think we're too cheap. I think there are lots of other things that
our educational funds should go to rather than paying to build a new test. I'd
rather put the money on teacher training, put the money into computers. The
emphasis on testing is far too high. If we have extra resources, additional
tests aren't going to make the difference. The difference is going to come from
what the teacher does in the classroom. So give the money to the classroom
teachers, to help them.
How appropriate is it to base a graduation requirement on a test score
[alone]? ...
Inappropriate. Completely inappropriate. ...
The notion of [a test's] validity is tied to a specific test use. Is the test
valid for supporting the types of inferences you're trying to make from a given
test use? ... Do I have information to support that my test score or my test
can yield a set of scores that you can use to make graduation decisions? That's
a notion of validity. ... Most of the tests that are being used now, I don't
have information that would support that I could make a graduation decision,
based upon a test score. ...
Now, the community of test makers seems to [be] in fairly strong agreement
on that. ... Why don't people listen to you?
If you now challenge any of the policymakers, they will tell you they're not
using a test as the sole source of information. They're using multiple
measures, they're taking into consideration teacher grades, other types of
pieces of information. I can't tell you what they all are. ... I have no reason
to believe they're lying to us. So I'm assuming they're starting to use other
pieces of information.
Well, in Massachusetts, you have to pass a test to get a high school
diploma.
Right. But that's not the only thing you have to do. The problem is, the test
is the final gatekeeper. So regardless of all the other pieces of information,
this test is still serving as the final single source. We would say that's
inappropriate. I can't tell you why they're not listening. I wish we knew.
Maybe we haven't communicated strongly enough. Maybe we don't know or have the
tools necessary to enforce appropriate use of tests. I don't know.
You make these tests and you turn them over and you have no say over [how]
they're used?
Which is pretty sad. ...
You hear a lot of politicians these days and business leaders [who say] ...
"I'm for tough accountability. Let's test them and make sure they know [the
material]. Let's expand testing in more grades." It's as if you could pull
these tests somehow out of thin air. I'm wondering, do these people have the
realistic notion of how long it takes and what's involved in writing a
test?
No. Absolutely not. ... We actually spend seven years developing new forms [for
the ITBS]. ... We're constantly making changes. We're trying out items all the
time. It's extremely expensive and difficult to get the cooperation from
schools across the country to try out the test questions.
So when you hear President Bush say, "Next year, I would like to mandate
testing in grades [3] through 8 in ... reading and math, and then we'll expand
it," ... what goes through your head?
It's nuts. ... It's not going to happen. It's an impossible task. That can happen
if it's planned in time. The biggest piece is, what is the information going to
be used for? Bush's plan actually allows the district or state to choose or
design any assessment of their choice. ... We do not have the capability of
scoring, [of] producing reliable, good reports in a timely manner, if that plan
goes through. The resources aren't there. ... For me, it makes me believe it's
going to lead to a federal national testing plan where I'm going to mandate a
particular test for all kids. And that's a scary thought. ...