Banesh Hoffmann, a colleague of Einstein’s in Princeton, wrote a book in 1962 called The Tyranny of Testing. In it, he offered a detailed and wide-ranging critique of multiple-choice testing as a basis for determining “intelligence” and “scholastic aptitude.” He could not have known that such tests would eventually also be used to determine “teacher effectiveness”; had he known, he would surely have been astonished, if not outraged.
Anyone who has read Maria Ruiz-Primo’s study of how useful different kinds of tests are for capturing data on the “opportunity to learn” (OTL) knows that big standardized tests are the worst kind to use. That is doubtless due to limitations in how they reveal what students actually know, and Dr. Hoffmann pinpoints their weaknesses:
• They deny the creative person a significant opportunity to demonstrate his creativity, and favor the shrewd and facile candidate over the one who has something of his own to say.
• They penalize the candidate who perceives subtle points unnoticed by less able people, including the test-makers.
• They are apt to be superficial and intellectually dishonest, with questions made artificially difficult by means of ambiguity because genuinely searching questions do not fit into the multiple-choice format.
• They take account only of the choice of answer and not of the quality of thought that led to the answer.
• They neglect skill in disciplined expression.
• They have a pernicious effect on education and the recognition of merit.
A new point may now be added, as the SAT plans to do away with its penalty for wrong answers (which raises a problem I have dealt with elsewhere):
• They reward successful guessing as well as thinking.
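The arithmetic behind that last point is easy to check. What follows is a minimal sketch, not the College Board's scoring procedure: it assumes five answer choices per question and a quarter-point deduction for a wrong answer, and the function name `expected_guess_score` is my own invention for illustration.

```python
from fractions import Fraction


def expected_guess_score(num_choices, wrong_penalty):
    """Expected points from one blind guess on a multiple-choice question.

    A right answer earns 1 point; a wrong answer costs `wrong_penalty` points.
    """
    p_right = Fraction(1, num_choices)
    p_wrong = 1 - p_right
    return p_right * 1 - p_wrong * wrong_penalty


# Assumed format: five choices, quarter-point deduction for a wrong answer.
with_penalty = expected_guess_score(5, Fraction(1, 4))
# Same format with the deduction removed.
without_penalty = expected_guess_score(5, Fraction(0))

print(with_penalty)     # 0
print(without_penalty)  # 1/5
```

Under these assumptions the deduction makes a blind guess worth nothing on average, while removing it makes every blind guess worth a fifth of a point. Guessing then pays, whether or not any thinking went into it.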
There is also a statistical problem with the tests. Hoffmann points out that the SAT’s ability to determine “scholastic aptitude” turned out to be about as statistically valid as determining a group of people’s heights by taking their weights. If all we needed to know were that heavy people tend to be taller than light people, measurement would merely confirm what a moment’s critical thinking already tells us; but that is not what we need to know. We need to know just how apt or learned or intelligent individual people are, and there the tests are far more troubling.
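The gap between a group tendency and an individual prediction can be put in numbers. This is a rough sketch using the standard formula for the error of a linear prediction, the spread times the square root of one minus the squared correlation. The correlations and the 10 cm spread of heights below are hypothetical values chosen for illustration; they are not Hoffmann's figures.

```python
import math


def residual_spread(correlation, sd):
    """Standard error of estimate for a linear prediction: sd * sqrt(1 - r^2).

    `sd` is the spread of the quantity being predicted (here, height);
    `correlation` is its correlation with the predictor (here, weight).
    """
    return sd * math.sqrt(1.0 - correlation ** 2)


# Hypothetical numbers, for illustration only.
HEIGHT_SD_CM = 10.0
for r in (0.3, 0.5, 0.7):
    err = residual_spread(r, HEIGHT_SD_CM)
    print(f"r = {r}: typical error for one person is about {err:.1f} cm "
          f"(against {HEIGHT_SD_CM:.1f} cm with no predictor at all)")
```

Even a respectable correlation trims only a modest part of the uncertainty about any one person, which is the sense in which a test can be right about the group and of little use for the individual.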
The problem with the SAT is, of course, that even its promoters now admit it does not accurately measure aptitude, and for some years they have let it be known as just “the SAT.” Those letters stand for the nothing the test actually measures, but they have been kept because of their value in branding. One is tempted to say that some value is better than none, but in this case it is actually worse than none. In a parallel argument, Hoffmann shows that IQ tests do not measure intelligence (something we also know from reading Stephen Jay Gould). Instead, they measure a phantom “quantity” that Hoffmann, tongue in cheek, calls iquination: the ability to do well on IQ tests. What should we call the parallel phantom ability to do well on the SAT? I would propose satiation except that it already has a legitimate meaning and does not need an illegitimate one.
A third part of Hoffmann’s argument against the multiple-choice test is that the weaknesses of its questions show that “it is not designed to test deeply what it is designed to test superficially.” Test-makers also use ambiguity as a proxy for difficulty. There are many examples of weak questions; I choose at random one of Hoffmann’s examples of an authentic SAT question, in which the underlined part (here, the “. It” that joins the two sentences) must either be left unchanged or “corrected”:
Cod-liver oil is very good for children. It gives them vitamins they might otherwise not get. (1) NO CHANGE (2) , it (3) , for it (4) ; for it
Hoffmann notes that this question was rated as “easy” by the College Board, but anyone with an ear for the rhythms of English, including the English professors he consulted, would say that it is impossible to choose between (1) and (3) without having a context for the sentence(s). A facile test-taker, on the other hand, will quickly figure out that it must be either (3) or (4) and that (4) would break Strunk & White’s Rules, which leaves (3). Some aptitude, I mean satiation!
The Metrics Claque now hooting up “objective” “measurement” of teachers’ “effectiveness” proposes to expand the use of such tests to rate how good individual teachers are. Remember: First, what they measure in teacher “effectiveness” is a phantom entity. Second, they are not very effective at measuring it. Third, their questions penalize subtlety and discernment, and reward guessing and second-guessing. The Metrics Claque should long since have been shown the door. Instead, it seems poised to bring a new phantom into the gray pantheon. Let us call it teviation (for Teacher Effectiveness Valuation) and its proponents Teviants.