Reading Malcolm Gladwell is always interesting. A few years ago he wrote a piece for The New Yorker on how difficult it is to anticipate which teacher-candidates will turn out to be good teachers, suggesting that the methods typically used to gauge potential quality are not very useful. In the same article he gave a qualified endorsement of “value-added learning” calculations as a way of distinguishing excellent teachers from dreadful ones. The qualification was that they are a crude tool, not meant for fine distinctions such as those regularly made by the testing-and-accountability people.
My own experience with “competency testing” using home-grown instruments tells me that it requires painstaking construction of tests, careful analysis, and great modesty in drawing conclusions. An example of a modest conclusion would be that “the 10th-graders are not doing as badly on writing as they did last year, so the remediation we adopted might be having a good effect. Let’s take out the scores of transfer students to see whether the improvement holds up and then try to decide whether we can take credit or whether the 9th-graders are just growing up.” Another would be that “the 10th-graders are having more trouble making abstract inferences and putting them in good language.” Is there a problem in the program as a whole? In a part of it, for example, the way in which vocabulary is taught? In the development of the 10th-graders’ abilities in abstract operational thinking? In our recent subscription to turnitin.com’s plagiarism-detection service?
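The kind of check described above can be sketched in a few lines. (The scores, the transfer flags, and last year’s mean below are invented for illustration; they are not data from any actual school.) The idea is simply to compute the cohort mean, recompute it with transfer students removed, and see whether the apparent change survives.

```python
# Invented data: (score, is_transfer) for a hypothetical 10th-grade cohort.
this_year = [(72, False), (68, False), (75, False), (55, True), (80, False)]
last_year_mean = 70.0  # invented figure for last year's 10th-graders

def mean(xs):
    return sum(xs) / len(xs)

all_scores = [s for s, _ in this_year]
non_transfer = [s for s, t in this_year if not t]

# Apparent change with everyone included, and again with transfers removed.
print(mean(all_scores) - last_year_mean)
print(mean(non_transfer) - last_year_mean)
```

In this made-up case the cohort as a whole shows no change, but the non-transfer students show a gain, which is exactly the sort of thing that calls for judgment rather than a headline: it might be remediation working, or it might be the 9th-graders growing up.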
In short, using our home-grown instruments required nearly as much judgment and subtlety as not using them, and we never, ever believed or asserted that the numbers alone told the tale. We did claim that they might be crude indicators of gross competence or incompetence, or of trends in whole grades, but beyond that we did not want to venture. In my experience, the makers of standardized tests also offer caveats with their scores: there is an x% chance of unreliability; this 800-point test is valid within 50 points of the actual score obtained; and so on. We successfully resisted suggestions from administrators working on their Ed.D.s in correspondence school that we use the scores as the basis of a value-added learning program. And we were fortunate that our school operated at that time on a collegial, not a bureaucratic-authoritarian, model of administration.
In a collegial system an educational leader brings others along; in a bureaucratic-authoritarian system he prods or lashes them forward with threats and menaces instead of good sense, discussion, and generosity. In a collegial system the “crooked timber of humanity” retains a knotty integrity; in a bureaucratic-authoritarian system it is turned into lumber, planed into shavings, or thrown away. The bureaucratic-authoritarian system is also the one most likely to favor a one-test-fits-all “solution” to educational “problems.”
Gladwell recently wrote another article, this one on the U.S. News rankings of American colleges and universities. Its thesis is that the ratings are shot through with unexamined and unwarranted assumptions that do more than skew the results: they “validate” sometimes invidious or adventitious distinctions between colleges. To illustrate, he gave the list of the “top ten” law schools: U. of Chicago, Yale, Harvard, Stanford, Columbia, Northwestern, Cornell, Penn, NYU, and Cal Berkeley. He explained the calculations that lay behind these rankings and then made plausible alterations in the basis of calculation. The new “top ten”? U. of Chicago, BYU, Harvard, Yale, U. of Texas, U. Va, U. of Colorado, U. of Alabama, Stanford, and Penn.
Quite a change, and quite a plausible one. Yet another set of numbers, used by the Wall Street Journal and based on corporate recruiters’ opinions of the schools where they actually hire their rookies, identified the number one school as Penn State. Gladwell suggests that ratings of this kind may not be very solid, very accurate, or very objective.
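The mechanism behind such reversals is easy to see in a toy example. (The schools, criteria, scores, and weights below are all invented for illustration; this is not the U.S. News formula.) If a ranking is a weighted sum of per-criterion scores, a modest change in the weights can reorder the whole list.

```python
# Toy illustration: a ranking as a weighted sum of criterion scores.
# All names and numbers are invented; none of this reflects real institutions.
scores = {
    "School A": {"reputation": 9.0, "selectivity": 6.0, "resources": 8.0},
    "School B": {"reputation": 7.0, "selectivity": 9.0, "resources": 6.0},
    "School C": {"reputation": 6.0, "selectivity": 8.0, "resources": 9.0},
}

def rank(weights):
    """Return schools ordered by weighted total, best first."""
    totals = {
        school: sum(weights[c] * v for c, v in crits.items())
        for school, crits in scores.items()
    }
    return sorted(totals, key=totals.get, reverse=True)

# Weighting reputation heavily puts School A on top...
print(rank({"reputation": 0.6, "selectivity": 0.2, "resources": 0.2}))
# ...while weighting selectivity heavily demotes it to last place.
print(rank({"reputation": 0.2, "selectivity": 0.6, "resources": 0.2}))
```

Nothing about either set of weights is obviously wrong; the “right” weighting is exactly the unexamined assumption Gladwell is pointing at.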
One of the intellectual vices Gladwell notes in his article is the use of “proxy” values that can be easily quantified in place of values that are impossible to measure (but not, I would add, to judge). The main problem with proxy values is that the basis for using a proxy is usually nothing but an unexamined assertion of equivalence by the person using it. Since these unexamined assertions are not subject to the usual checks on misjudgment that open discussion or a good old-fashioned give-and-take can provide, they continue unexamined, transmitting to their numbers all sorts of crypto-subjectivity.
That being the case, users of such numbers ought to be cautious in the conclusions they draw from them, but that is not what is happening. People accept these ratings and the implicit fine distinctions as valid. A similar error of judgment marks the acceptance by testing-and-accountability people of the fine distinctions between competent and incompetent teachers or between good and poor schools. A moment’s thought should tell us that it is ridiculous to judge the quality of a school by examining the results of its students on two multiple-choice exams administered once a year. That sort of crude proctology by statistics should have no place in the evaluation of schools or of teachers.
In future postings I hope to suggest what we might do instead.
Notes:
1. Called “How Do We Hire When We Can’t Tell Who’s Right for the Job?” in his recent book What the Dog Saw.
2. “The Order of Things: What college rankings really tell us,” The New Yorker, Feb. 14 & 21, 2011.