The Gates Foundation’s Measures of Effective Teaching (MET) project produced two culminating reports last year at a cost of forty-five million dollars. The goal was to test the evaluation of teachers by student surveys and by “value-added metrics” (VAMs), and to promote the use of these evaluations combined with each other and with classroom observations (called composites when taken together in equal measure). The reports were the object of a careful and thorough demolition by Professors Jesse Rothstein of UC Berkeley and William J. Mathis of the University of Colorado. Allow me to conduct you briefly through the ruins.
• “[T]eachers whose prior year classrooms were especially high- or low-achieving were not included,” which means that the recommendation of a comprehensive policy rests on the findings of a non-comprehensive study.
• “Among the MET districts, the fraction of students who actually were taught by the teacher to whom they were randomly assigned ranged from about one-quarter to two-thirds,” which means that teachers’ VAMs were often based on the scores of students other than the ones they were assigned in the study.
• The study “presumes a single ‘general’ teaching quality factor” whose existence is questionable (see my posting on The Phantom VAMs), which means that VAM-based evaluation may be like that new set of clothes for the emperor.
• “The reports do not consider the substantial qualitative literature that considers teaching as a multidimensional enterprise that serves a variety of purposes beyond test-score improvement,” which means that it is as worthless to rely solely on measurement to determine teachers’ quality as it would be to rate the pictures in the Louvre using a scorecard.
• “[A]lternative, more conceptually demanding tests, which appear to capture an additional dimension of teacher effectiveness that is not measured by the regular state tests,” showed a correlation of 0.25 or less with results on the “composites.” This leads Rothstein and Mathis to warn that “[t]here is evidently a dimension of effectiveness that affects the conceptually demanding tests that is not well captured by any of the measures examined by the MET project.” Read Melville’s “The Whiteness of the Whale” and answer these questions: 1. Moby-Dick is A. white B. pink C. puce D. chartreuse E. black. 2. Whiteness is A. ambiguous B. troublesome C. horrifying D. polysemic E. mighty fine.
The critics’ conclusions:
• “There is no statistical procedure that can decide which dimensions should count more in teacher evaluations; that decision must be left to subjective judgments.”
• “[S]ome of the dimensions of effectiveness—those measured by student performance on alternative, open-response (i.e., not multiple choice) tests designed to capture higher-order thinking—are not at all well measured by the standardized measures that were the focus of the MET research.” [emphasis added]
• An evaluation system based on the MET measures “will likely discourage teachers from putting their efforts towards the kinds of learning that the alternative tests measure—in this case, higher-order, conceptual thinking.”
• “There are a number of ways for a teacher to influence her VA score other than through actually becoming more effective: She can arrange to be assigned a different set of students; she can reduce the amount of class time that she devotes to non-tested subjects (e.g., history) in order to allocate more time to the subjects covered by the tests; she can teach test-taking skills in place of substantive knowledge; and she can try to rally students to give their best effort on a test which is usually, from the student’s perspective, entirely meaningless.”
And do bonuses for good scores result in good scores? Rothstein and Mathis note that experimental analysis of “pay-for-performance” programs in the U.S. shows that “teachers eligible for bonuses do not produce higher value-added scores—let alone better teaching along other dimensions—than do control-group teachers who are ineligible for bonuses.”
A friend of mine has done some research, not yet published, that casts another shadow over the use of one of the MET “composites.” It reveals that university students often skew their “ratings” of teachers in favor of those who assign light work and against those who make difficult or troublesome demands of them. The problem is that improvement in critical thinking has been shown to coincide with taking reading- and writing-intensive courses and studying under teachers with high expectations.
Not that the students themselves are always good judges of their own proficiency. My friend’s data show that more than half of the students surveyed think themselves proficient at reading scholarly work, but their teachers are almost unanimous in claiming that they do not possess this power. And what do these same students think is an unreasonable expectation of work assigned? The unpublished results say that 90% of them find 40–50 pages a week for a 3-credit university course “too much reading.” It is a sorry reflection on their expectations that some Cliff’s Notes would be counted as excessive reading, and a sorry reflection on the Ed Biz’s abdication of sound judgment that these same students’ ratings contribute, at some schools and universities, to corrupted decisions about whether their teachers keep their jobs.