Failure by the Numbers

As New York City prepares under a new mayor to make some basic changes in its “system” of holding schools accountable for their students’ performance, a few observations are in order. The first is that the “system” has generally been a failure. This much is admitted even by The New York Times, a qualified admirer of the “system,” whose editors say that only one in four of New York’s students now meet the Regents’ standard of “college readiness,” and this although the “system” has been “in place” for a number of years. Amazingly, they tax the new mayor with “bashing his predecessor” about these results. What should the new mayor have done? Thank him?

I would say not, after reading a report by the city’s own Independent Budget Office on the program. It is true that the IBO issues this faint praise of the “system”: that it “is a significant improvement on accountability methods based solely on standardized test scores.” The “system’s” first problem, of course, is that it used to be the “unimproved” kind—a kind that people of good educational judgment immediately recognized as worthless and harmful, an opinion later confirmed by research. But good educational judgment is what the businessmen and lawyers appointed by the old mayor lacked, as how could they not? And the “unimproved” “system” was never completely replaced.

The second problem is that this “significantly” improved “system” is itself not very impressive. I mean not just that it has left three-quarters of New York’s students unprepared for college, but that its ostensible basis in statistical rigor ends up not meaning much.

Example 1: In the five years from 2006 to 2011, five “grades,” from A to F, were issued to nearly a thousand elementary and middle schools. Table 6 of the report shows that fifty-eight percent of these schools received three or more different grades in successive years. This extraordinary volatility of results was due not to successions of genius phases and vegetative states, but, as the report said, to the volatility of the method used to get them. The report specifically noted that because of its volatility, the “system” had a very difficult time distinguishing the passing grade of C from the failing grade of D.

Example 2: The volatility was so extreme in some cases that schools changed by two or more letters in successive years[1]. The “system’s” answer to these changes eventually was to disregard them when the change was for the worse, and to accept them when it was for the better. As the IBO’s report drily notes, “observing such volatility should lower one’s confidence that the measure is capturing systematic rather than spurious differences between schools.” Indeed.

Example 3: The “peer-related” school evaluations, which supposedly rank like schools against like, are based on a formula that weights the number of black and Hispanic children as 30% of its total and assigns the same weight to students with special needs (as signified by their having an Individual Education Plan (IEP))[2]. Thus are a school’s “peers” found. In the IBO report’s data we find negative correlations between test scores and numbers of students in these two groups. That is, more of these students mean lower scores. However, the coefficients of correlation, unlike the weightings in the formula, show that the components have rather different effects on the final numbers: -0.114 for black and Hispanic students, but -0.457 for students with special needs. A couple of things jump out at the reader: -0.114 is a rather weak correlation on which to base nearly a third of a formula that can lead to consequential decisions, and equal weighting of factors with such different correlations seems evidently flawed. What is more, the table from which these data are taken shows that in many of the cells, the numbers lack an acceptable level of statistical significance (16 out of 40, or 40%).
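For readers who want to see the arithmetic, here is a rough sketch of how far apart those two coefficients are. It uses only the figures quoted above. The step of squaring each coefficient to get a share of score variation is my own assumption: it holds for a simple linear correlation, and the report may have computed its coefficients differently. The variable names are mine, not the IBO's.

```python
# Coefficients as quoted in this post from the IBO report's table.
correlations = {
    "black and Hispanic students": -0.114,
    "students with special needs (IEP)": -0.457,
}
FORMULA_WEIGHT = 0.30  # weight the peer formula gives each of these two factors

# Assumption: r squared is read as the share of score variation associated
# with each factor, which is valid only for a simple linear correlation.
r_squared = {group: r ** 2 for group, r in correlations.items()}

for group, share in r_squared.items():
    print(f"{group}: formula weight {FORMULA_WEIGHT:.0%}, r^2 = {share:.3f}")

# How many times larger the special-needs r^2 is than the other one.
ratio = (
    r_squared["students with special needs (IEP)"]
    / r_squared["black and Hispanic students"]
)
print(f"ratio of the two r^2 values: {ratio:.1f}")

# Share of table cells said to lack statistical significance (16 of 40).
insignificant_share = 16 / 40
print(f"cells lacking significance: {insignificant_share:.0%}")
```

On that reading, the weaker coefficient is associated with a little over one percent of the variation in scores and the stronger one with about a fifth, yet each factor gets the same 30% in the formula.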

Example 4: In “peer groups” of schools with like “results” in this demographic crap shoot, particular schools whose statistical “proximity to the horizon” of that group was more than two standard deviations from the mean were ignored as outliers. Teachers of a certain age will remember Jaime Escalante, the gifted Bolivian who taught at Garfield High School in East Los Angeles, helping to turn it for a few years into a comparative powerhouse of college preparation even among students who did not pass the AP calculus test. The difference between judgment and statistics is the difference between the movie Stand and Deliver and the blank screen of an outlier. Let the inquirer who favors judgment ask Escalante for his secret. He said, “The key to my success with youngsters is a very simple and time-honored tradition: hard work for teacher and student alike.”

Speaking of teachers: the last problem, though not the least, lies in the “value”-“added” teacher “metrics” that go along with the “system’s” school ratings: “value added is difficult to predict based on teacher observables.” So we have a “system” of school ratings that do not strongly correlate even with the questionable data that they comprise, and we have a system of rating teachers with a weak connection to what they actually do. Bashing seems in order to me.

[1] Atlanta’s schools had such extraordinary changes, too, though we are now learning that those results were not due to statistical volatility.

[2] The other two factors are the number of students eligible for subsidized lunches (30%) and English language learners (10%).
