Grading Parties

The school where I had my flexible classroom gave us much room for movement in other things too. In general, the principle was to trust the teachers’ professional judgment, and in general the principle worked. One of the things the English department decided to do was to implement an in-house testing program using homemade “instruments,” i.e. tests, to size up the proficiency of our students.

We called them simply the Reading Assessment and the Writing Assessment. In one the students answered open-ended questions about a passage they had read; in the other, they wrote an essay on a prompt devised by the teachers. What interests me about these tests in retrospect, particularly in light of all the hubbub about high-stakes testing, is how little we thought along the lines laid out by the high-stakes testing people.

We had two purposes in giving the tests. One was to see whether, in general, we could identify problems and strengths in classes of students in order to shape what we taught and how we taught it. The other was to give us a chance to work together at grading the assessments, thereby achieving a broad consensus on marking.

It’s a good thing we didn’t give these tests for “high stakes.” In many years an insurgency developed among the students whereby they would throw their tests or otherwise rebel against taking them. Such insurgencies were rarely very big, but we usually ended up knowing what was happening and how it might affect results. But one year a particularly charismatic student managed to get a fairly widespread test-rebellion going. We found that in the class concerned, the tests were unusually poor, taken as a whole.

One set of ruined results did not have any dire consequences: we just marked them as we had planned and then discussed how we could head off test-rebelliousness in the future. But I wonder now what would have happened in a setting where such tests had high-stakes consequences for teachers’ raises or even their future employment. I have never heard of research being done on the problem of throwing tests, even though such mischief is a credible threat to the integrity of test results.

The way we marked the tests was to call long meetings for such purposes, which we called “grading parties.” (Whoopee!) Before the parties we would devise standards for marking, and then at the beginning of the meeting we would go over them, assuring each other that we understood them or had modified them to take account of the group’s consensus about how to proceed. If results required it, we would revise our standards and re-apply them.

The department head (I was that lucky individual) would prepare the test papers so the students’ names could not be seen by the teachers, and I’d set them up so they would be graded by two teachers, neither of them that student’s teacher that year. After each paper was marked, I would compare grades to see whether they were close to each other. If there was a large discrepancy, I would have the two teachers confer about their marks. These conferences did not occur too often, but usually one teacher would end up modifying a mark. When neither one budged, I would step in. Since the test score was the sum of the two teachers’ grades, I would issue a revised score and give the two teachers a brief opinion backing my action.

Even without conferences, teachers began to have a sense of whether they were too lenient or too severe in their marking by seeing how their marks compared to those of their colleagues. We also benefited from ideas our colleagues had, which had not occurred to us. The parties were a good way of smoking out problems that led to severe or lenient results. I found that by the end of a day of marking (for that is at least how long it usually took) we were closer in our marking than we had been at the beginning of the exercise.

We could use the results to determine what characteristic problems or strengths in writing or reading each class had and (after the beginning-of-year test) teach to those problems or strengths. Each of us had a stronger sense of how the others marked, and we tended to view essays critically in similar ways.

All these seemed like worthwhile goals for testing. We never considered the possibility of using the tests to rate teachers. This is not just because we shied away from such things: we had a program developed by an ingenious consultancy[1] whereby we visited each other’s classrooms and lessons, offering helpful and frank suggestions for increasing or strengthening the learning that took place there.

The reason that we didn’t use students’ results to evaluate teachers was that we didn’t believe in “proxy” values. When you evaluate a student, you evaluate a student. When you evaluate a teacher, you evaluate a teacher. It might make sense to use students’ results as part of the evaluation of a teacher if the method of evaluation relied on subtlety and good judgment and took other things into account than just students’ scores on a single “instrument,” but fortunately we were not in the position of having consequential decisions about teachers follow on our marking of students.


[1] Looking for Learning, developed by Fieldwork Education Services



Question Time

Much has been made of a recent study[1] that shows a correlation between the “effectiveness” of teachers as determined by the scoring of their students on “value-added metrics” and these students’ success in their later lives as determined by “markers.”  This muchness put me in mind—again—of Flannery O’Connor’s remark that “[t]he devil of educationalism that possesses us is the kind that can be cast out only by prayer and fasting.” I am not so sanguine as O’Connor: even prayer and fasting don’t often seem to work! I keep wondering what could possess whole communities of people to be stunned by a complex statistical study embodying years of data on millions of students when it concludes that children with good teachers do better than children with bad. One of the devils in the legion seems to be rather dim, but I will try to give the devil his due.

The original report is impressive in its thoroughness and the care with which its authors make and qualify their claims. They note, for example, that teachers in the study were not “incentivized based on test scores,” thereby skirting the effect of cheating, teaching to tests, and other “distortions in teacher behavior” that make the basis of value-addition different from what it would be in a population whose members had been “incentivized”—that is, in the real world of Campbell’s Law. There is no guarantee that results like this study’s would be similar to those in a district whose teachers were looking over their shoulders at the Value-added Reaper as he made his progress through their ranks. The twofold problem is that the use of “value-added metrics” encourages teaching to tests (the most-purchased books in the New York schools are books of preparation for tests), and there is evidence in research as well as the educational experience of the human race that teachers who teach to tests get worse results than teachers who don’t.

They caution that some elements of the value-added equation require “observing teachers over many school years” and may not apply in a “high stakes environment with multitasking and imperfect monitoring”—that is, precisely, the kind of environment in which hasty “consequential decisions” will be made on the basis of imperfect applications of the equation over the short term.

They point out as a justification for their aggregate numbers that “observable characteristics are sufficiently rich so that any remaining unobserved heterogeneity is balanced across teachers,” but those who want to use “value-added metrics” to make consequential decisions will be applying the equation to particular individuals without correction for “unobserved heterogeneity.”

They note that their study did not include the effect of peers and of parental investment in value-addition. While everyone agrees that the teacher’s effect on what students learn is pronounced, this seems like a significant omission that could have serious consequences for the teachers whose students’ peers and parents had a significant effect on the learning for which the teacher is held exclusively responsible.

The authors state that the study’s assumptions “rule out the possibility that teacher quality fluctuates across years.” Can this be? Raise your hand if your quality was as good in your first year of teaching as in your tenth.

In addition to what the authors say in qualification and limitation of their results, I have a few questions. They say that “value added is difficult to predict based on teacher observables.” Do the people who want to use value-added metrics as the basis for personnel decisions want to go a step farther and assert that there is nothing observable that a teacher can actually learn or plan to do or avoid that will make a difference in how she or he scores?  This seems like a bizarre position for someone who believes in life-long learning.

I want to understand in non-mathematical terms how “academic aptitude” is factored into the equation so that teachers will not be “penalized” for taking classes of difficult or refractory students. It seems to be a single number (ηi) in the equation, but how is it derived?

I would like to know how many years’ value-added ratings they think a teacher should receive before the ratings can be said to reflect his or her actual performance, and I would like to understand the basis for this determination. It is one thing to say that we have some aggregate statistics that show teachers in general have certain effects on their students in the long run, and a rather different thing to say that these statistics can reliably rate individual teachers in one or two goes. This is particularly true given that the authors themselves say some elements of the value-added equation require “observing teachers over many school years.”

Having asked my questions I now make a couple of observations. One of the study’s authors, according to The New York Times, says that value-added metrics should be used even though “mistakes will be made” and “despite the uncertainty and disruption involved.” It is disturbing to see someone so fastidious in the drawing of conclusions become so sweeping and remorseless in applying them, particularly when the study itself has just spoken to the need to “weigh the cost of errors in personnel decisions against the mean benefit from improving teacher value-added.”

The problem with “mean benefits” is that they have particular consequences. The authors have said that they think it would be more cost-effective to fire ineffective teachers (even mistakenly ineffective ones) than to give bonuses to effective ones. I keep wondering whether this kind of decision-making will be ethos-effective. I keep wondering who is going to be attracted to a profession governed by such principles and assumptions as those that lie behind value-added systems. “Drifters and misfits,” as Hofstadter called them? The authors of the study note that no observable teacher behavior correlates to value addition, so I wonder who will join a profession in which it cannot be said with confidence what one needs to do in order to be successful.

The moral and intellectual world in which the discernment of quality was a matter of finesse or connoisseurship and in which reward and reprobation followed particular deeds or ways of doing things is the same one in which we could say without a quantitative rationalization that the students of good teachers do better than the students of bad. That world is also a place where both teachers and administrators take their duties seriously, including the duty to counsel and correct when needed and to accept counsel and correction when deserved or needed.

It might be worth ending with a note on the stereotype that people who are against value-added “measurement” are unionists, educational bureaucrats, or people with tenure to lose in a change of system. In my twenty-five years as a teacher I have never worked within a tenure-granting system.  I have never been in a union shop, nor have I been a member of a teachers’ union. I have never held an administrative position in education except that of Department Head. I have never worked in a teachers’ college. If I am against the kind of practice discussed in this posting, it is not because I have a hidden interest. It is because it seems wrong. I mean both wrong-headed and culpable.

[1] “The Long-term Impacts of Teachers: Teacher Value-added and Student Outcomes in Adulthood” by Raj Chetty, John N. Friedman, and Jonah E. Rockoff of Harvard.



(Brick and Mortar) Schools

It’s time to stop using the expression “brick-and-mortar school” as if there is any other kind. I mean in particular to oppose the terms “virtual” and “on-line” being applied to schools, for such network-connections-and-data-bases don’t act as schools except in a threadbare and impoverished sense.

Or are they even as good as threadbare? The standard of “progress” mandated by No Child Left Behind, described generously as “very crude” by Professor Gary Miron of Western Michigan University, would qualify as threadbare. And yet applying even that standard, a recently released study co-authored by Professor Miron showed that on-line “schools” did worse at “improving” their students than “brick-and-mortar” schools did. (It also showed that for-profit “schools” did worse than non-profit schools.)

The late sociologist James S. Coleman did a large study reported in his 1987 book Public and Private High Schools. In it he found that the single strongest correlate of effectiveness in ordinary high-school education was that the schools in which the effective education took place were functional communities. A network is not a community, though some communities do function partially through networks. There is certainly nothing communitarian in an arbitrarily collected group of young people sitting by mandate in front of screens. Nor do such groups bear any resemblance to the ad hoc groupings (not communities) sometimes found on social networks, whose members make a choice to share some limited interest or focus. That is one reason we distinguish between communities and interest groups or single-interest constituencies; but we should also distinguish between networks and any of those other collections, for a network need have none of the above.

For something to be “virtual” in the traditional sense, it must operate under some kind of power or agency (a “virtue”) that has an essential and sufficient effect even though the thing in question does not take its usual form. What essential and sufficient agency is at work in a “virtual” school? Surely the answer can’t be “instruction”! Of the three kinds of learning—knowledge, skill, and understanding—educational software can hope to deliver only knowledge. Skill requires coaching, and the last time I looked, almost all coaches were genuine human beings, for how could they not be in order to adapt themselves to their students’ needs? And the promotion of understanding requires Socratic questioning, which software cannot provide, for something like the reason that it cannot play a good game of gō.[1]

When I think of software providing understanding, it puts me in mind of the electronic confessional in THX 1138. The Donald Pleasence character receives “understanding” from his “confessor,” but the movie invites us not to congratulate the effectiveness of future cybernetics but to mourn the threadbareness of a life to which that “confessor” could offer anything significant.

In the most famous example of Socratic questioning, Socrates himself hears his acquaintance Thrasymachus assert that justice is the interest of the stronger party. Socrates asks him a series of questions whose answers lead Thrasymachus to understand that justice cannot possibly be what he has just claimed. Socrates holds him to each answer he gives by asking one more question about that answer till Thrasymachus grasps fully why he was in error to make that assertion. This is not something that can be programmed because—in real life, if not in a dialogue planned by Plato—the programmer cannot know what a respondent’s next answer will be to an open-ended question, and it is these open-ended questions that force the respondent to step out of the box of slogans and memorized lines that he brought to the discussion. Until then, “justice” might as well be the montillation of traxoline.

Good teachers understand all of this, which is why some teachers in Idaho (and elsewhere) are protesting the mandating of online “schooling.” One of them, Ms. Ann Rosenbaum, sounds like a formidable person and a dedicated teacher, and one not to shrink from a struggle. It is a pity that she must come up against such sorry adversaries as Idaho’s governor Otter and its schools superintendent Luna. Luna falls back on vacuous clichés like “schools of the 21st century,” while Otter says that if Ms. Rosenbaum “only has an abacus in her hand, she is missing the boat.” Of course, that is not the only thing that Ms. Rosenbaum has in her hand, as the article shows. (Thankfully, it doesn’t show what Governor Otter has in his hand.)

But she doesn’t need anything in her hand when she is using the Socratic method: “engag[ing] students with questions” and “using each answer to prompt the next” question. Of all the questions Socrates asks Thrasymachus, only the first one could appear on question-and-answer software. Ms. Rosenbaum doesn’t want to give up a rich line of questioning for haring around fields of knowledge with questions asked arbitrarily, which is basically what question-and-answer software does.

A “virtual school” is not a community, nor can it be one. It does not have a sufficiency of action by virtue of which it offers a complete education. It will provide coaching for skill at about the same time that country clubs can replace the pro shop by the machine shop. It cannot impart or ratify understanding. Why are we calling it a “school,” and why are we moving towards such things? I am afraid the answers to these questions have little or nothing to do with education. While we are turning up the answers, let us refrain from “saying the thing that is not,” as Jonathan Swift called it[2]; for an on-line “school” is not a school.



[1] This was written before AlphaGo, but I still find it unlikely that educational software will be able to respond in a genuine way to students’ comments any time soon. In the meantime, human beings should do just fine as teachers, and barkers of “educational” software should wait till it does before touting it. (17 March 2016)

[2] While Gulliver was in the land of the Houyhnhnms