The school where I had my flexible classroom gave us plenty of room to maneuver in other things too. In general, the principle was to trust the teachers’ professional judgment, and in general the principle worked. One of the things the English department decided to do was to implement an in-house testing program using homemade “instruments,” i.e., tests, to size up the proficiency of our students.
We called them simply the Reading Assessment and the Writing Assessment. In one the students answered open-ended questions about a passage they had read; in the other, they wrote an essay on a prompt devised by the teachers. What interests me about these tests in retrospect, particularly in light of all the hubbub about high-stakes testing, is how little we thought along the lines laid out by the high-stakes testing people.
We had two purposes in giving the tests. One was to see whether, in general, we could identify problems and strengths in classes of students in order to shape what we taught and how we taught it. The other was to give us a chance to work together at grading the assessments, thereby achieving a broad consensus on marking.
It’s a good thing we didn’t give these tests for “high stakes.” In many years an insurgency developed among the students, whereby they would throw their tests or otherwise rebel against taking them. Such insurgencies were rarely very big, and we usually ended up knowing what was happening and how it might affect the results. But one year a particularly charismatic student managed to get a fairly widespread test-rebellion going. We found that in the class concerned, the results were unusually poor, taken as a whole.
One set of ruined results did not have any dire consequences: we just marked them as we had planned and then discussed how we could head off test-rebelliousness in the future. But I wonder now what would have happened in a setting where such tests had high-stakes consequences for teachers’ raises or even their future employment. I have never heard of research being done on the problem of throwing tests even though such mischief is a credible threat to the integrity of test results.
We marked the tests at long meetings convened for the purpose, which we called “grading parties.” (Whoopee!) Before the parties we would devise standards for marking, and then at the beginning of each meeting we would go over them, assuring each other that we understood them or modifying them to take account of the group’s consensus about how to proceed. If the results required it, we would revise our standards and re-apply them.
The department head (I was that lucky individual) would prepare the test papers so the students’ names could not be seen by the teachers, and set them up to be graded by two teachers, neither of whom was that student’s teacher that year. After each paper was marked, I would compare the two grades to see whether they were close to each other. If there was a large discrepancy, I would have the two teachers confer about their marks. These conferences were not needed very often, and usually one teacher would end up modifying a mark. When neither one budged, I would step in: since the test score was the sum of the two teachers’ grades, I would issue a revised score and give the two teachers a brief opinion backing my action.
Even without conferences, teachers began to get a sense of whether they were too lenient or too severe in their marking by seeing how their marks compared with those of their colleagues. We also benefited from our colleagues’ ideas, which had not occurred to us. The parties were a good way of smoking out problems that led to unduly severe or lenient results. I found that by the end of a day of marking (for that is how long it usually took, at least) we were closer in our marking than we had been at the beginning of the exercise.
We could use the results to determine what characteristic problems or strengths in writing or reading each class had and (after the beginning-of-year test) teach to those problems or strengths. Each of us had a stronger sense of how the others marked, and we tended to view essays critically in similar ways.
All these seemed like worthwhile goals for testing. We never considered the possibility of using the tests to rate teachers. This was not just because we shied away from such things: we had a program developed by an ingenious consultancy[1] whereby we visited each other’s classrooms and lessons, offering helpful and frank suggestions for increasing or strengthening the learning that took place there.
The reason we didn’t use students’ results to evaluate teachers was that we didn’t believe in “proxy” values. When you evaluate a student, you evaluate a student. When you evaluate a teacher, you evaluate a teacher. It might make sense to use students’ results as part of the evaluation of a teacher if the method of evaluation relied on subtlety and good judgment and took into account more than just students’ scores on a single “instrument,” but fortunately we were not in the position of having consequential decisions about teachers follow from our marking of students.
[1] Looking for Learning, developed by Fieldwork Education Services