Data Schmatta

The Gates Foundation’s Measures of Effective Teaching (MET) project produced two culminating reports last year at a cost of forty-five million dollars. The goal was to test the evaluation of teachers by students and by “value”-“added” “metrics” (VAMs) and to promote the use of these evaluations combined with each other and with classroom observations (called composites when taken together in equal measure). The reports were the object of a careful and thorough demolition by Professors Jesse Rothstein of Cal Berkeley and William J. Mathis of the University of Colorado. Allow me to conduct you briefly through the ruins.

•   “[T]eachers whose prior year classrooms were especially high- or low-achieving were not included,” which means that the recommendation of a comprehensive policy rests on the findings of a non-comprehensive study.

•   “Among the MET districts, the fraction of students who actually were taught by the teacher to whom they were randomly assigned ranged from about one-quarter to two-thirds,” which means that teachers’ VAMs were often based on the scores of students other than the ones they were assigned in the study.

•   The study “presumes a single ‘general’ teaching quality factor” whose existence is questionable (see my posting on The Phantom VAMs), which means that VAM-based evaluation may be like that new set of clothes for the emperor.

•   “The reports do not consider the substantial qualitative literature that considers teaching as a multidimensional enterprise that serves a variety of purposes beyond test-score improvement,” which means that it is as worthless to rely solely on measurement to determine teachers’ quality as it would be to rate the pictures in the Louvre using a scorecard.

•   “[A]lternative, more conceptually demanding tests, which appear to capture an additional dimension of teacher effectiveness that is not measured by the regular state tests,” showed a correlation of 0.25 or less with results on the “composites.” This leads Rothstein and Mathis to warn that “[t]here is evidently a dimension of effectiveness that affects the conceptually demanding tests that is not well captured by any of the measures examined by the MET project.” Read Melville’s “The Whiteness of the Whale” and answer these questions: 1. Moby-Dick is A. white B. pink C. puce D. chartreuse E. black.  2. Whiteness is A. ambiguous B. troublesome C. horrifying D. polysemic E. mighty fine.

The critics’ conclusions:

•   “There is no statistical procedure that can decide which dimensions should count more in teacher evaluations; that decision must be left to subjective judgments.”

•   “[S]ome of the dimensions of effectiveness—those measured by student performance on alternative, open-response (i.e., not multiple choice) tests designed to capture higher-order thinking—are not at all well measured by the standardized measures that were the focus of the MET research.” [emphasis added]

•   An evaluation system based on the MET measures “will likely discourage teachers from putting their efforts towards the kinds of learning that the alternative tests measure—in this case, higher-order, conceptual thinking.”

•   “There are a number of ways for a teacher to influence her VA score other than through actually becoming more effective: She can arrange to be assigned a different set of students; she can reduce the amount of class time that she devotes to non-tested subjects (e.g., history) in order to allocate more time to the subjects covered by the tests; she can teach test-taking skills in place of substantive knowledge; and she can try to rally students to give their best effort on a test which is usually, from the student’s perspective, entirely meaningless.”

And do bonuses for good scores result in good scores? Rothstein and Mathis note that experimental analysis of “pay-for-performance” programs in the U.S. shows that “teachers eligible for bonuses do not produce higher value-added scores—let alone better teaching along other dimensions—than do control-group teachers who are ineligible for bonuses.”

A friend of mine has done some research, not yet published, that casts another shadow over the use of one of the MET “composites.” It reveals that university students often throw their “ratings” of teachers in favor of those who give light work and against those who make difficult or troublesome demands of them. The problem is that improvement in critical thinking has been shown to coincide with taking reading- and writing-intensive courses and studying under teachers with high expectations.

Not that the students themselves are always good judges of their own proficiency. My friend’s data show that more than half of the students surveyed think themselves proficient at reading scholarly work, but their teachers are almost unanimous in claiming that they do not possess this power. And what do these same students think is an unreasonable expectation of work assigned? The unpublished results say that 90% of them find 40 – 50 pages a week for a 3-credit university course “too much reading.” It is a sorry reflection on their expectations that some Cliff’s Notes would be counted as excessive reading, and it is a sorry reflection on the abdication of sound judgment by the Ed Biz to realize that these same students’ ratings contribute at some schools and universities to corrupted decision-making about their teachers’ continuing to work.


Gaining Students’ Attention: William James Meets Chloe of the Swamp

2016 will mark the hundredth anniversary of the posthumous publication of William James’s Talks to Teachers on Psychology, and to Students on Some of Life’s Ideals. I am jumping the celebratory gun in order to deal with the troublesome connection between the ability to pay attention and the related ability to do repetitive or otherwise “repulsive” work, as James calls it in the Talks.

This is an important issue to teachers because they are under the two seemingly competing pressures to give thorough, effective instruction and to make their lessons interesting. As much as one might like to think otherwise, effective instruction will necessarily entail some repetition, and so the problem is how to get students to attend to it when they have seen or done it before.

James helps us to understand how to approach the problem and, with some planning, to solve it. In the Talks’ lecture on Attention he distinguishes between two kinds of attention: passive attention and voluntary attention. The first kind is the attention we pay without trying; the second requires effort and discipline. The first seems to sweep us away, hence our “passivity”; the second requires us to do the sweeping.

James notes that the simplest way to attend continually to something is to find continually new associations with it. This is what the “genius” does. The rest of us, not quite so fertile in our associative thinking, must take pains (as indeed must the genius in spheres outside his area of ingenuity). The newness must be of association to what is already known: as James says, “The absolutely new makes no appeal at all. The old in the new is what claims the attention—the old with a slightly new turn.”

My favorite illustration that the absolutely new makes no appeal at all comes from my old Aunt Augusta (yes, I actually had one). Her son’s family took her to see Star Wars, then a new movie in general release. Old Aunt Augusta said, “I couldn’t tell what was going on. All those lights! All that noise!” The truth is that young students can be as resistant to novelty as old Aunt Augusta. I have seen it in classes’ initial reaction to Picasso’s Guernica, compared with their reaction after they have connected it to material they are more familiar with. Or, for that matter, in their initial bewilderment over a Wallace Stevens poem (like “Disillusionment of Ten O’Clock”) and their interest once they have been able to anchor it in something they can hold on to (including the jingle that begins “Red sky at night, / sailor’s delight”). Even a photo means more if there’s a way in. I have mentioned students’ reaction before the Harry Potter books to a picture of Professor James Murray: he was a weird old bearded man reading a book. Now, they say with varying degrees of delight, “He looks like Dumbledore.”

It is the teacher’s job to help make these associations in order to preserve students’ attention against its inevitable tendency to wander. “The genius of the interesting teacher,” says James, “consists in sympathetic divination of the sort of material with which the pupil’s mind is likely to be already spontaneously engaged, and in the ingenuity which discovers paths of connection from that material to the matters to be newly learned.” He goes on: “If the topic be highly abstract, show its nature by concrete examples. If it be unfamiliar, trace some point of analogy in it with the known. If it be inhuman, make it figure as part of a story. If it be difficult, couple its acquisition with some prospect of personal gain. Above all things, make sure that it shall run through certain inner changes, since no unvarying object can possibly hold the mental field for long.”

But the teacher has another job as well: to get students to attend to things when even the power of calling up a variety of associations fails or is at an end. Since “most schoolroom work, till it has become habitual and automatic, is repulsive, and cannot be done without voluntarily jerking back the attention to it every now and then,” the teacher must vary the way in which material is reviewed: “The posture must be changed; places can be changed. Questions, after being answered singly, may occasionally be answered in concert. Elliptical questions may be asked, the pupil supplying the missing word. The teacher must pounce upon the most listless child and wake him up. The habit of prompt and ready response must be kept up. Recapitulations, illustrations, examples, novelty of order, and ruptures of routine,—all these are means for keeping the attention alive and contributing a little interest to a dull subject. Above all, the teacher must himself be alive and ready, and must use the contagion of his own example.”

Finally, a good teacher must take advantage of “borrowed interests.” If a student can no longer find an intrinsic or associated interest in scanning a poem, he can “borrow” an interest in gaining recognition, gaining the admiration of his classmates, getting a desired grade (by real work), gaining a teacher’s favor (the teacher must in such cases be real not virtual), and of course avoiding disfavor and punishment. In a culture where none of these interests exist or are taken seriously, they cannot be borrowed!

A number of these resources from the old Bag of Tricks came together yesterday in a field trip I took with my colleague the biology teacher and her Grade 10 students[1]. It may surprise people familiar with Hong Kong’s reputation as the Vertical City that its geography includes mangrove swamps, but that is so. One of the places where they enjoy the government’s protection is called Hong Kong Wetland Park. The park has a strong mission of education as well as of preservation, and so it has been made wonderfully available to students. For a fee of about US$50, twenty-seven students and three teachers had a day-long program of activity. The morning was given over to a tour of the swamp’s boardwalks, blinds and viewpoints conducted by a qualified naturalist. In the afternoon the students went to one of the visitor center’s laboratories to make tests of water samples drawn from two different locations in the swamp. In these they were assisted/supervised by three other trained young scientists. The students’ task was to test for eight physical and eight chemical parameters of the water. The leader of this activity asked them to decide whether there were any human or systemic errors in their work (there were), and was able to advise at least two students that their questions would make good topics for university-level research.

But in addition to all the new material about mangroves, the day helped give the students needed practice in some of the tedious pick-and-shovel work that inevitably attends laboratory science as it is really done. The change of scene worked to establish associated interest. The work with “new” people piqued curiosity and borrowed the interest of appearing to advantage in front of strangers. Having to take the tour in Hong Kong’s infamously bad summer weather (90 degrees, 85% humidity, and tropical sunshine) contributed to a sense of accomplishment (or relief) at the tour’s end. Certificates awarded on completion borrowed another interest. Finally, the biology teacher’s promise of extra ice cream for those who did certain things in the best time added a bit of excitement too.

Since we can’t always go trooping off to the neighborhood mangrove swamp to perk things up or turn out our pockets to buy ice cream, we also need to find ways of bringing into ordinary classrooms the ability to find associative and borrowed interest for the students to bring to what they learn. Otherwise, all we can do is coerce attention, and as James rightly points out, that is bound to fail: “You can claim [attention], for your purposes in the schoolroom, by commanding it in loud, imperious tones; and you can easily get it in this way. But, unless the subject to which you thus recall their attention has inherent power to interest the pupils, you will have got it for only a brief moment; and their minds will soon be wandering again.”

[1] She is also an experienced teacher trainer of long standing.


Value Added Metrics and the Bíg, Bláck Blóck

“[VAMs] are not yet, I think, up to the task of being put into, say, an index to make important summative decisions about teachers.”—Morgan S. Polikoff

To sit in solemn silence in a smáll clássróom

While reading the VAM ratings that will séal óne’s dóom,

Awaiting the sensation of a shórt, shárp shóck

From a cheap and chippy chopper on a bíg, bláck blóck.

—with apologies to W. S. Gilbert

Now that a judge has thrown out California’s system of teacher tenure, it becomes more important than ever to ensure that decisions to fire “ineffective” teachers do not become arbitrary and capricious. Unfortunately, decisions based on “value”-“added” “metrics” are likely to be just that. Professor Polikoff, whose statement opens this posting, has carefully studied the correlation of VAMs and teacher effectiveness and found that it is at or near zero. If you don’t want to spend thirty dollars for his study, a YouTube clip is available in which Polikoff discusses his findings.

He notes in that clip that “state tests don’t seem to be sensitive to many of the things we think of as defining quality instruction.” If you have read my last posting, on Chesterton and the Beanstalk, you have a commonsense explanation why not. As it turns out, I can also oblige the Research Enthusiasts, this time with a study of science teaching[1]. This study opens with a bang: “Education reform usually arrives with fanfare, great expectations, and overconfidence. Truth be known, typical education-reform effects tend to be small. Evaluations, if done at all, burst the reform balloon, having difficulty finding effects. After some period of time, enthusiasm and financial support wane. The remnants of reform show but faint traces of the great expectations.” Does that sound familiar to victims, I mean implementers, of NCLB and RAce to the Top[2]?

Ruiz-Primo and her colleagues approached assessment by examining its proximity to the student. They rated assessments as having six degrees of separation, as it were, from the class, and they noted whether the assessments rated “declarative knowledge,” “procedural knowledge” or “strategic knowledge,” i.e., knowledge, skill, and understanding, in an “expanded idea of achievement”[3]. What they found was that “distant” assessments, e.g., state and national tests, were less likely to capture the varieties of learning than were “close” assessments, e.g., science notebooks. As they put it in their conclusions, “close assessments were more sensitive to changes in student performance, whereas proximal [i.e., distant] assessments did not show as much impact of instruction.” I have argued elsewhere that VAMs based on such assessments measure nothing at all, and that they are arbitrary and capricious in their effects.

When we compare using VAMs to make personnel decisions in education with the methods of Gilbert and Sullivan’s Lord High Executioner, we must regretfully conclude that His Lordship is on a solider footing than the VAM enthusiasts: at least he works for a “more humane Mikado” who “lets the punishment fit the crime.” In the case of VAMs, which bear little relationship to “teacher observables,” the people being punished have not even been detected in a “crime.”

[1] Ruiz-Primo, Maria Araceli et al. “On the Evaluation of Systemic Science Education Reform: Searching for Instructional Sensitivity.” The authors mean sensitivity to the effects of instruction, not sensitivity in instructors or their means. With becoming modesty they note that their conclusions are tentative.

[2] …and new math, and open classrooms, and whole language, and outcome-based education, and mastery learning, and …

[3] p. 374


Chesterton and the Beanstalk

G. K. Chesterton, one of the writers recommended for study in the Grade 12 Common Core, was a very big man, and as genial about his weight as about everything else. He was giving a public lecture during the Great War, which began a hundred years ago this summer, and during question time he was asked by a gimlet-eyed member of the audience, “Why aren’t you at the front?” He replied, “If you go around to the other side, you will see that I am at the front,” which brought down the house. I am not going to have Chesterton take Jack’s place on the beanstalk, but will come back to him after we have examined Jack’s climbs.

For Jack and Chesterton both illustrate in their own ways the radical unsatisfactoriness of the plan by RAce to the Top (RAT) to support and require standardized multiple-choice tests of literary understanding in the Common Core. There are two versions of “Jack and the Beanstalk” frequently used, of which the commoner by far is the one by the folklorist Joseph Jacobs. The other is by the Victorian moralist Benjamin Tabart. The Jacobs version is also more satisfactory: It is closer to the story as it was known in the oral tradition, and it happens to have an ambiguity that can be wonderfully productive of good thinking by second-graders who read it. The Great Books Foundation has even produced sample lessons for second-graders on “Jack” using the Jacobs version. In it two issues come up for consideration that do not have a single “best” answer:

1. Why does Jack climb the beanstalk the third time?

2. Why is the ogre’s wife kinder to Jack than is his own mother?

It is possible, as Jacques Barzun says in another context, that “diversity will prevail, one or more groups and individuals being persuaded or confirmed in a different position. And that too is highly instructive.”   In a properly conducted discussion, even second-graders can be brought to justify their explanations, to ask questions of children who hold other views, and to accept that reasonable people can differ on the answer.

But multiple-choice tests are not equal to even this level of second-grade thinking because they always require “the” “best” answer, or a fixed combination of “best” answers. Another reason that a nationally standardized test might not work is that multiple versions of a story may exist but no single one is required. A third is that a teacher taking intellectual short cuts—say, for test preparation—can substitute items of declarative knowledge for the exploration and discussion that foster true understanding. (As has been shown[1], it is possible to answer correctly about items of “knowledge” on a multiple-choice test without having the slightest understanding of them.)

A Common Core reading for Grade 8 is “The Road Not Taken” by Robert Frost. I still remember my 8th-grade classmate Eric giving an extraordinarily good recitation of the poem by heart, as do two friends of mine who were in that class too. Eric took the view that the speaker of that poem was an indomitable individualist, which I understand is the predominant reading. But it happens that the speaker can also be heard as an ambivalent ditherer, and Frost himself seems to support that view, having said that the speaker reminded him of a good friend who could not make up his mind. There is certainly room for discussion, which would be recognized in a good classroom or on a good essay but not on a multiple-choice test.

And so we come to Chesterton. What is going to happen when 12th-graders encounter the Prince of Paradox on their Common Core syllabus after spending eleven years “learning” canned “interpretations” and taking literary works as secret codes for declarative knowledge given to them by their teachers and asked of them by multiple-choice questions? Take for example “A Piece of Chalk,” justly regarded as one of his best essays. In addition to being an accomplished writer and controversialist, Chesterton was an accomplished amateur artist, as this whimsical self-portrait shows. In the essay he sets out for the country to draw, after having secured the needed materials, including brown paper and a piece of white chalk among other colors. Why brown paper? Because he “liked the quality of brownness in paper, just as [he likes] the quality of brownness in October woods, or in beer. Brown paper represents the primal twilight of the first toil of creation.” I can imagine a discussion of why Chesterton likes brown paper, but I can’t imagine a multiple-choice question about it that would be sure to elicit genuine understanding.

And what does he end up drawing? “[T]he soul of a cow; which [he] saw there plainly walking before [him] in the sunlight; and the soul was all purple and silver, and had seven horns and the mystery that belongs to all beasts.” I picture multiple-choice questions about its gait, its coloring, and the number of its horns; but what question will elicit genuine understanding while smoking out faked understanding? That is what Socratic discussion and essays are for.

In a discussion or an essay the student could link the drawing of a cow’s soul to the following statement that “though [he] could not with a crayon get the best out of the landscape, it does not follow that the landscape was not getting the best out of [him].” The student could discuss the transition it provides to a general discussion of the linkage of art, nature and spirituality in more than just Romantic ways. If the student had already read Moby-Dick, he might contrast Chesterton’s view of whiteness and Herman Melville’s in the chapter on “The Whiteness of the Whale.” After further thought the student might then be ready to try putting together an understanding of Chesterton’s closing sentences in the essay: “And I stood there in a trance of pleasure, realizing that this Southern England is not only a grand peninsula, and a tradition and a civilization; it is something even more admirable. It is a piece of chalk.”

1. Chesterton believes that

I. Southern England is a piece of chalk.

II. A piece of chalk is better than English civilization.

III. White chalk symbolizes a religious response to Romantic naturalism.

A. I and II  B. I and III  C. II and III  D. I, II and III  E. None of the above

2. You believe that

A. Chesterton is unsuitable for study because genuine understanding of his work cannot be captured in multiple-choice questions.

B. Multiple-choice questions are unsuitable for examinations because they cannot capture genuine understanding of Chesterton.