A recent article on Education Minister Simon Birmingham’s website entitled ‘Lifting the Bar on Quality Teacher Education’ states that ‘The Turnbull Government is lifting the bar on teaching quality with the rollout of performance assessments for new teachers graduating in 2018.’ and that ‘Minister Birmingham said all new teachers would need to pass a teacher performance assessment before they could graduate.’
This prompted me to share a correspondence on teacher observations and assessments that I recently had with Heather Hill, Jerome T. Murphy Professor in Education at the Harvard Graduate School of Education. I share it here in the hope that Heather’s response can be considered by those groups and individuals who are making decisions regarding this proposed teacher performance assessment.
I asked Heather (with minor edits):
Ollie Lovell here, head of senior maths at a Melbourne School (Australia). Trying to build a healthy culture of teachers visiting each other’s classes.
Dylan Wiliam quotes you on the Craig Barton podcast (an excellent episode by the way!) and says ‘Heather Hill’s work at Harvard suggested that a teacher would need to be observed teaching 5 different classes, with every observation made by 6 independent observers to really be able to reliably judge a teacher.’
But searching through your work, I wasn’t able to find this statistic referenced at all.
Am I just looking in the incorrect places?
In some ways I’d love for this statistic to be true. It would help me to introduce more classroom observations in a non-threatening way. But in another way, I’m doubtful, because there are things that we can observe (feedback, spaced repetition, good modelling of problem solving and metacognition, etc) that can point to whether a teacher is or isn’t on the right track.
Further to this point, I’ve seen people quoting this paper (1) recently as suggesting that “We can’t tell good teaching when we see it” (see tweet here) but in this paper they used ‘thin slicing’, only showing assessors less than 5 minutes of footage from teachers’ classrooms. My response to this would have been ‘Teaching takes place in time, but learning takes place over time (John Mason).’ It makes sense that it’s impossible to tell the quality of a teacher from such a small clip, but that doesn’t necessarily mean that accurate assessments can’t happen in less than 30 observations. I feel that, personally, by observing a teacher in 2 to 3 lessons I can get a pretty good feel for what they’re like, as well as identify some of their key strengths and areas for improvement.
If you’ve got time, I’d love to know where this original stat came from and your take on assessing teachers through observations more generally.
Hope this finds you well Heather and I hope to hear back.
All the best.
Ref 1: Strong, M., Gargani, J., & Hacifazlioğlu, Ö. (2011). Do we know a successful teacher when we see one? Experiments in the identification of effective teachers. Journal of Teacher Education, 62(4), 367–382.
Heather replied:
Thanks for your question about how many observations are necessary. It really depends upon the purpose for use.
1. If the use is teacher professional development. I wouldn’t worry too much about score reliability if the observations are used for informal/growth purposes. It’s much more valuable to have teachers and observers actually processing the instruction they are seeing, and then talking about it, than to be spending their time worrying about the “right” score for a lesson.
That principle is actually the basis for our own coaching program, which we built around our observation instrument (the MQI):
The goal is to have teachers learn the MQI (though any instrument would do), then analyze their own instruction vis-a-vis the MQI, and plan for improvement by using the upper MQI score points as targets. So for instance, if a teacher concludes that she is a “low” for student engagement, she then plans with her coach how to become a “mid” on this item. The coach serves as a therapist of sorts, giving the teacher tools, cheering her on, and making sure she stays on course rather than telling her exactly what to do. During this process, we’re not actually too concerned that either the teacher (or even coach) scores correctly; we do want folks to be noticing what we notice, however, about instruction. A granular distinction, but one that makes coaching much easier.
2. If the use is for formal evaluation. Here, score reliability matters much more, especially if there are going to be consequential decisions made based on teacher scores. You don’t want to be wrong about promoting a teacher or selecting a coach based on excellent classroom instruction. For my own instrument, it originally looked like we needed 4 observations each scored by 2 raters (see a paper I wrote with Matt Kraft and Charalambos Charalambous in Educational Researcher) to get reliable scores. However, my colleague Andrew Ho and colleagues came up with the 6 observations/5 observer estimates from the Measures of Effective Teaching data:
And looking at our own reliability data from recent uses of the MQI, I tend to believe his estimate more than our own. I’d also add that better score reliability can probably be achieved if a “community of practice” is doing the scoring — folks who have taken the instrument and adapted it slightly to their own ideas and needs. It’s a bet that I have, but not one that I’ve tested (other than informally).
The actual MQI instrument itself and its training is here:
We’re always happy to answer questions, either about the instrument, scoring, or the coaching.
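To make the trade-off between number of observations and number of raters concrete, here is a minimal sketch. It is not taken from Professor Hill’s or Andrew Ho’s work. It assumes a simple generalizability-theory style model in which every rater scores every lesson, and the variance components passed in are invented purely for illustration, so the printed reliabilities should not be read as estimates for the MQI or any real instrument.

```python
def score_reliability(var_teacher, var_lesson, var_rater, var_residual,
                      n_lessons, n_raters):
    """Reliability of a teacher's average score over lessons and raters.

    var_teacher  : true differences between teachers (the signal)
    var_lesson   : how much a teacher's score varies from lesson to lesson
    var_rater    : how much raters disagree about the same teacher
    var_residual : leftover lesson-by-rater noise
    All four are hypothetical inputs, not published figures.
    """
    # Averaging over more lessons and raters shrinks each source of noise.
    error = (var_lesson / n_lessons
             + var_rater / n_raters
             + var_residual / (n_lessons * n_raters))
    return var_teacher / (var_teacher + error)


if __name__ == "__main__":
    # Made-up variance components, chosen only to show the shape of the curve.
    components = dict(var_teacher=0.3, var_lesson=0.3,
                      var_rater=0.1, var_residual=0.3)
    for n_lessons, n_raters in [(1, 1), (4, 2), (6, 5)]:
        r = score_reliability(n_lessons=n_lessons, n_raters=n_raters,
                              **components)
        print(f"{n_lessons} lessons x {n_raters} raters: reliability {r:.2f}")
```

Under these assumed numbers, a single observation by a single rater is dominated by noise, and reliability climbs as both lessons and raters are added. How many of each are actually needed depends entirely on the real variance components for a given instrument and setting.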
I hope that Professor Hill’s thoughtful take on assessing teachers through observations can be taken into consideration as decisions are made regarding this new policy.
Edit: Since posting this I’ve had this document brought to my attention which, helpfully, provides some background to this current policy push. It also provides some guidance as to what such performance assessments could look like.