American education's use of "value added measures" is statistically bankrupt

American teachers are widely assessed on the basis of "value added measures," a statistical tool for analyzing the outcomes of their teaching. But as Jerry Genovese points out, this is statistically completely bankrupt — unless you randomize your samples, you get no insight into the quality of the teaching. I asked my father, Gord Doctorow — a mathematician, math teacher, and professor of education — what he thought of Genovese's piece, and he sent me some great material, which you'll find after the jump.

In the past few weeks I have been analyzing data from a research project. The topic is not important for our discussion here; the methodology, however, is. The approach I am using is called a gain score analysis. Participants are assigned to one of two groups, and each group receives a different intervention. For each group we measure our outcome variable at baseline, that is, before treatment. After the intervention we measure our outcome variable again. The gain score is defined as the final measurement minus the baseline measurement, in other words, the magnitude of the change. By focusing on the magnitude of the change we don't have to worry about the fact that the baseline scores were not identical. We use a statistical test to see if one group gained significantly more than the other.

A value added measure of teaching is also a gain score analysis. Evaluators measure the students' performance at the beginning of the year and then measure it again at year's end. The difference is the gain score or, as it is called in education, the value added. The average gain score for a group of students is said to be the value added by the teacher.
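To make the arithmetic concrete, here is a minimal Python sketch of a gain score comparison. The scores are made up for illustration, the function names are hypothetical, and Welch's t statistic is only one plausible choice of test; the text above says "a statistical test" without naming one.

```python
from math import sqrt
from statistics import mean, stdev


def gain_scores(baseline, final):
    """Gain score per participant: final measurement minus baseline."""
    return [f - b for b, f in zip(baseline, final)]


def welch_t(a, b):
    """Welch's t statistic for the difference between two group means.

    An assumed choice of test, not one named in the text.
    """
    standard_error = sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    return (mean(a) - mean(b)) / standard_error


# Made-up baseline and final scores, for illustration only.
group_a_gain = gain_scores([50, 55, 60, 45], [58, 61, 70, 52])
group_b_gain = gain_scores([52, 48, 61, 57], [55, 50, 66, 60])

# In value added terms, each mean gain would be credited to a teacher.
value_added_a = mean(group_a_gain)
value_added_b = mean(group_b_gain)
t_statistic = welch_t(group_a_gain, group_b_gain)
```

The computation is the same whether the groups are treatment arms in a study or classrooms in a school. What differs is how participants ended up in each group.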

What is wrong with this approach? After all, it seems to be identical to what my colleagues and I are doing in our research. Unfortunately, there is a crucial difference. In my study the participants were randomly assigned to the two groups. A gain score analysis cannot be valid if the group assignments are not random.

Here's what my Dad added:

I agree with Jerry Genovese. There are several methodological problems with value-added evaluations of teachers, as I understand the concept from Jerry's blog. First, the issue of comparisons: he's right that sampling has to be random. Not only that, the sample has to be large enough to give sufficient statistical power, and it has to be representative. To be representative, the comparison groups have to include certain demographically defined groups of students in the same proportions. Besides that, there is the issue of what constitutes an appropriate measure of value. In the case of student scores, we need to know whether the tests of student performance are good predictors of future success. In Finland, the students are not exposed to such tests until later on, when they compete in the PISA, which is an international test of performance by country. Yet despite, or perhaps because of, this lack of emphasis, they greatly outperform American kids. The value that Finns use to compare teachers is based on rigorous standards of pre-service education, including attainment of a Master's degree, and very competitive salaries. These teachers are expected to be knowledgeable and innovative. In the U.S., the teachers are expected to get their students to attain scores on standardized tests in a high-stakes environment, which inevitably leads to cheating and the sacrifice of creative learning opportunities.

Finally, in order to do a proper comparison of teacher performance, you have to eliminate (control for) variations in the student populations being served. Students learn at different rates, are subject to cultural influences, have varying degrees of home encouragement and support, and the list goes on. There can be no meaningful comparison among teachers who have vastly different student populations because a significant variable plays a confounding role.
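The confounding problem can be illustrated with a small simulation. This is a sketch under loudly stated assumptions, not a model of any real school: every number is invented, the `simulate` function and its parameters are hypothetical, and both teachers are given exactly the same effect by construction. The only thing that changes is whether students are assigned to classrooms at random or sorted by their own learning rate.

```python
import random
from statistics import mean


def simulate(assign_randomly, n_students=2000, seed=0):
    """Return the mean gain in each of two classrooms.

    Both teachers add the same amount by construction, so any gap
    between the classrooms comes from the students, not the teaching.
    """
    rng = random.Random(seed)
    # Hypothetical per-student learning rate, unrelated to the teacher.
    rates = [rng.gauss(10, 3) for _ in range(n_students)]
    if assign_randomly:
        rng.shuffle(rates)
    else:
        rates.sort()  # slower learners all land in the first classroom
    half = n_students // 2
    classrooms = (rates[:half], rates[half:])
    teacher_effect = 2.0  # identical for both teachers (assumption)
    # Each student's gain = own learning rate + teacher effect + noise.
    return tuple(
        mean(rate + teacher_effect + rng.gauss(0, 1) for rate in room)
        for room in classrooms
    )


random_a, random_b = simulate(assign_randomly=True)
sorted_a, sorted_b = simulate(assign_randomly=False)

print(f"random assignment gap: {abs(random_a - random_b):.2f}")
print(f"sorted assignment gap: {abs(sorted_a - sorted_b):.2f}")
```

Under random assignment the two classrooms should show nearly the same average gain. Under sorted assignment the second teacher appears to add far more value, even though the teaching is identical in this toy setup.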

Value added measures of teachers are invalid

(Thanks, Jeremy!)