Lawyer-turned-data-scientist David Colarusso analyzed 2.2 million sentencing records from Virginia to determine the relationship between race, income and treatment in the criminal justice system.
Colarusso uses his analysis as a primer on data analysis, explaining regression and data normalization in simple language accessible to people without a statistics background.
Then he gets to the conclusion: race swamps class as a factor in sentencing. He writes that "for a black man in Virginia to get the same treatment as his Caucasian peer, he must earn more than half a million dollars a year."
What's more, "the same holds for American Indians, Asian or Pacific Islanders, and Hispanics."
Update: Peer review works! Some of Colarusso's readers reviewed his work and found an error that put the number closer to $90,000 than $500,000. He writes, "Additionally, the offset for Asian defendants turned out to be a little less than half that of Black, Native, and Hispanic defendants."
Colarusso's closing advice, in his own words:
I did all of my analysis with freely available tools, and there's nothing stopping you from picking up where I left off. In fact, I hope that a few of you will look at this GitHub repo and do exactly that. However, it's important to note that you need a solid foundation in statistics to avoid making unwarranted claims due to lack of experience.
And beware the danger zone! As Drew Conway (creator of the Venn Diagram above) points out, "It is from [that] part of the diagram that the phrase 'lies, damned lies, and statistics' emanates."
That being said, there is nothing magic here. You can also discover hidden truths. My advice? Be suspicious of answers that reinforce your existing assumptions. Do your work in the open. When confidentiality allows, share both your findings and your data. Have someone check your math. Listen to feedback, and always be ready to change your mind.
Uncovering Big Bias with Big Data