Understanding spurious correlation in data-mining

Last May, Dave at Euri.ca took a crack at expanding Gabriel Rossman's excellent post on spurious correlation in data. It's an important read for anyone wondering whether the core hypothesis of the Big Data movement is that every sufficiently large pile of horseshit must have a pony in it somewhere. As O'Reilly's Nat Torkington says, "Anyone who thinks it’s possible to draw truthful conclusions from data analysis without really learning statistics needs to read this."

* If good looks and smarts are distributed normally, and

* If good looks and smarts have nothing to do with each other, and

* If movie producers want both smarts and looks

* Then, by observing employed actors we’ll assume that looks and smarts have a negative correlation

* Even though we constructed this experiment with no correlation

Here’s a graph of 250 randomly generated points (with no correlation), with the red circles representing “actors who are smart and good looking enough to get a job” (looks+smarts>2), and lighter blue x’s representing “people who wanted to be actors.”

If we only look at actors with jobs, we’ll see a clearly negative correlation between smarts and good looks. In fact, some brilliant actors are less attractive than an average person, and some gorgeous actors are dumber than an average person. Even more interesting, though, is that if we try to rule out bias by looking at aspiring but unsuccessful actors as well, we’ll find that they exhibit a similar correlation...
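The thought experiment above is easy to reproduce. What follows is a minimal sketch, not the code behind the original graph: it assumes looks and smarts are independent standard normal draws and reuses the looks+smarts>2 hiring rule from the excerpt. The names `pearson` and `simulate`, the seed, and the larger sample size in the usage line are all illustrative choices made here (the excerpt used 250 points; a bigger sample just makes the estimates steadier).

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def simulate(n=250, threshold=2.0, seed=0):
    """Draw n aspiring actors and split them by the hiring rule.

    Assumption: looks and smarts are independent N(0, 1) draws, so the
    full population has no built-in correlation.
    """
    rng = random.Random(seed)
    people = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n)]
    hired = [p for p in people if p[0] + p[1] > threshold]
    rejected = [p for p in people if p[0] + p[1] <= threshold]
    return {
        "everyone": pearson(*zip(*people)),
        "hired": pearson(*zip(*hired)),
        "rejected": pearson(*zip(*rejected)),
    }


# Illustrative run with a large sample so the estimates are stable.
print(simulate(n=100_000, seed=1))
```

Under these assumptions the correlation for everyone should sit near zero, while both the hired and the rejected groups come out negative, which is the effect the excerpt describes. With only 250 points the hired group is small, so individual runs will be noisier.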

You’re probably polluting your statistics more than you think (via O'Reilly Radar)

Notable Replies

  1. Wait - like a study showing that some racist, white people have guns?

    Now I'm confused.

  2. Nylund says:

    "Anyone who thinks it’s possible to draw truthful conclusions from data analysis without really learning statistics needs to read this."

    That's all good and true. You probably can't trust the statistical results presented by someone who never really learned statistics. Similarly though, you can't ever put too much weight on critiques of statistical techniques by people who never really learned statistics either. I'm not saying the latter applies here.

    I think a clearer message would be, "Dear non-statisticians, please realize that other non-statisticians do bad statistical analysis."

    There really isn't much here for statisticians to learn from. They already know this stuff. It's more about educating laypeople and/or creating a sense of superiority in one group of non-statisticians over another group of non-statisticians.

    The bad thing about "geek chic" is that it's vastly increased the number of dilettantes who like to lecture people about math and science when their credentials don't extend much past having watched Battlestar Galactica or whatever it is they think gives them nerd-cred.

  3. SamSam says:

    I don't see the cause of confusion.

    It's one thing to say "If you take a humongous pile of Big Data and just randomly run regressions, you are going to find correlations that don't exist."

    It's another thing to say "if you have a hypothesis about two things, and gather the data and see that there is a correlation between the two things, then that adds some evidence that there is a connection between the two things."

    In the case of the guns study, there are numerous plausible hypotheses (off the top of my head: there are more people with guns in certain states, and there are more people who got a higher score on that "symbolic racism" test in those states (maybe because of the wordings in the test), so the two are therefore correlated). Just because this one emotionally got your goat (because it involves guns) doesn't mean you have to say that all statistics are bullshit.

  4. Nylund says:

    As the post states:

    Given two measurements x_i in X and y_i in Y on a set of points p_1, …, p_n in P, if the value of x_i + y_i increases the chance that p_i will be sampled, it will introduce a phantom correlation between X and -Y

    For the gun/racism thing that would translate to a "phantom correlation" if your gun-ownership status plus your racism score made it more likely that you would show up in the American National Election Study. If that is the case, then this particular issue is one to be aware of.

    Out of all the potential statistical problems with the gun/racism study, that's a pretty minor one to worry about though.
