FJ Anscombe's classic, oft-cited 1973 paper "Graphs in Statistical Analysis" showed that very different datasets could produce "the same summary statistics (mean, standard deviation, and correlation) while producing vastly different plots" -- Anscombe's point being that you can miss important differences if you just look at tables of data, and that these differences leap out when you use graphs to represent the same data. (more…)
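
The quartet is easy to verify for yourself. Here's a quick sketch in plain Python (no plotting library; the data are Anscombe's published values) that computes the near-identical summary statistics for all four datasets:

```python
from statistics import mean, stdev

# Anscombe's four datasets: identical x-values for I-III, constant x (bar one
# outlier) for IV. All four share the same means, variances, and correlation.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I":   (x, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def pearson(xs, ys):
    """Sample Pearson correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    return cov / ((len(xs) - 1) * stdev(xs) * stdev(ys))

for name, (xs, ys) in quartet.items():
    print(f"{name}: mean_y={mean(ys):.2f} sd_y={stdev(ys):.2f} r={pearson(xs, ys):.3f}")
```

Every line prints a mean of 7.50 and a correlation near 0.816 -- the plots, of course, look nothing alike.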

If you've read Cathy O'Neil's Weapons of Math Destruction (you should, right NOW), then you know that machine learning can be a way to apply a deadly, nearly irrefutable veneer of objectivity to our worst, most biased practices. (more…)

Andrew Hacker, a professor of both mathematics and political science at Queens College, has a new book out, The Math Myth: And Other STEM Delusions, which makes the case that the inclusion of algebra and calculus in the high school curriculum discourages students from learning mathematics, and displaces much more practical mathematical instruction about statistical and risk literacy, which he calls "Statistics for Citizenship." (more…)

The gold standard for researching the effects of diet on health is the self-reported food diary, which is prone to lots of error, underreporting of "bad" food, and changes in diet that result from simply keeping track of what you're eating. The standard tool for correcting these errors is comparison with *more* self-reported tests.
(more…)

David Cameron wants social media companies to invent a terrorism-detection algorithm and send all the "bad guys" it detects to the police -- but this will fall prey to the well-known (to statisticians) "paradox of the false positive," producing tens of thousands of false leads that will drown the cops. (more…)
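
The arithmetic behind the paradox is simple enough to sketch. The numbers below are hypothetical -- a detector that's 99% accurate, a million users, a hundred actual terrorists -- but the shape of the result holds for any screen that hunts for a rare event:

```python
# Base-rate arithmetic behind the paradox of the false positive.
# Hypothetical numbers: a 99%-accurate "terrorism detector" scanning a million users.
population = 1_000_000
terrorists = 100                 # actual bad guys in the population
sensitivity = 0.99               # P(flagged | terrorist)
false_positive_rate = 0.01       # P(flagged | innocent)

true_hits = terrorists * sensitivity                             # ~99 people
false_alarms = (population - terrorists) * false_positive_rate   # ~9,999 people
precision = true_hits / (true_hits + false_alarms)

print(f"flagged: {true_hits + false_alarms:,.0f}")
print(f"chance a flagged person is actually a terrorist: {precision:.1%}")
```

Even with a wildly optimistic 99% accuracy, fewer than one in a hundred of the people handed to the police would actually be a terrorist.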

The eye-popping stat comes from Philip J Cook's 2007 booze-economics book Paying the Tab. (more…)

The Spurious Correlations engine helps you discover bizarre and delightful spurious correlations, and collects some of the most remarkable ones. For example, Per capita consumption of sour cream (US) correlates with Motorcycle riders killed in noncollision transport accident at the astounding rate of 0.916391. Meanwhile, by exploring the engine, I've discovered a surprising correlation between the Age of Miss America and Murders by steam, hot vapours and hot objects (a whopping 0.870127!).
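
Most of these pairs are just two series that happen to share a trend, and any shared trend produces a high Pearson coefficient. A toy sketch with invented numbers (not the site's actual datasets) makes the mechanism plain:

```python
# Two series that have nothing to do with each other but share a downward
# trend will correlate strongly -- which is all most of these pairs are.
# (Illustrative made-up numbers, not the site's real data.)
from statistics import mean, stdev

def pearson(xs, ys):
    mx, my = mean(xs), mean(ys)
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    return cov / ((len(xs) - 1) * stdev(xs) * stdev(ys))

sour_cream = [9.2, 9.0, 8.9, 8.6, 8.5, 8.1, 8.0, 7.8, 7.5, 7.3]   # lbs/capita, invented
motorcycle = [5.1, 5.0, 4.7, 4.6, 4.2, 4.1, 3.9, 3.6, 3.5, 3.2]   # deaths, invented

print(f"r = {pearson(sour_cream, motorcycle):.3f}")   # near 1.0: shared trend, not causation
```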

Spurious Correlations
(*via Waxy*)

Writing in the Financial Times, Tim Harford (The Undercover Economist Strikes Back, Adapt, etc) offers a nuanced but ultimately damning critique of Big Data and its promises. Harford's point is that Big Data's premise -- that sampling bias can be overcome by simply sampling *everything* -- doesn't hold: the actual datasets that make up Big Data are anything but comprehensive, and are even more prone to the statistical errors that haunt regular analytic science.

What's more, much of Big Data is "theory free" -- the correlation is observable and repeatable, so it is assumed to be real, even if you don't know why it exists -- but theory-free conclusions are brittle: "If you have no idea what is behind a correlation, you have no idea what might cause that correlation to break down." Harford builds on recent critiques of Google Flu (the poster child for Big Data) and goes further. This is your must-read for today. (more…)
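
A toy simulation (mine, not Harford's) makes the point about sample size versus sampling bias:

```python
# "Sample everything" doesn't fix sampling bias: a huge sample drawn from the
# wrong population is confidently wrong, while a small random one is roughly
# right. (Toy simulation with invented numbers.)
import random
random.seed(1)

# Population of 1M people: 30% have some trait of interest. True rate = 0.3.
population = [1] * 300_000 + [0] * 700_000

small_random = random.sample(population, 1_000)   # a classic random survey
big_biased = [1] * 250_000 + [0] * 50_000         # a huge scrape that mostly
                                                  # reaches the trait-havers

print(sum(small_random) / len(small_random))   # close to the true 0.3
print(sum(big_biased) / len(big_biased))       # 0.8333..., huge and wrong
```

Three hundred times more data, and the answer gets worse, not better -- because the bias scales along with the sample.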

Charles writes, "It's hard to imagine how we would have gotten all of the whiz-bang technology we enjoy today without the discovery of probability and statistics. From vaccines to the Internet, we owe a lot to the probabilistic revolution, and every great revolution deserves a great story!

"The Fields Institute for Research in Mathematical Sciences has partnered up with the American Statistical Association in launching a speculative fiction competition that calls on writers to imagine a world where the Normal Curve had never been discovered. Stories will be following in the tradition of Gibson and Sterling's steampunk classic, The Difference Engine, in creating an imaginative alternate history that sparks the imagination. The winning story will receive a $2000 grand prize, with an additional $1500 in cash available for youth submissions."

What would the world be like if the Normal Curve had never been discovered? (*Thanks, Charles!*)

The Pew Internet and American Life project has released a new report on reading, called E-Reading Rises as Device Ownership Jumps. It surveys American book-reading habits, looking at both print books and electronic books, as well as audiobooks. They report that ebook readership is increasing, and also produced a "snapshot" (above) showing readership breakdown by gender, race, and age. They show strong reading affinity among visible minorities and women, and a strong correlation between high incomes and readership. The most interesting number for me is that 76 percent of Americans read at least one book last year, which is much higher than I'd have guessed.

E-Reading Rises as Device Ownership Jumps
(*via Jim Hines*)

Last May, Dave at Euri.ca took a crack at expanding Gabriel Rossman's excellent post on spurious correlation in data. It's an important read for anyone wondering whether the core hypothesis of the Big Data movement is that every sufficiently large pile of horseshit must have a pony in it somewhere. As O'Reilly's Nat Torkington says, "Anyone who thinks it’s possible to draw truthful conclusions from data analysis without really learning statistics needs to read this."
(more…)

These two young fellows are brothers from Palo Alto who've set out to produce a series of videos explaining the technical ideas in my novel Little Brother, and their first installment, explaining Bayes's Theorem, is a very promising start. I'm honored -- and delighted!
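
For the curious, Bayes's Theorem itself fits in a few lines. A minimal sketch with invented numbers, in the spirit of the novel's surveillance math:

```python
# Bayes's Theorem: P(H|E) = P(E|H) * P(H) / P(E),
# where P(E) = P(E|H)P(H) + P(E|~H)P(~H).
def posterior(prior, p_e_given_h, p_e_given_not_h):
    evidence = p_e_given_h * prior + p_e_given_not_h * (1 - prior)
    return p_e_given_h * prior / evidence

# Little Brother-flavoured example (invented numbers): a "suspicious
# behaviour" detector that's right 95% of the time, in a city where
# 1 in 10,000 people is actually up to something.
print(posterior(prior=0.0001, p_e_given_h=0.95, p_e_given_not_h=0.05))
```

The posterior comes out below 0.2% -- a flag from even a very good detector barely moves the needle when the thing you're hunting for is rare.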

Technology behind "Little Brother" - Jamming with Bayes Rule

Alex Reinhart's Statistics Done Wrong: The woefully complete guide is an important reference guide, right up there with classics like How to Lie With Statistics. The author has kindly published the whole text free online under a CC-BY license, with an index. It's intended for people with no stats background and is extremely readable and well-presented. The author says he's working on a new edition with new material on statistical modelling.
(more…)

On the *Guardian*, Charles Arthur has totted up the lifespan of 39 products and services that Google has killed off in the past due to insufficient public interest. One interesting finding is that Google is becoming less patient with its less popular progeny, with an accelerating trend to killing off products that aren't cutting it. This was occasioned by the launch of Google Keep, a networked note-taking app which has the potential to become quite central to your workflow, and to be quite disruptive if Google kills it -- much like Google Reader, which is scheduled for impending switch-off.

So if you want to know when Google Keep, opened for business on 21 March 2013, will probably shut - again, assuming Google decides it's just not working - then, the mean suggests the answer is: 18 March 2017. That's about long enough for you to cram lots of information that you might rely on into it; and also long enough for Google to discover that, well, people aren't using it to the extent that it hoped. Much the same as happened with Knol (lifespan: 1,377 days, from 23 July 2008 to 30 April 2012), or Wave (1,095 days, from May 2009 to 30 April 2012) or of course Reader (2,824 days, from 7 October 2005 to 1 July 2013).

If you want to play around further with the numbers, then if we assume that closures occur randomly as a normal distribution around the mean, and that Google is going to shut Google Keep, then there's a 68% chance that the closure will occur between April 2015 and February 2019. Even the later date wouldn't be much longer than Evernote - which is still growing - has already lasted. Is Google really that committed to Keep?
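
Arthur's method is easy to reproduce in miniature. This sketch uses only the three lifespans quoted above, so its dates won't match his full 39-product calculation:

```python
# Arthur's back-of-the-envelope method: take the lifespans of dead Google
# products, then project mean +/- one standard deviation (~68% of a normal
# distribution) forward from Keep's launch date. Only the three lifespans
# quoted in the article are used here; his dataset had 39 products.
from datetime import date, timedelta
from statistics import mean, stdev

lifespans = [1377, 1095, 2824]   # Knol, Wave, Reader (days)
launch = date(2013, 3, 21)       # Google Keep's launch

mu, sigma = mean(lifespans), stdev(lifespans)
print("projected mean closure:", launch + timedelta(days=mu))
print("~68% band:", launch + timedelta(days=mu - sigma),
      "to", launch + timedelta(days=mu + sigma))
```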

Google Keep? It'll probably be with us until March 2017 - on average
(*via /.*)

Jim Saska is a jerky cyclist, something he cheerfully cops to (he also admits that he's a dick when he's driving a car or walking, and explains the overall pattern with a reference to his New Jersey provenance). But he's also in possession of some compelling statistics that suggest that cyclists are, on average, less aggressive and safer than they were in previous years, that the vast majority of cyclists are very safe and cautious, and that drivers who view cycling as synonymous with unsafe behavior have fallen prey to a cognitive bias that isn't supported by empirical research.

The fact is, unlike me, most bicyclists are courteous, safe, law-abiding citizens who are quite willing and able to share the road. The Bicycle Coalition of Greater Philadelphia studied rider habits on some of Philly’s busier streets, using some rough metrics to measure the assholishness of bikers: counting the number of times they rode on sidewalks or went the wrong way on one-way streets. The citywide averages in 2010 were 13 percent for sidewalks and 1 percent for one-way streets at 12 locations where cyclists were observed, decreasing from 24 percent and 3 percent in 2006. There is no reason to believe that Philly has particularly respectful bicyclists—we’re not a city known for respectfulness, and our disdain for traffic laws is nationally renowned. Perhaps the simplest answer is also the right one: Cyclists are getting less aggressive.

A recent study by researchers at Rutgers and Virginia Tech supports that hypothesis. Data from nine major North American cities showed that, despite the total number of bike trips tripling between 1977 and 2009, fatalities per 10 million bike trips fell by 65 percent. While a number of factors contribute to lower accident rates, including increased helmet usage and more bike lanes, less aggressive bicyclists probably helped, too...

...[Y]our estimate of the number of asshole cyclists and the degree of their assholery is skewed by what behavioral economists like Daniel Kahneman call the affect heuristic, which is a fancy way of saying that people make judgments by consulting their emotions instead of logic.

The affect heuristic explains how our minds take a difficult question (one that would require rigorous logic to answer) and substitutes it for an easier one. When our emotions get involved, we jump to pre-existing conclusions instead of exerting the mental effort to think of a bespoke answer. The affect heuristic helps explain why birthers still exist even though Obama released his birth certificate—it’s a powerful, negative emotional issue about which lots of people have already made up their minds. When it comes to cyclists, once some clown on two wheels almost kills himself with your car, you furiously decide that bicyclists are assholes, and that conclusion will be hard to shake regardless of countervailing facts, stats, or arguments.

Why You Hate Cyclists
(*via Skepchick*)

(*Image: Cyclists Sign, a Creative Commons Attribution (2.0) image from kecko's photostream*)

A paper in *Nature Neuroscience* by Sander Nieuwenhuis and co points out an important and fatal statistical error common to many peer-reviewed neuroscience papers (as well as papers in related disciplines). Of the papers surveyed, the error occurred in more than half of those where it could have occurred. Ben Goldacre explains the error:

Let’s say you’re working on some nerve cells, measuring the frequency with which they fire. When you drop a chemical on them, they seem to fire more slowly. You’ve got some normal mice, and some mutant mice. You want to see if their cells are differently affected by the chemical. So you measure the firing rate before and after applying the chemical, first in the mutant mice, then in the normal mice.

When you drop the chemical on the mutant mice nerve cells, their firing rate drops, by 30%, say. With the number of mice you have (in your imaginary experiment) this difference is statistically significant, which means it is unlikely to be due to chance. That’s a useful finding which you can maybe publish. When you drop the chemical on the normal mice nerve cells, there is a bit of a drop in firing rate, but not as much – let’s say the drop is 15% – and this smaller drop doesn’t reach statistical significance.

But here is the catch. You can say that there is a statistically significant effect for your chemical reducing the firing rate in the mutant cells. And you can say there is no such statistically significant effect in the normal cells. But you cannot say that mutant cells and normal cells respond to the chemical differently. To say that, you would have to do a third statistical test, specifically comparing the “difference in differences”, the difference between the chemical-induced change in firing rate for the normal cells against the chemical-induced change in the mutant cells.
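
Goldacre's "third statistical test" doesn't even need a stats package -- a permutation test on the difference in differences will do. A sketch with invented firing-rate drops matching his hypothetical 30%/15% example:

```python
# The correct test: compare the difference in differences directly, rather
# than comparing one significant and one non-significant result. Invented
# firing-rate drops (percent) for eight cells per group, tested with a
# label-shuffling permutation test.
import random
from statistics import mean
random.seed(0)

mutant_drop = [34, 28, 31, 25, 36, 29, 27, 33]   # ~30% mean drop
normal_drop = [18, 12, 16, 10, 20, 14, 13, 17]   # ~15% mean drop

observed = mean(mutant_drop) - mean(normal_drop)

# How often does randomly relabelling the 16 cells produce a gap between
# group means at least as large as the one we actually saw?
pooled = mutant_drop + normal_drop
n, extreme, trials = len(mutant_drop), 0, 10_000
for _ in range(trials):
    random.shuffle(pooled)
    if mean(pooled[:n]) - mean(pooled[n:]) >= observed:
        extreme += 1

print(f"difference in differences: {observed:.1f} points, p ~ {extreme / trials:.4f}")
```

Here the gap between the two changes is itself significant, so the "different response" claim would be licensed; the Nieuwenhuis point is that many published papers skip this step entirely.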
