Data-mining sucks: official report

A multi-year National Research Council review of data-mining as a means of discovering terrorists has concluded that this just doesn't work very well, and that it ends up harming and harassing -- and terrorizing -- innocents whose only crime is to have a profile that some database-designer thinks is hinky.
The report was written by a committee whose members include William Perry, a professor at Stanford University; Charles Vest, the former president of MIT; W. Earl Boebert, a retired senior scientist at Sandia National Laboratories; Cynthia Dwork of Microsoft Research; R. Gil Kerlikowske, Seattle's police chief; and Daryl Pregibon, a research scientist at Google.

They admit that far more Americans live their lives online, using everything from VoIP phones to Facebook to RFID tags in automobiles, than did a decade ago, and the databases created by those activities are tempting targets for federal agencies. And they draw a distinction between subject-based data mining (starting with one individual and looking for connections) and pattern-based data mining (looking for anomalous activities that could show illegal activities).

But the authors conclude the type of data mining that government bureaucrats would like to do--perhaps inspired by watching too many episodes of the Fox series 24--can't work. "If it were possible to automatically find the digital tracks of terrorists and automatically monitor only the communications of terrorists, public policy choices in this domain would be much simpler. But it is not possible to do so."

As a Slashdot poster says, "Can't we just go back to probable cause?"

Government report: Data mining doesn't work well (via /.)

Update: Ennis sez, "That's Bill Perry, former SecDef from 93-97! It's not just some ivory tower analysis then .... "


  1. Come now! As a professional data miner for a major telecommunications company I can tell you that it doesn’t suck. In fact, it’s a whole lot better than working in an electronics factory, cleaning parking lots in the middle of the night, selling blood plasma, or any of the other things I’ve done in the past to earn a living. The only down side is that you never seem to wash off all the data dust…

  2. When I was in graduate school, I had a statistics professor who had this posted on his office wall:
    “You can make your data say anything if you torture it long enough”.

    Words to live by.

  3. Yes!

    And this is why the spying program that’s splitting off internet backbone fibreoptics is so horribly wrong.

    The difference between someone who is monitored and someone who is not is just one bit. One boolean expression. “Does the system think you are worthy of observation?” is the question this expression asks.

    Now, a well trained human being could make a mistake on a question like that.

    I know a lot of people don’t really understand computer programming well, but the limits of the program are the same as the limits of the humans who write it. Their judgment is in effect applied to hundreds of thousands or more communications each day. False positives are inevitable, a given.

    Sometimes I wonder how many of my own posts throw up little flags in their system. Sigh.

  4. Am I the only one paranoid enough to immediately think that finding terrorists isn’t the goal of gov’t data mining?
    Wouldn’t politicians be tempted to find voting patterns in different populations, and try to manipulate and even disenfranchise certain groups of voters?
    Naaah. That would never happen!
    Back to obedience and mindless consumption!

  5. #5 —

    Okay, but the same expert is the subject of another (now, much more dated) article by Simson Garfinkel:

    “Mining data on mutilations, beatings, murders”

    He describes the use of what he calls data mining techniques to “ultimately draw a comprehensive portrait of the guilty” (here, those guilty of human rights abuses). Further, “Ironically, he uses many of the same database-mining techniques used by marketing firms to manipulate consumer opinion or by intelligence agencies to track the movements of dissidents.”

    The article made it sound like it was working in that application.

  6. The very best that data mining could accomplish if it worked would be to accumulate circumstantial evidence. And how many people have been convicted of crimes by just enough evidence to convince the cops or prosecutor that they are justified in rigging the outcome? If only more avenues than DNA testing were available to exonerate those railroaded by officials who “know best”, then we would understand what that optimum outcome, successful data mining, would give us. I believe it would give us nothing more than misplaced confidence to persecute those the algorithm fingers as guilty.
    And if you think that mere facts about the technique’s uselessness will stop law enforcement from employing it, let me direct your attention to the polygraph…

  7. You want to know why the current crop of spy agencies love data mining? It’s a bottomless well of work into which can be thrown endless amounts of money and manpower. Even if there’s no particular use in the work, there’s never any lack of it, so you can always use the “we’re swamped” argument to justify bigger budgets and greater powers.

    It’s the perfect solution for being stuck with a job with no chance of success – i.e. fighting terrorism through spying instead of diplomacy. At the same time, you can continue to feed the NSA’s mighty shadow empire.

  8. Plus bureaucrats can brag “after the fact” that they found data on some terrorist. And the bonus is you can sit on your ass and torture your data, as one person already said.
    Random data can be helpful sometimes. However, you still have to have real intelligence. Phone records don’t really prove anything.

    Vote or shut up – you make the choice

  9. Bragging after the fact is only good when you’ve captured someone. When you’re found to have incriminating or even suggestive data after the terrorist has crashed their plane, set off their bomb, or assassinated the President, then you look like you weren’t paying attention to your work.

  10. As someone who does data mining in academia, I have to lament the fact that it seems like folks don’t know that data mining is used for things other than spying and marketing. Data mining doesn’t suck in general – it sucks in this specific domain. Data mining is awesome in mining gene sequence data, analyzing documents, interpreting experiment data, etc.

    How does Netflix make recommendations for you? Data mining! How will folks at the LHC sort through the gigs of data they’ll create? Data mining!

    Data mining is just a tool. Whether or not it sucks is determined by who wields it, and what they use it on.

  11. But when has “it doesn’t work” stopped a government from doing anything? Have you seen any decline in the rate with which the British government is carpeting the landscape with surveillance cameras? Didn’t think so. And the present US administration has practically made “Everybody tells us it’s a bad idea, but we’re going to do it anyway” their motto.

    In any case, it’s not as if most law enforcement agencies would have a problem with the idea that they might end up reading the emails of a thousand innocent people in the course of trying to nail one bad guy. Nor will they have a problem with the fact that whatever technology they’re using was introduced to “protect us from terrorists”, but they’re using it to catch tax dodgers and people with unpaid parking tickets.

    In short, the fact that data mining has been convincingly shown to be ineffective won’t actually prevent it being deployed.

  12. THEY wield it, and they’re using it on us.

    Naomi Wolf’s new video is a must see.


    End of America Trailer

    Sarah Palin and the Police State

    “…Am I trying to scare you? I am. I am trying to scare you to death and ask you to scare your Republican and independent friends most of all. How do you know when it is war on citizens? When there are mass arrests, journalists are jailed, the opposition is infiltrated, rights are stripped and leaders start to ignore the rule of law.

    Almost everyone I work with on projects related to this campaign for liberty has been experiencing computer harassment: emails are stripped, messages disappear. That’s not all: people’s bank accounts are being tampered with: wire transfers to banks vanish in midair. I personally keep opening bank accounts that are quickly corrupted by fraud…”

  13. #8: Data mining is an over-used term, used both as a synonym for “numerical analysis” by the press and as a pejorative term by statisticians meaning “fast uncritical analysis with a goal already in mind; prone to cherry-picking and unwarranted conclusions”; among other meanings.

    In short the reason the human rights guy isn’t doing the bad-kind-of-data-mining, is that he is keeping his conclusions in the aggregate. He’s not finding criminals, although his conclusions are useful there. He is rather showing that the data of civilian deaths/migration are inconsistent with random “chaotic” displacement. This is a conclusion about a population, which is where statistical arguments are born and have the most power.

    By contrast, the other kind of “data mining” which tries to pick individuals out of a population is clearly prone to an embarrassing number of false positives since randomness overwhelms the weak (rare) signal, as illustrated here:

  14. #19 I agree that the use of data mining for targeting citizens for any reason is a bad idea – any data mining researcher can tell you that data mining methods can never give accurate results all the time. My point was simply that blaming “data mining” for governmental tyranny is like blaming farmers for people choking on food.

  15. A good saying from a Russian Inet digest:

    Paranoia is a myth created by those who are trying to fool you!

  16. As someone who spends some of their professional time building and applying data mining (to business problems, not spying on you… chill) I have to agree. There are lots of great ways to use data mining when you have *reliable* data sets to work with and clearly definable & testable targets like say… sales totals. I can test and see if my pet-theory data miner in action increases sales totals. I cannot test if my pet-theory data miner in action reduces terrorism. At least not with any usable level of reliability.

    These techniques typically don’t work so well when you’re working with gargantuan amounts of data from a wide number of data sources with widely varying accuracy… like say machine translated text of conversations, financial transactions, nationality information, etc. And if you want to lower your odds even more add a highly subjective & untestable goal like stopping terrorism.

    Sounds like pet theory city to me. I’m totally comfortable trying out pet theories on reasonably testable scenarios with low risk like in a business marketing campaign. I’m equally uncomfortable trying out a pet theory that, if it’s wrong (which it will be to some degree even if I’ve got a brilliant idea), will violate people’s privacy and IMO potentially cause much more than a privacy violation.
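
The false-positive argument that runs through the comments above (a rare signal swamped by errors once you screen a whole population) can be made concrete with a little arithmetic. What follows is only a rough sketch. Every number in it is an illustrative assumption, not a figure from the NRC report: a population of 300 million, 3,000 real targets, and a hypothetical classifier that catches 99% of real targets and wrongly flags 1% of everyone else. The function and variable names are made up for this example.

```python
def probability_flagged_is_target(base_rate, sensitivity, false_positive_rate):
    """Bayes' rule: P(real target | flagged by the system)."""
    true_flags = base_rate * sensitivity
    false_flags = (1.0 - base_rate) * false_positive_rate
    return true_flags / (true_flags + false_flags)


# All values below are assumptions chosen for illustration only.
population = 300_000_000       # people whose records are screened
targets = 3_000                # real targets hidden in that population
sensitivity = 0.99             # share of real targets the system flags
false_positive_rate = 0.01     # share of innocents the system flags anyway

base_rate = targets / population
p = probability_flagged_is_target(base_rate, sensitivity, false_positive_rate)

# Expected head counts behind that probability.
expected_true_flags = targets * sensitivity
expected_false_flags = (population - targets) * false_positive_rate

print(f"P(target | flagged)    = {p:.6f}")
print(f"real targets flagged   = {expected_true_flags:,.0f}")
print(f"innocent people flagged = {expected_false_flags:,.0f}")
```

Under these assumed numbers, roughly one flagged person in a thousand is a real target, even though the made-up classifier is "99% accurate". The same function applied to a common event (say a base rate of 0.5) gives a very different answer, which fits the point made above that these techniques do better on aggregate or frequent outcomes than on picking rare individuals out of a crowd.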

