Over-surveillance makes it harder to fight crime

My latest Guardian column, "Surveillance: You can know too much," explains how collecting too much information on innocent people makes it harder to catch guilty ones:
At a certain point, data gathered to predict the weather overwhelms your capacity to add it to your calculations efficiently, resulting in ever-longer runtimes that give less accurate predictions. It's better to crunch the data needed to calculate tomorrow's weather in 10 minutes (and refine your guess twice an hour) than to shovel so much data into the hopper that you don't get tomorrow's forecast until next week.

The sweet spot lies somewhere between gathering too much information and gathering too little – and the secret to hitting that spot is intelligent, discriminating data-acquisition.

Take London: cover every square inch of the city with CCTVs and you'll get so much information that you'll never make any sense of it. Scotland Yard says that CCTVs help solve fewer than 3% of all crimes, and a study in San Francisco found that at best criminals simply move out of camera range, while at worst they assume no one is watching.

Similarly, if you take fingerprints from every person who applies for a visa – or worse still, from every person in Britain who has to carry one of the proposed new biometric cards – you will fill the databases with chaff that slows down searches, generates endless false matches, and threatens everyone in the database with the worst kind of identity theft.



  1. so, i should stop spying on all my neighbors, and just focus my attention on one?
    pretty sure it's the hairy one, northeast side. sick dude, his dogs poop ALL over my front yard.

  2. Though I’m strongly opposed to invasive government surveillance, I have to wonder about the validity of this argument.

    The claim seems to be that if you collect data overzealously, you a) increase the time to process your query, and b) increase the likelihood of false positives.

    For a), it seems that’s a pretty trivial problem, with the most obvious solution being “bigger computers”. Surely it’s possible to crunch through a larger data set; it’s just hard for the current hardware. More expensive tech, or future tech, can easily solve that problem.

    For b), false positives can be reduced by not collecting the data in the first place, or by collecting it but filtering it out in your query. Most such filters are difficult or ineffective now only due to the speed or storage limitations of particular systems. Basically, false positives can be reduced by improving the queries, which again is a relatively trivial problem.

    My concern is that these kinds of objections are easy to resolve, and suggest that opposition to invasive surveillance stems from a complaint about it “not being good enough”. And that’s not really an intractable problem. After all, even if our data mining techniques are crude now, there’s no harm in gathering the data for some eventuality when we have fast enough computers and good enough queries to mine it effectively….

  3. As security expert Bruce Schneier has said, when you’re looking for a needle in a haystack, it doesn’t help to add more hay.

  4. @4, that’s a great metaphor, and very illustrative when it’s appropriate, but I’m not sure it’s relevant to this situation. That metaphor presupposes that there’s one target you’re looking for, and it’s sure to be in the original dataset. Therefore, adding additional data would be nonsensical.

    In reality, there is only a small chance that your target is in the original dataset, and an even smaller (ok, really, REALLY small) chance that it’s in an expanded dataset. However, there’s still a possibility, which means that in reality, adding more “hay” does help, because it’s possible that the needle was in the hay you added.

    Again, I’m not arguing in favor of this kind of surveillance, I’m just saying this isn’t a good way to argue against it.

  5. There’s a simple solution to the ‘false positive’ problem: pass enough laws so that everyone is guilty of something, and then anyone you investigate will be a criminal.

    “No one is innocent, citizen. We are merely here to determine the level of your guilt.” [Judge Dredd]

  6. The anti-profilers’ chickens have come home to roost. Let’s not use targeted information to catch those we know empirically to be more likely to commit criminal acts; that would be unfair, and God forbid we be unfair. Let’s watch everybody — that will work. The British deserve it — sadly.

  7. @8: Regardless of the moral question, profiling of the kind you’re suggesting (presumably based on class, race, religion, ethnicity, age, style of dress, etc.) is not effective. It has been demonstrated to be ineffective quite thoroughly, and people only continue to believe in it because of their own prejudices regarding the aforementioned characteristics of race, religion, class, etc.

    @7: You say that the cameras installed in local schools are OK because they’re intended as a documentation system, not a security system. The intentions behind surveillance are always presented as benign, and if it were always used benignly there would likely be no problem. The outstanding question is whether the authorities in control of the surveillance can be trusted to use it fairly and responsibly.

  8. Similarly, when Steve Fossett was lost in the Nevada desert, the higher resolution of the imagery meant that the number of frames to be searched went way up, consequently slowing everything down.
    Makes me wonder why the awareness needed to locate threats is so often called “intelligence”, since the rules for generating it are almost inevitably drawn up by those who never seem to exhibit any.

  9. Another aspect of the over-watched society is how it pushes the truly desperate criminals (feeding a habit, stealing to survive) to be more violent, more extreme. They have ‘nothing to lose’ and end up fighting some poor clerk or teller they’re robbing over a DVR or video tape. That’s how people get hurt and killed where they would otherwise just have been robbed.

    3 percent!?! Aside from the civil liberties shredding, that’s an awful return on investment for the millions spent putting up these electro-eyes.

  10. Yup, tyranny of numbers and false positives. Our AI isn’t very I, yet. Like you point out in “Little Brother”, leading algorithms on massive, distributed wild goose chases is far too easy. I guess we just have to count on the Futility of It All to eventually make its own point. But I’m not holding my breath.

  11. #3: if you think better computers will fix these problems, then you’ve never actually talked to security types. The problem exists between desk and chair with this bunch. Abstract, creative and critical thinking are all highly discouraged in security personnel, because all rules are good rules and are there for a good reason. Always.

    Evidence of how these things can go bad exists in the fact that several government agencies in Britain, Canada, the US and other places have lost huge amounts of personal info (and fingerprints can be faked, so access to innocent people’s fingerprints would be useful). If they can lose medical and credit info, they can lose this.

    And for false positives one need look no further than the TSA no-fly list, which would appear by all evidence to be no more than a first name and a last name, with all similar names being flagged, resulting in everyone from toddlers to war heroes to security personnel being refused access to planes.

    We were safer ten years ago.

  12. #6 Angusm:

    They managed that one pretty well already with Sacco and Vanzetti.

    As we’ve seen with the public photos situation, there are few consequences for authorities breaking laws… though they’ll look pretty stupid if they keep it up.

  13. In the lab where I work, we sometimes run into an analogous problem in doing the calculations to reconstruct evolutionary trees from DNA sequence data. Our postdoc actually had to reduce the number of samples he used in the analysis for a paper we’ve just had accepted.

    All the same, I imagine that even if Homeland Security acknowledged this problem, their solution would be to build bigger computers, or try to upgrade the AI, rather than collect more efficient intelligence.

  14. The problem is the quality of the data and analytic techniques, not the quantity. Bigger, faster computers won’t help sift through garbage. People who collect data rarely give sufficient thought to how it will be used.

    @9: Your view on profiling sounds made-up to me. Reviewing the published literature on the subject shows that broad use (e.g., traffic stops) may not be effective. However, I see no evidence that it is ineffective for highly specific use. This information is extremely helpful when used to predict behavior in all kinds of areas: consumer behavior, health behavior, etc. So the idea that it is helpful in predicting criminal behavior is not unreasonable. Whether those with the data have the capability to use it effectively is an entirely separate issue (and where the true concern should be).

  15. This jibes well with Malcolm Gladwell’s writings in BLINK, where he discussed the fact that in some studies the accuracy of a doctor’s diagnosis DECREASED as more information (via tests) was provided.

    Of course a competent analysis also has to look at the difference in consequences of diagnosis failures, or in a case like this the consequence of not having that information gathering.

    Which will never happen in our OMG MUST TAPE EVERYTHING culture, because looking into that would require confronting the fact that (a) CCTV shows no indication of decreasing crime, merely moving it, and (b) we’d have to look into how rarely it is helpful considering the equipment and labor costs. That doesn’t even confront the lower quality of life (privacy) and opportunity costs of having law enforcement watching crime on the tube rather than in person, where they might do some good.

  16. @ Zikzak:

    The major reason why too much data can be a bad thing is in differentiation. As more data is added to the set, it’s harder to tell one piece from another — there are greater chances that two pieces of data will share more in common, so much so that we won’t be able to reliably tell them apart.

    If you were asked to pick a person out of a 10-man lineup, you would have no difficulty in doing so. Assuming they were picked at random, odds are that each would be different heights, have different hair color, different facial features, etc. Now try to pick that same person out of a 10 billion-man lineup. Sure, you can rule out most people, but you’ll eventually find that there is a point at which all of the remaining candidates look awfully similar — so much so that you can’t reliably tell who is who.

    Introducing enough points into the data set changes it from a discrete set of points to a continuous spectrum. At some point our resolution (our ability to discern between adjacent points) is no longer sufficient. Current methods, such as fingerprints, facial recognition, signature recognition, and many others, have dismally low resolution — especially when automated by computers.

    As a numerical example, consider how reliable your detection mechanism would have to be in order to reach an acceptable level of false positives (note that you can never eliminate false positives entirely). Take, as an example, a recognition process with a 0.001% false-positive rate, much better than most current recognition processes. Against a data set of 100,000 people, that yields an expected one false positive (a 0.001% chance of a false positive at each comparison × 100,000 comparisons), and roughly a 63% chance of at least one. Is this acceptable? What is acceptable? Studies often use 5% as the acceptable chance of falsely confirming their hypothesis. If we accept that as our margin of error for proving an identity (keeping in mind that this means 1 out of every 20 searches could identify the wrong person), we could only do that for a population of 5,000 people. Just how accurate would we have to be to identify a single person out of 10 billion with less than a 5% chance of picking the wrong person? (1 / 10 billion) × 5%, or 0.0000000005%. That’s a very small number, far more accurate than any process we have to date, and that’s still accepting an abysmal 5% chance of error.

    The importance of false-positives is that they mean innocent people are going to be inconvenienced. And I use the word inconvenienced lightly, because that inconvenience could run anywhere from being pulled over for 10 minutes by police, to being questioned at a police station for several hours, to being held for questioning for several days, to being held in a containment camp for several months, to fighting a court battle for over a year, to being imprisoned for 25 years or more. It all depends on just how much faith we put in automated recognition.
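    The arithmetic above can be sketched in a few lines. This is a minimal illustration using the assumed rates from this comment, not real biometric figures:

    ```python
    # Assumed per-comparison false-positive rate and database sizes taken from
    # the comment above -- illustrative numbers, not measured biometric data.

    def p_any_false_positive(p, n):
        """Probability of at least one false positive across n comparisons."""
        return 1 - (1 - p) ** n

    def expected_false_positives(p, n):
        """Expected number of false positives across n comparisons."""
        return n * p

    # A 0.001% (1e-5) per-comparison rate against 100,000 people gives an
    # expected one false match, i.e. roughly a 63% chance of at least one.
    print(expected_false_positives(1e-5, 100_000))
    print(p_any_false_positive(1e-5, 100_000))

    # To keep the chance of any false match below about 5% across 10 billion
    # comparisons, the per-comparison rate must be below roughly 5e-12.
    print(0.05 / 10_000_000_000)
    ```

    The expected count grows linearly with the database, which is why the tolerable per-comparison error rate shrinks as the population grows.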

  17. If you were asked to pick a person out of a 10-man lineup, you would have no difficulty in doing so. […] Now try to pick that same person out of a 10 billion-man lineup.

    Ok, but if we knew the guy we wanted was in the lineup of 10, then we wouldn’t have to bother with a lineup of 10 billion. A real use case doesn’t work like that, though. We don’t have a small pool of people which we magically know contains our target. We don’t start with anything but a “signature” to search on – in your example a physical description – and a database. If that’s the only signature we have, anyone who matches that description is equally likely to be the target. If there’s a lot of them and we can’t tell them apart, we can either refine our query or gather a better signature to help differentiate the “suspects”. Collecting less data won’t make our search any more accurate.

    Unless, of course we could somehow not collect data on people who aren’t our target, and only collect data on people who are our target. The trouble with that should be self-evident.

    We can use the signature we have to create a narrower pool, for example “only people who have criminal records” or “only people who shopped at Bob’s Electronics on Tuesday”, but that still requires a larger pool to draw from. This is accomplished by crafting a good query, which filters out the data that is irrelevant to our search.

    Shrinking the pool of data does accomplish one thing: it makes it more likely that we’ll find one and only one match for our query. But it does not guarantee that the one match we find is actually the target.

    What I’m talking about here is theoretical. Obviously the current surveillance overlords use sloppy queries and very poor signatures. However, this is not a fundamental problem with surveillance, it’s a problem with government incompetence. And while incompetence is bad and should be criticized, the real reason surveillance is such a danger is not because of incompetence, but actual malice on the part of the various institutions who have control over the surveillance apparatus.

    Those institutions would be glad to deflect criticism of the encroaching police state by improving the effectiveness and success rate of their surveillance measures – upgrading hardware, training security personnel better, etc. But this will in fact make us less free, so it seems counterproductive to be complaining about the surveillance being insufficiently sophisticated.

  18. Ok, but if we knew the guy we wanted was in the lineup of 10, then we wouldn’t have to bother with a lineup of 10 billion.

    I believe that the Japanese have now optimized their line-ups to only one suspect. Efficient.

  19. Information theory: if you have enough data, your predictions will be more accurate. 100% observation of someone would eliminate them as a suspect in any crime (unless they could fool the system). If you have 100% data on the weather, you understand exactly what’s happening at the moment, and probably into the next few microseconds. We don’t predict crime via observational statistics, not the way we do weather. I think that if you believe too much data mucks up the system, it’s because someone’s not using it correctly. Too much data? When do we stop science? When do we have too much? Watching people is just social science: data collection that can be used for good or bad. That’s always the choice you have to make, not a choice you have to avoid.

  20. Another example of this effect: I was recently commended by the police for some photos I took of a crime from my big city apartment window. They suggested that I set up a webcam in my window. What the cops didn’t note was that there was already a city surveillance camera on that street. The reason that the cops thought I should set up a webcam was that I was the one who brought the captured images out to them in the field. If more surveillance systems did as I did, walking the images of a crime over to the officers investigating it, that would be much more effective than simply installing more surveillance systems.

  21. If that was in San Francisco, the city cameras are set up not to be observed in real time by anyone. I suspect the city could do better by encouraging the criminals to take pictures of themselves and post them to Myspace.

  22. Read Blink if you don’t agree with Cory’s argument. One chapter talks about a study done in which a case file for a patient was given to a number of psychologists to determine the patient’s problem. The more information on the patient, the further they got from the actual problem. Another chapter talks about a professor who studies married couples and can determine with high accuracy the likelihood of them getting a divorce just by watching a 3 minute video of them in the middle of a debate/argument. He’s looking for a few “red flags.”

    In other words, it’s not the quantity of data you have on someone, nor is it necessarily the quality: it’s the relevance.

  23. has anyone done the calculation as to what percentage of the general American population is the optimum to imprison? I assume all these cameras exist to swell the prison population. How will they know when they have won? You can’t have EVERYONE behind bars, obviously, but what is the best amount to have to sustain the Prison Industry and to provide free labour? There are over 7,000,000 Americans in jail, on parole or on probation. 2,200,000 of these are actually behind bars. One in six(?) for marijuana.

    I really am genuinely curious as to the economics of it all. Is it better and easier to jail a large segment than to find them work? CCTV is just a small part of a vast feeder system to fill up the prisons. I think they need to get properly organized on this.

  24. #24, busydoingnothing:

    Those anecdotes are accurate for the way that a human mind processes information. That’s probably the reason this idea of “too much information makes it harder to find what you’re looking for” seems so intuitive – for human beings, it’s very true.

    It’s not true for computers though, or at least a LOT less true. If I’m searching for a value in a database table, it doesn’t much matter whether the database has 1,000 rows or 100,000 rows. It matters marginally, in terms of disk speed and processor time, but that’s trivially solved by buying more disks or processors.

    Obviously there’s a cutoff point. Trillions of terabytes of data will be significantly slower to datamine than 1 gig, but the point is the cutoff is so high as to be mostly irrelevant. If a large amount of data is making your queries less effective, it’s almost certainly because your data-mining methods need to be improved.
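    A minimal sketch (hypothetical, randomly generated data) of the distinction at issue in this thread: an indexed exact-match lookup barely notices table size, but a fuzzy match, which is the realistic surveillance case, turns up more candidates to investigate as the table grows:

    ```python
    # Hypothetical name "database" -- random 6-letter strings, not real data.
    import random
    import string

    random.seed(0)

    def random_name():
        return "".join(random.choices(string.ascii_lowercase, k=6))

    small = {random_name(): i for i in range(1_000)}
    large = dict(small)
    large.update({random_name(): i for i in range(1_000, 100_000)})

    # Exact lookup: a single hash probe either way, regardless of table size.
    key = next(iter(small))
    assert key in small and key in large

    # Fuzzy lookup: "anyone whose name starts with the same two letters".
    # The bigger table yields more near-miss candidates for the same query.
    def fuzzy_matches(table, prefix):
        return [name for name in table if name.startswith(prefix)]

    print(len(fuzzy_matches(small, key[:2])))
    print(len(fuzzy_matches(large, key[:2])))
    ```

    Both sides of the thread are right about different operations: growing the database is cheap for exact queries, but any fuzzy or probabilistic query inherits the false-match arithmetic discussed earlier in the thread.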

  25. Nothing wrong with having too much data. After you have selected whom you wish to convict, you can search all the files by their name and image until you have the necessary “justification”.

  26. Zikzak, do you work for a computer manufacturer? Your solution to a problem you insist does not exist seems to be to buy more hardware.

  27. Takuan, as long as a society continues to push its childhood into ever-increasing decades, the need to be babysat seems to be an obvious requirement. Humans do not require a sense of privacy when in public-space. If that sense is required then that’s an issue to deal with while in therapy. The hive mind requires less privacy, not more.

  28. zikzak, that’s all well and good, but computers aren’t to a point where they can mimic human nature 1:1 in realtime. Until that day, it’s still up to human beings to make decisions as to what data is relevant and what isn’t, and who to pursue and who not to.

    Humans don’t operate in binary. We’re not black and white.

  29. Ken, it depends on what crimes you have on the statute books and how loosely defined they are. See, for example, Mugabe’s locking-up of opposition leaders. Now, you don’t need a computerised surveillance society to do that, but such a society is one of the pathways that make slipping into tyranny harder to avoid.

    As with Zikzak’s analysis, it all breaks down at the human point of contact. All we need is a society of infallible human beings, and everything will be okay. But we aren’t infallible, and neither are those who we appoint to take responsibility for our governance. Engineering better hardware is the trivial issue; engineering a better society is the challenge.

  30. This is completely incorrect. Having more data is *always* a good thing. With appropriate theory and sufficient data we can make much better models. Parsimony, not starvation, is the key.

    As security expert Bruce Schneier has said, when you’re looking for a needle in a haystack, it doesn’t help to add more hay.

    This is also complete balls. It sounds snappy and must be the kind of thing news outlets would lap up. Now OBVIOUSLY if we’re blindly groping around for a positive, then adding more negatives will reduce our conditional probability of success. But we don’t get to pick how much hay is in the haystack. We collect some data, see if it supports our ideas, and then adjust both accordingly. But if less data was always enough, we’d just look at gender and be done with it.

  31. We collect some data, see if it supports our ideas

    Or we could try analyzing the data without a preconceived notion about the results. Nah.

  32. Prosecuting people costs money. One doesn’t expend that kind of resource without a good reason.

Comments are closed.