NSA and GCHQ's crappy Big Data techniques may be killing thousands of innocents

Researchers have taken a second look at the NSA SKYNET leaks, as well as the GCHQ data-mining problem book first published on Boing Boing, and concluded that the spy agencies have made elementary errors in their machine-learning techniques, which are used to identify candidates for remote assassination by drone.

These errors reveal the fundamental problem with secret science: that scientists will forgive their own corner-cutting and sloppiness when they know no one will ever check their work.

At root is the lack of good training data to use to establish "ground truths" for the data-mining technology. The techniques documented in the leaks show the researchers taking shortcuts to get around this lack: rather than holding back some known-terrorist profiles to test their models, they run the training data back through the system and score the model on the same examples it learned from. This is an absolute no-no in machine learning, because a model will always look better on data it has already seen.
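
To see why scoring a model on its own training data is misleading, here is a minimal sketch in Python. It is not the NSA's pipeline; the data is synthetic noise, and the 1-nearest-neighbour classifier and every name in it are assumptions chosen only to illustrate the pitfall. Because the labels are coin flips, nothing can be learned, yet the model looks perfect when tested on the data it memorised.

```python
import random

def nearest_label(point, examples):
    """1-nearest-neighbour: return the label of the closest stored example."""
    closest = min(
        examples,
        key=lambda ex: (ex[0][0] - point[0]) ** 2 + (ex[0][1] - point[1]) ** 2,
    )
    return closest[1]

def accuracy(train, evaluation):
    """Fraction of evaluation examples whose label the model predicts correctly."""
    hits = sum(
        1 for features, label in evaluation
        if nearest_label(features, train) == label
    )
    return hits / len(evaluation)

rng = random.Random(0)

# Synthetic data (an assumption for illustration): features are pure noise
# and labels are coin flips, so there is no real pattern to learn.
data = [((rng.random(), rng.random()), rng.randint(0, 1)) for _ in range(400)]
train, held_out = data[:200], data[200:]

# Scoring on the training set: each point's nearest neighbour is itself.
resubstitution_accuracy = accuracy(train, train)

# Scoring on examples the model never saw: close to a coin flip.
held_out_accuracy = accuracy(train, held_out)

print("tested on training data:", resubstitution_accuracy)
print("tested on held-out data:", held_out_accuracy)
```

The first number comes out as a perfect score and the second hovers around chance, which is the gap a held-back test set exists to expose.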

The system is geared to prefer false positives (innocent people who get killed) over false negatives (guilty people who go free). But the NSA's assumptions about how often these false positives occur depend on the training and modelling being done correctly. Because of the shortcuts in the model generation, the real false-positive rate will be much higher than the reported one.

Even if the system is tuned so that 50 percent of the actual "terrorists" slip through as false negatives and survive, the NSA's false positive rate of 0.18 percent would still mean thousands of innocents misclassified as "terrorists" and potentially killed. Even the NSA's most optimistic result, the 0.008 percent false positive rate, would still result in many innocent people dying.

"On the slide with the false positive rates, note the final line that says '+ Anchory Selectors,'" Danezis told Ars. "This is key, and the figures are unreported… if you apply a classifier with a false-positive rate of 0.18 percent to a population of 55 million you are indeed likely to kill thousands of innocent people. [0.18 percent of 55 million = 99,000]. If however you apply it to a population where you already expect a very high prevalence of 'terrorism'—because for example they are in the two-hop neighbourhood of a number of people of interest—then the prior goes up and you will kill fewer innocent people."
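
The arithmetic in that quote is easy to check. The sketch below uses the 55 million population and the two false-positive rates quoted in this post; the function name is hypothetical, and it assumes that essentially everyone in the population is innocent, so the rate applies to the whole 55 million.

```python
from fractions import Fraction

POPULATION = 55_000_000  # population figure from the quote above

def expected_false_positives(population, false_positive_rate_percent):
    """Expected number of innocent people flagged, given a rate in percent.

    Assumes (for illustration) that nearly the whole population is innocent.
    Fraction keeps the arithmetic exact instead of relying on floats.
    """
    return int(population * Fraction(false_positive_rate_percent) / 100)

headline_rate = expected_false_positives(POPULATION, "0.18")
optimistic_rate = expected_false_positives(POPULATION, "0.008")

print(headline_rate)    # 99000, matching the bracketed figure in the quote
print(optimistic_rate)  # 4400
```

Even the most optimistic rate still flags several thousand people in a population of that size.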

Besides the obvious objection of how many innocent people it is ever acceptable to kill, this also assumes there are a lot of terrorists to identify. "We know that the 'true terrorist' proportion of the full population is very small," Ball pointed out. "As Cory [Doctorow] says, if this were not true, we would all be dead already. Therefore a small false positive rate will lead to misidentification of lots of people as terrorists."
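
The base-rate point both researchers are making follows from Bayes' rule, and a rough sketch makes it concrete. The 0.18 percent false-positive rate and the 50 percent miss rate come from this post; the two prevalence values are hypothetical numbers picked purely for illustration, not figures from the leaks.

```python
def share_of_flagged_who_are_targets(prevalence, true_positive_rate, false_positive_rate):
    """Bayes' rule: of everyone the classifier flags, what fraction are real targets?"""
    true_positives = prevalence * true_positive_rate
    false_positives = (1 - prevalence) * false_positive_rate
    return true_positives / (true_positives + false_positives)

FALSE_POSITIVE_RATE = 0.0018  # 0.18 percent, from the post
TRUE_POSITIVE_RATE = 0.5      # i.e. 50 percent of real targets are missed

# Hypothetical prevalences, assumed only for this example:
whole_population = 1 / 100_000  # real targets are extremely rare
narrowed_group = 1 / 10         # a pre-filtered group with a much higher prior

rare_case = share_of_flagged_who_are_targets(
    whole_population, TRUE_POSITIVE_RATE, FALSE_POSITIVE_RATE
)
filtered_case = share_of_flagged_who_are_targets(
    narrowed_group, TRUE_POSITIVE_RATE, FALSE_POSITIVE_RATE
)

print(f"whole population: {rare_case:.2%} of flagged people are real targets")
print(f"narrowed group:   {filtered_case:.2%} of flagged people are real targets")
```

Under these assumed numbers, well under 1 percent of the people flagged in the whole population would be real targets, while in the narrowed group the share rises above 90 percent. That is why the unreported prior matters so much, and why a "small" false-positive rate is not small when the thing being looked for is rare.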

"The larger point," Ball added, "is that the model will totally overlook 'true terrorists' who are statistically different from the 'true terrorists' used to train the model."

The NSA's SKYNET program may be killing thousands of innocent people
[Christian Grothoff & J.M. Porup/Ars Technica]