Ontario police's Big Data assigns secret guilt to people looking for jobs, crossing borders


There are no effective legal limits on when and to whom police can disclose unproven charges against you, 911 calls involving mental-health incidents, and similar sensitive, prejudicial information. People have been denied employment, been turned back at the US border, and suffered many other harms because Ontario cops send this stuff far and wide.


Microsoft says it won't use contents of emails to target ads

Alan sez, "Microsoft is pushing out an update to its privacy policies."


Big Data should not be a faith-based initiative

Cory Doctorow summarizes the problem with the idea that sensitive personal information can be removed responsibly from big data: computer scientists are pretty sure that’s impossible.


IRS won't fix database of nonprofits, so it goes dark


Rogue archivist Carl Malamud writes, "Due to inaction by the Internal Revenue Service and the U.S. Congress, Public.Resource.Org has been forced to terminate access to 7,634,050 filings of nonprofit organizations. The problem is that we have been fixing the database, providing better access mechanisms and finding and redacting huge numbers of Social Security Numbers. Our peers such as GuideStar are also fixing their copies of the database."


Inherent biases warp Big Data


The theory of Big Data is that the numbers have an objective property that makes their revealed truth especially valuable; but as Kate Crawford points out, Big Data has inherent, lurking bias, because the datasets are the creation of fallible, biased humans. For example, the data points on how people reacted to Hurricane Sandy mostly emanate from Manhattan, because that's where the highest concentration of people wealthy enough to own tweeting, data-emanating smartphones is. But more severely affected locations -- Breezy Point, Coney Island and Rockaway -- produced almost no data, because they had fewer smartphones per capita, and the ones they had didn't work because their power and cellular networks failed first.
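Crawford's point can be sketched as a tiny simulation. All figures below are invented for illustration; the point is only that raw tweet volume ranks the least-damaged area first when phone ownership and network outages are uneven:

```python
# Toy illustration (hypothetical numbers): raw tweet volume as a biased
# proxy for storm damage when smartphone ownership is uneven.
areas = {
    # name: (population, smartphone_rate, true_damage_index)
    "Manhattan":    (1_600_000, 0.60, 2),
    "Breezy Point": (4_000,     0.20, 9),
    "Rockaway":     (130_000,   0.25, 8),
}

# Hardest-hit areas also lost power and cell service first.
outages = {"Manhattan": 0.1, "Breezy Point": 0.9, "Rockaway": 0.8}

def tweet_volume(pop, rate, damage, outage):
    # Tweets scale with people who own phones AND have working networks.
    return pop * rate * damage * (1 - outage)

signal = {name: tweet_volume(p, r, d, outages[name])
          for name, (p, r, d) in areas.items()}

# Ranking by raw signal puts Manhattan first, even though its true
# damage index is the lowest of the three.
print(max(signal, key=signal.get))
```

The "damage map" the data draws is really a map of smartphone penetration and surviving infrastructure.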

I wrote about this in 2012, when Google switched strategies for describing the way it arrived at its search-ranking. Prior to that, the company had described its ranking process as a mathematical one and told people who didn't like how they got ranked that the problem was their own, because the numbers didn't lie. After governments took this argument to heart and started ordering Google to change its search results -- on the grounds that there's no free speech question if you're just ordering post-processing on the outcome of an equation -- Google started commissioning law review articles explaining that the algorithms that determined search-rank were the outcome of an expressive, human, editorial process that deserved free speech protection.


Anti-Net Neutrality Congresscritters made serious bank from the cable companies


The Congressmen who sent letters to the FCC condemning Net Neutrality received 2.3 times more campaign contributions from the cable industry than the Congressional average. The analysis, conducted with Maplight's Congressional transparency tools, shows that Dems are cheaper to bribe than Republicans (GOP members received 5x the Congressional average from Big Cable; Dems only 1.2x) and shows what the chairmanship of a powerful committee is worth: Rep. Greg Walden (R-Ore.), who chairs the FCC-overseeing Subcommittee on Communications and Technology, got $109,250 (the average Congresscritter got $11,651).

29 Congresscritters own stock in Comcast, and Comcast is the 25th most-held stock in Congress.
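The chairmanship premium quoted above can be checked with quick arithmetic, using only the two figures in the Maplight numbers:

```python
# Figures from the Maplight analysis quoted above.
avg_member = 11_651    # average cable-industry money per member of Congress
walden = 109_250       # Rep. Greg Walden, subcommittee chair

multiple = walden / avg_member
print(f"{multiple:.1f}x the Congressional average")  # prints "9.4x the Congressional average"
```

So the chair of the overseeing subcommittee took in roughly four times even the letter-signers' elevated 2.3x rate.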


EFF on the White House's Big Data report: what about privacy and surveillance?

Last week, I wrote about danah boyd's analysis of the White House's Big Data report [PDF]. Now, the Electronic Frontier Foundation has added its analysis to the discussion. EFF finds much to like about the report, but raises two very important points:

* The report assumes that you won't be able to opt out of leaving behind personal information and implicitly dismisses the value of privacy tools like ad blockers, Do Not Track, Tor, etc

* The report is strangely silent on the relationship between Big Data and mass surveillance, except to the extent that it equates whistleblowers like Chelsea Manning and Edward Snowden with the Fort Hood shooter, lumping them all in as "internal threats"


Big Data analysis from the White House: understanding the debate


danah boyd, founder of the critical Big Data think/do tank Data & Society, writes about the work she did with the White House on Big Data: Seizing Opportunities, Preserving Values [PDF]. Boyd and her team convened a conference called The Social, Cultural & Ethical Dimensions of "Big Data" (read the proceedings here), and fed the conclusions from that event back to the White House for its report.

In boyd's view, the White House team did good work in teasing out the hard questions about public benefit and personal costs of Big Data initiatives, and made solid recommendations for future privacy-oriented protections. Boyd points to this Alistair Croll quote as getting at the heart of one of Big Data's least-understood problems:

Perhaps the biggest threat that a data-driven world presents is an ethical one. Our social safety net is woven on uncertainty. We have welfare, insurance, and other institutions precisely because we can’t tell what’s going to happen — so we amortize that risk across shared resources. The better we are at predicting the future, the less we’ll be willing to share our fates with others.


Can you really opt out of Big Data?


Janet Vertesi, assistant professor of sociology at Princeton University, had heard many people apologize for commercial online surveillance by saying that people who didn't want to give their data away should just not give their data away -- they should opt out. So when she got pregnant, she and her husband decided to keep the fact secret from marketing companies (but not their friends and family). She quickly discovered that this was nearly impossible, even while she used Tor, ad blockers, and cash-purchased Amazon cards that paid for baby-stuff shipped to anonymous PO boxes.


Hipsterbait1: algorithmically generated post-ironic tees


Shardcore writes, "I've built a new bot to troll/delight hipsters. It algorithmically creates post-post-ironic t-shirt designs, posts them on twitter and tumblr and offers them for sale. No human is involved in the process at all."


Big Data has big problems


Writing in the Financial Times, Tim Harford (The Undercover Economist Strikes Back, Adapt, etc.) offers a nuanced but ultimately damning critique of Big Data and its promises. Harford's point is that Big Data's premise is that sampling bias can be overcome by simply sampling everything; but the actual datasets that make up Big Data are anything but comprehensive, and are even more prone to the statistical errors that haunt regular analytic science.

What's more, much of Big Data is "theory free" -- the correlation is observable and repeatable, so it is assumed to be real, even if you don't know why it exists -- but theory-free conclusions are brittle: "If you have no idea what is behind a correlation, you have no idea what might cause that correlation to break down." Harford builds on recent critiques of Google Flu (the poster child for Big Data) and goes further. This is your must-read for today.
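Harford's brittleness argument can be made concrete with a toy "theory-free" predictor in the Google Flu mould. Everything here is invented for illustration: the model learns a searches-to-flu ratio that holds only while media coverage happens to track actual flu, then breaks when coverage and illness decouple:

```python
# A toy "theory-free" predictor (hypothetical numbers): it fits a single
# searches->flu ratio without any model of WHY searches track flu.

def searches(flu, media):
    # Search volume reflects sick people searching AND media-driven curiosity.
    return 1.0 * flu + 2.0 * media

# Training period: media coverage mirrors real flu levels, so the raw
# correlation looks rock-solid.
train = [(f, searches(f, media=f)) for f in range(1, 11)]
ratio = sum(f for f, _ in train) / sum(s for _, s in train)

# Later: a flu scare drives media coverage with no change in real cases
# (roughly what derailed Google Flu Trends). Same true flu level:
flu_true = 5
flu_predicted = ratio * searches(flu_true, media=20)

print(flu_true, round(flu_predicted, 1))  # prints "5 15.0" -- a 3x overshoot
```

Because the model never separated "sick people searching" from "curious people searching", there was no way to see the breakdown coming — which is exactly Harford's point about theory-free correlations.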


Big Data Hubris: Google Flu versus reality

In The Parable of Google Flu: Traps in Big Data Analysis [PDF], published in Science, researchers try to understand why Google Flu (which uses search history to predict flu outbreaks) performed so well at first but has done poorly since. One culprit: people don't know what the flu is, so a search for "flu" doesn't necessarily mean the searcher has it. More telling, though, is that Google can't let outsiders see its data or replicate its findings, meaning it can't get the critical review that might help it spot problems before years of failure. (via Hacker News)

Full NHS hospital records uploaded to Google servers, "infinitely worse" story to come

PA Consulting, a management consulting firm, obtained the entire English and Welsh Hospital Episode Statistics database and uploaded it to Google's BigQuery service. The stats filled 27 DVDs and took "a couple of weeks" to transfer to Google's service, which is hosted in non-EU data centres. This is spectacularly illegal. The NHS dataset includes each patient's NHS number, post code, address, date of birth and gender, as well as all their inpatient, outpatient and emergency hospital records. Google's BigQuery service allows for full data-set sharing with one click.

The news of the breach comes after the collapse of a scheme under which the NHS would sell patient records to pharma companies, insurers and others (there was no easy way to opt out of the scheme, until members of the public created the independent Fax Your GP service).

According to researcher and epidemiologist Ben Goldacre, this story is just the beginning: there's an "infinitely worse" story that is coming shortly.


Weinberger's "Too Big to Know" in paperback

David Weinberger's 2012 book Too Big to Know: Rethinking Knowledge Now That the Facts Aren't the Facts, Experts Are Everywhere, and the Smartest Person in the Room Is the Room was one of the smartest, most thought-provoking reads I had the pleasure of being buffeted by in 2012. I'm delighted to learn that it's out in paperback this month. Here's my original review from 2012:

David Weinberger is one of the Internet's clearest and cleverest thinkers, an understated and deceptively calm philosopher who builds his arguments like a bricklayer builds a wall, one fact at a time. In books like Everything is Miscellaneous and Small Pieces Loosely Joined, he erects solid edifices with no gaps between the bricks, inviting conclusions that are often difficult to reconcile with your pre-existing prejudices, but which are even harder to deny.

Too Big to Know, Weinberger's latest book-length argument, is another of these surprising brick walls. Weinberger presents us with a long, fascinating account of how knowledge itself changes in the age of the Internet -- what it means to know something when there are millions and billions of "things" at your fingertips, when everyone who might disagree with you can find and rebut your assertions, and when the ability to be heard isn't tightly bound to your credentials or public reputation for expertise.


Chicago PD's Big Data: using pseudoscience to justify racial profiling


The Chicago Police Department has ramped up the use of its "predictive analysis" system to identify people it believes are likely to commit crimes. These people, who are placed on a "heat list," are visited by police officers who tell them that they are considered pre-criminals by CPD, and are warned that if they do commit any crimes, they are likely to be caught.

The CPD defends the practice, and its technical champion, Miles Wernick of the Illinois Institute of Technology, characterizes it as a neutral, data-driven system for preventing crime in a city that has struggled with street violence and other forms of crime. Wernick's approach involves searching the data for "abnormal" patterns that correlate with crime. He compares it to epidemiological methods, arguing that people whose social networks include violence are themselves more likely to commit violence.

The CPD refuses to share the names of the people on its secret watchlist, and it will not disclose the algorithm that put them there.

This is a terrible way of running a criminal justice system.
