Boing Boing 

Social graph of mysterious twitterbots


Terence Eden has mined the social graphs of thousands of mysterious, spammy twitterbots, which may or may not be the same larval spambots I wrote about.

Read the rest

Imaginary ISIS attack on Louisiana and the twitterbots who loved it


Gilad Lotan has spotted some pretty sophisticated fake-news generation, possibly from Russia, and possibly related to my weird, larval twitterbots, aimed at convincing you that ISIS had blown up a Louisiana chemical factory.

Read the rest

I Think You'll Find It's a Bit More Complicated Than That

Over the past decade, pharma-fighting Dr Ben Goldacre has written more than 500,000 words of fearlessly combative science journalism.

Read the rest

Thousands of Americans got sub-broadband ISP service, thanks to telcoms shenanigans


Measurement Lab, an open, independent analysis organization devoted to measuring the quality of Internet connections and detecting censorship, technical faults and network neutrality violations, has released a major new report on how ISPs connect to one another, and it's not pretty.

Read the rest

Wouldn't it be great if a billboard could actually read your mind?

Said no one, ever. Except, apparently not: the "data scientists" of Posterscope are excited that EE -- a joint venture of T-Mobile and Orange -- will spy on all their users' mobile data to "give profound insights...that were never possible before"

Read the rest

Mercilessly pricking the bubbles of AI, Big Data, machine learning


Michael I Jordan is an extremely accomplished computer scientist who is also deeply skeptical of claims made by Big Data advocates as well as people who believe that machine intelligence, AI and machine vision are solved, or nearly so.

Read the rest

Ontario police's Big Data assigns secret guilt to people looking for jobs, crossing borders


There are no effective legal limits on when and to whom police can disclose unproven charges against you, 911 calls involving mental health incidents, and similar sensitive and prejudicial information; people have been denied employment, been turned back at the US border and suffered many other harms because Ontario cops send this stuff far and wide.

Read the rest

Microsoft says it won't use contents of emails to target ads

Alan sez, "Microsoft is pushing out an update to its privacy policies."

Read the rest

Big Data should not be a faith-based initiative

Cory Doctorow summarizes the problem with the idea that sensitive personal information can be removed responsibly from big data: computer scientists are pretty sure that's impossible.

Read the rest

IRS won't fix database of nonprofits, so it goes dark


Rogue archivist Carl Malamud writes, "Due to inaction by the Internal Revenue Service and the U.S. Congress, Public.Resource.Org has been forced to terminate access to 7,634,050 filings of nonprofit organizations. The problem is that we have been fixing the database, providing better access mechanisms and finding and redacting huge numbers of Social Security Numbers. Our peers such as GuideStar are also fixing their copies of the database."

Read the rest

Inherent biases warp Big Data


The theory of Big Data is that the numbers have an objective property that makes their revealed truth especially valuable; but as Kate Crawford points out, Big Data has inherent, lurking bias, because the datasets are the creation of fallible, biased humans. For example, the data-points on how people reacted to Hurricane Sandy mostly emanate from Manhattan, because that's where the highest concentration of people wealthy enough to own tweeting, data-emanating smartphones is. But more severely affected locations -- Breezy Point, Coney Island and Rockaway -- produced almost no data because they had fewer smartphones per capita, and the ones they had didn't work because their power and cellular networks failed first.

I wrote about this in 2012, when Google switched strategies for describing the way it arrived at its search-ranking. Prior to that, the company had described its ranking process as a mathematical one and told people who didn't like how they got ranked that the problem was their own, because the numbers didn't lie. After governments took this argument to heart and started ordering Google to change its search results -- on the grounds that there's no free speech question if you're just ordering post-processing on the outcome of an equation -- Google started commissioning law review articles explaining that the algorithms that determined search-rank were the outcome of an expressive, human, editorial process that deserved free speech protection.

Read the rest

Anti-Net Neutrality Congresscritters made serious bank from the cable companies


The Congressmen who sent letters to the FCC condemning Net Neutrality received 2.3 times more campaign contributions from the cable industry than average. The analysis, conducted with Maplight's Congressional transparency tools, shows that Dems are cheaper to bribe than Republicans (GOP members received 5x the Congressional average from Big Cable; Dems only 1.2x) and shows what a chairmanship of a powerful committee is worth: Rep. Greg Walden (R-Ore.), who chairs the FCC-overseeing Subcommittee on Communications and Technology, got $109,250 (the average congresscritter got $11,651).
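To put the chairmanship figure on the same footing as the other multiples, here is a quick back-of-the-envelope calculation. It uses only the two dollar amounts quoted above; the variable names are mine, and this is not part of Maplight's analysis.

```python
# Dollar figures as quoted in the post; variable names are illustrative.
walden_total = 109_250   # cable-industry contributions to Rep. Greg Walden
average_total = 11_651   # cable-industry contributions to the average member

# Express the chairman's haul as a multiple of the average.
ratio = walden_total / average_total
print(f"Walden received about {ratio:.1f}x the Congressional average")
```

That multiple is a good deal higher than the 5x quoted for GOP members as a group.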

29 Congresscritters own stock in Comcast, and Comcast is the 25th most-held stock in Congress.

Read the rest

EFF on the White House's Big Data report: what about privacy and surveillance?

Last week, I wrote about danah boyd's analysis of the White House's Big Data report [PDF]. Now, the Electronic Frontier Foundation has added its analysis to the discussion. EFF finds much to like about the report, but raises two very important points:

* The report assumes that you won't be able to opt out of leaving behind personal information and implicitly dismisses the value of privacy tools like ad blockers, Do Not Track, Tor, etc

* The report is strangely silent on the relationship between Big Data and mass surveillance, except to the extent that it equates whistleblowers like Chelsea Manning and Edward Snowden with the Fort Hood shooter, lumping them all in as "internal threats"

Read the rest

Big Data analysis from the White House: understanding the debate


Danah boyd, founder of the critical Big Data think/do tank Data and Society, writes about the work she did with the White House on Big Data: Seizing Opportunities, Preserving Values [PDF]. Boyd and her team convened a conference called The Social, Cultural & Ethical Dimensions of "Big Data" (read the proceedings here), and fed the conclusions from that event back to the White House for its report.

In boyd's view, the White House team did good work in teasing out the hard questions about public benefit and personal costs of Big Data initiatives, and made solid recommendations for future privacy-oriented protections. Boyd points to this Alistair Croll quote as getting at the heart of one of Big Data's least-understood problems:

Perhaps the biggest threat that a data-driven world presents is an ethical one. Our social safety net is woven on uncertainty. We have welfare, insurance, and other institutions precisely because we can’t tell what’s going to happen — so we amortize that risk across shared resources. The better we are at predicting the future, the less we’ll be willing to share our fates with others.

Read the rest

Can you really opt out of Big Data?


Janet Vertesi, assistant professor of sociology at Princeton University, had heard many people apologize for commercial online surveillance by saying that people who didn't want to give their data away should just not give their data away -- they should opt out. So when she got pregnant, she and her husband decided to keep the fact secret from marketing companies (but not their friends and family). She quickly discovered that this was nearly impossible, even while she used Tor, ad blockers, and cash-purchased Amazon cards that paid for baby-stuff shipped to anonymous PO boxes.

Read the rest

Hipsterbait1: algorithmically generated post-ironic tees


Shardcore writes, "I've built a new bot to troll/delight hipsters. It algorithmically creates post-post-ironic t-shirt designs, posts them on twitter and tumblr and offers them for sale. No human is involved in the process at all."

Read the rest

Big Data has big problems


Writing in the Financial Times, Tim Harford (The Undercover Economist Strikes Back, Adapt, etc) offers a nuanced, but ultimately damning critique of Big Data and its promises. Harford's point is that Big Data's premise is that sampling bias can be overcome by simply sampling everything, but the actual data-sets that make up Big Data are anything but comprehensive, and are even more prone to the statistical errors that haunt regular analytic science.

What's more, much of Big Data is "theory free" -- the correlation is observable and repeatable, so it is assumed to be real, even if you don't know why it exists -- but theory-free conclusions are brittle: "If you have no idea what is behind a correlation, you have no idea what might cause that correlation to break down." Harford builds on recent critiques of Google Flu (the poster child for Big Data) and goes further. This is your must-read for today.

Read the rest

Big Data Hubris: Google Flu versus reality

In The Parable of Google Flu: Traps in Big Data Analysis [PDF], published in Science, researchers try to understand why Google Flu (which uses search history to predict flu outbreaks) performed so well at first but has not done well since. One culprit: people don't know what the flu is, so their search for "flu" doesn't necessarily mean they have flu. More telling, though, is that Google can't let outsiders see their data or replicate their findings, meaning that they can't get the critical review that might help them spot problems before years of failure. (via Hacker News)

Full NHS hospital records uploaded to Google servers, "infinitely worse" story to come

PA Consulting, a management consulting firm, obtained the entire English and Welsh hospital episode statistics database and uploaded it to Google's BigQuery service. The stats filled 27 DVDs and took "a couple of weeks" to transfer to Google's service, which is hosted in non-EU data centres. This is spectacularly illegal. The NHS dataset includes each patient's NHS number, post code, address, date of birth and gender, as well as all their inpatient, outpatient and emergency hospital records. Google's BigQuery service allows for full data-set sharing with one click.

The news of the breach comes after the collapse of a scheme under which the NHS would sell patient records to pharma companies, insurers and others (there was no easy way to opt out of the scheme, until members of the public created the independent Fax Your GP service).

According to researcher and epidemiologist Ben Goldacre, this story is just the beginning: there's an "infinitely worse" story that is coming shortly.

Read the rest

Weinberger's "Too Big to Know" in paperback

David Weinberger's 2012 book Too Big to Know: Rethinking Knowledge Now That the Facts Aren't the Facts, Experts Are Everywhere, and the Smartest Person in the Room Is the Room was one of the smartest, most thought-provoking reads I had the pleasure of being buffeted by in 2012. I'm delighted to learn that it's out in paperback this month. Here's my original review from 2012:

David Weinberger is one of the Internet's clearest and cleverest thinkers, an understated and deceptively calm philosopher who builds his arguments like a bricklayer builds a wall, one fact at a time. In books like Everything is Miscellaneous and Small Pieces, Loosely Joined, he erects solid edifices with no gaps between the bricks, inviting conclusions that are often difficult to reconcile with your pre-existing prejudices, but which are even harder to deny.

Too Big to Know, Weinberger's latest book-length argument, is another of these surprising brick walls. Weinberger presents us with a long, fascinating account of how knowledge itself changes in the age of the Internet -- what it means to know something when there are millions and billions of "things" at your fingertips, when everyone who might disagree with you can find and rebut your assertions, and when the ability to be heard isn't tightly bound to your credentials or public reputation for expertise.

Read the rest

Chicago PD's Big Data: using pseudoscience to justify racial profiling


The Chicago Police Department has ramped up the use of its "predictive analysis" system to identify people it believes are likely to commit crimes. These people, who are placed on a "heat list," are visited by police officers who tell them that they are considered pre-criminals by CPD, and are warned that if they do commit any crimes, they are likely to be caught.

The CPD defends the practice, and its technical champion, Miles Wernick from the Illinois Institute of Technology, characterizes it as a neutral, data-driven system for preventing crime in a city that has struggled with street violence and other forms of crime. Wernick's approach involves searching through the data for "abnormal" patterns that correlate with crime. He compares it with epidemiological approaches, stating that people whose social networks have violence within them are also likely to commit violence.

The CPD refuses to share the names of the people on its secret watchlist, nor will it disclose the algorithm that put them there.

This is a terrible way of running a criminal justice system.

Read the rest

Comic explains problems with Oakland's Domain Awareness Center surveillance plan


Hugh sez, "What's wrong with Oakland's proposed Domain Awareness Center? This new comic by Susie Cagle lays out the issues."

The Testing Ground for the New Surveillance (Thanks, Hugh!)

Fax Your GP: quick opt-out from insane NHS plan to sell your medical records


The UK National Health Service has initiated a plan to take the nation's private health records and sell them off to private companies in a process overseen by notorious multinational bumblewads ATOS. If you live in England, your records -- mental health records, prescriptions, records of surgeries including abortions, and other sensitive personal information -- will be handed over to a wide-ranging group of companies all over the world.

Unless you opt out. And opting out isn't easy. There's no central place to opt out. Instead, you have to send a letter to your GP's surgery, which means you have to look up your GP's surgery's address, compose a legally sufficient letter, print it out, find an envelope and a stamp -- etc.

However! There's a better way. A group of volunteers whom I trust implicitly, including the astounding Stef Magdalinski (who made the Faxyourmp service that is the ancestor of Theyworkforyou) have created Fax Your GP, a dead-simple form that will look up your GP's fax number for you, create a form opt-out letter you can fill in in just a few easy steps, and then they'll fax that letter directly to your GP's surgery. I just opted out.

Read the rest

UK set to sell sensitive NHS records to commercial companies with no meaningful privacy protections - UPDATED

The UK government's Health and Social Care Information Centre quietly announced plans to share all patient records held by the National Health Service with private companies, from insurers to pharmaceutical companies. The information sharing is on an opt-out basis, so if you don't want your "clinical records, mental health consultations, drug addiction rehabilitation details, sexual health clinic attendance and abortion procedures" shared, along with your "GP records, NHS numbers, post-codes, gender, date of birth," you need to contact your doctor and opt out of the process.

This is a complex issue. Large data-sets are the lifeblood of epidemiology and evidence-based care and policy, and the desire to extract useful health information from this data is a legitimate one.

However, it's clear that no one involved in the process gives a damn about privacy. These data-sets -- which will be sold on the open market to commercial operators -- are "anonymized" and "pseudonymized" through processes that don't work, have never worked, and are well-documented to be without any basis in reality.

And that's the thing that brings the whole enterprise out of the realm of legitimate scientific project and into the realm of corporatist hucksterism. Once the architects of this project announced that its privacy protections would be based on junk science, they lost any claim they had to operating in good faith.

Effectively, the managers of this programme have said, "We can't figure out how to protect the most private, potentially damaging facts of your life, so we're not going to try." It is pure cynicism, and it makes me furious. It brings the whole field of evidence-based medicine into disrepute. It is a scandal. And as it goes ahead, it will spectacularly destroy the lives of random people in the UK through the involuntary, totally foreseeable disclosure of health information, in ways that make the general public leery of any participation in this kind of inquiry.

If you set about to discredit the open data movement, you could do no better than this.


Update: As if that wasn't bad enough, Noemi adds, "The contract for handling and managing the care data has been given to ATOS. This is the same company whose disability benefit assessment has been found to be flawed and unacceptable in 40% of cases by the Audit Commission." Here's more.

Read the rest

Officemax sends junkmail addressed to "Daughter Killed In Car Crash"


Officemax sent junkmail to Mike Seay at his address in Lindenhurst, IL, with the notation "Daughter Killed In Car Crash" under his name. Seay's 17 year old daughter was killed in a crash last year. Officemax says it bought Seay's name from a marketing company, and implies that the company had made the notation in its list. It's not clear what marketing purpose this information was intended for (is there a sub-list for "bereaved parents" that's rented out to grief counselors looking for business?) or whether this was a one-off in a data-entry department.

Seay is understandably very upset. The Officemax call-center person he spoke to refused to believe him, as did an official spokesdroid. He's seeking an apology from Officemax's CEO.

Read the rest

Judge rules that NSA metadata surveillance is constitutional

U.S. District Judge William Pauley of New York, a Clinton appointee, has ruled (PDF) that the bulk-collection of metadata by the NSA and the phone companies is Constitutional. He called it a "vital tool" for fighting terrorism, and pooh-poohed claims that it was invasive, in part because people "voluntarily" give their data to large corporations. The suit was brought by the ACLU, and was dismissed by Pauley at government request. The ACLU will appeal.

Earlier this month, a different federal judge ruled that NSA spying was illegal. It was likely from the start that that case would go to the Supreme Court, but that likelihood just shot up now that there's a circuit split brewing among the federal courts.

Judge Pauley's ruling advanced the theory that mass spying detects "relationships so attenuated and ephemeral they would otherwise escape notice," though there's no evidence that this "attenuated relationship detection" leads to any useful counterterrorism -- and there's an abundance of evidence that it generates thousands and thousands of false positives: people judged guilty by a secret and unaccountable algorithm.

Pauley has subscribed to the NSA's Greater Manure Pile theory of crimefighting ("If the pile of manure is big enough, there must be a pony underneath it somewhere!"). The fact that the evidence in support of the Greater Manure Pile is secret means that its advocates can simply wink and lay their fingers alongside their noses and say "If you only knew what I knew..." and then ask for another billion dollars for their own surveillance empires.

Both rulings -- in support of, and against NSA spying -- cite Smith v. Maryland, a Supreme Court case that held that spying on one person's phone-metadata for a limited time was legal in order to catch a purse-snatcher. A secret interpretation of Smith was used by the Obama administration and the NSA to justify harvesting all phone metadata, of all people, all the time. Judge Pauley agreed that this was a reasonable interpretation. The ACLU disagreed: "[The decision] misinterprets the relevant statutes, understates the privacy implications of the government’s surveillance and misapplies a narrow and outdated precedent to read away core constitutional protections."

Read the rest

NYC think-tank devoted to critical analysis of Big Data seeks fellows

Outstanding social scientist danah boyd has founded a new thinktank (or "think/do-tank") called The Data & Society Research Institute, based in New York City, and devoted to critical analysis of big data, and "social, technical, ethical, legal, and policy issues that are emerging because of data-centric technological development." It's well-funded, with an exciting mission, and they're hiring.

Read the rest

Understanding spurious correlation in data-mining


Last May, Dave at Euri.ca took a crack at expanding Gabriel Rossman's excellent post on spurious correlation in data. It's an important read for anyone wondering whether the core hypothesis of the Big Data movement is that every sufficiently large pile of horseshit must have a pony in it somewhere. As O'Reilly's Nat Torkington says, "Anyone who thinks it’s possible to draw truthful conclusions from data analysis without really learning statistics needs to read this."
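For a feel of why bigger piles produce more imaginary ponies, here is a minimal simulation sketch. It is not taken from Dave's or Rossman's posts; the function names, sample sizes and seed are all assumptions of mine. It generates one random "target" series, then compares it against a growing number of equally random candidate series and reports the strongest correlation that turns up by pure chance.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def strongest_chance_correlation(n_candidates, n_points, seed=0):
    """Largest |r| between a random target and n_candidates random series.

    Every series is independent noise, so any correlation found is spurious.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_points)]
    best = 0.0
    for _ in range(n_candidates):
        candidate = [rng.gauss(0, 1) for _ in range(n_points)]
        best = max(best, abs(pearson(target, candidate)))
    return best


# Hypothetical sizes: 20 observations per series, 5 vs 5,000 candidates.
few = strongest_chance_correlation(n_candidates=5, n_points=20)
many = strongest_chance_correlation(n_candidates=5000, n_points=20)
print(f"best |r| among 5 noise series:    {few:.2f}")
print(f"best |r| among 5000 noise series: {many:.2f}")
```

Nothing in the data is related to anything else, yet the more candidates you search, the more impressive the best "finding" looks.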

Read the rest

Big Data meets Bigfoot

Big Data meets Bigfoot in Penn State PhD candidate Joshua Stevens's visualization of nearly a century of Sasquatch sighting reports in the US and Canada. Stevens mapped and graphed more than 3,000 sightings included in the Bigfoot Field Researchers Organization's database of geocoded and timestamped reports. Stevens writes:

Right away you can see that sightings are not evenly distributed. At first glance, it looks a lot like a map of population distribution. After all, you would expect sightings to be the most frequent in areas where there are a lot of people. But a bivariate view of the data shows a very different story. There are distinct regions where sightings are incredibly common, despite a very sparse population. On the other hand, in some of the most densely populated areas sasquatch sightings are exceedingly rare.
"‘Squatch Watch: 92 Years of Bigfoot Sightings in the US and Canada" (Thanks, everyone!)

Can has data-optimized cheeseburger? Yes.

Here's an Ignite talk by Hilary Mason, chief scientist at Bitly, explaining how she scraped data from multiple sources to create a service that locates NYC's optimal cheeseburger, using an algorithm that balances out price, proximity, and sentiment analysis from various review sites. As Mason points out, this isn't about cheeseburgers, really: it's about the power (and limits) of cross-referenced data.
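To make "balances out price, proximity, and sentiment" concrete, here is a toy sketch of one way such a ranking could work. This is not Mason's algorithm, which the post doesn't describe; the `Burger` class, the equal weights, the normalization and the sample data are all hypothetical, and it assumes positive prices and distances.

```python
from dataclasses import dataclass


@dataclass
class Burger:
    name: str
    price: float        # dollars (hypothetical data)
    distance_km: float  # distance from the hungry person
    sentiment: float    # 0.0 (hated) to 1.0 (loved), e.g. from reviews


def score(burger, max_price, max_distance, weights):
    """Weighted sum of cheapness, closeness and sentiment, each in 0..1."""
    w_price, w_distance, w_sentiment = weights
    cheapness = 1 - burger.price / max_price
    closeness = 1 - burger.distance_km / max_distance
    return (w_price * cheapness
            + w_distance * closeness
            + w_sentiment * burger.sentiment)


def best_burger(burgers, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Return the highest-scoring burger under the given weights."""
    max_price = max(b.price for b in burgers)
    max_distance = max(b.distance_km for b in burgers)
    return max(burgers,
               key=lambda b: score(b, max_price, max_distance, weights))


menu = [
    Burger("Cheap but far", price=5.0, distance_km=8.0, sentiment=0.5),
    Burger("Pricey but loved", price=20.0, distance_km=1.0, sentiment=0.9),
    Burger("Balanced", price=10.0, distance_km=2.0, sentiment=0.8),
]

print(best_burger(menu).name)                           # equal weights
print(best_burger(menu, weights=(1.0, 0.0, 0.0)).name)  # price only
```

Changing the weights changes the winner, which is rather the point: the "optimal" burger depends on human choices about what matters.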

Read the rest