
Publishing should fight ebook retailers for more data

I've got a guest column in the new edition of The Bookseller, the trade magazine for the UK publishing industry. It's called "Tangible Assets," and it points out that of all the fights that publishing has had with the ebook sector -- DRM, pricing, promotion -- the one they've missed is access to data. Whatever else is going on with publishers and Amazon, Google, Apple, et al, it is untenable and unsustainable that publishing knows almost nothing about its ebook customers and has no realtime view into its ebook sales, while the ebook channel knows almost everything, instantaneously.

I just came off a US tour for my YA novel Homeland, which Tor Teen published in the US in February, and which Titan will publish this coming September in the UK. I went to 23 cities in 25 days, a kind of bleary and awesome whirlwind where I got to see friends from across the USA—Internet People to a one—for about 8.5 minutes each, in a caffeinated, exhausted rush.

Inevitably, I had this conversation: "How's the book doing?" and I got to say: "Oh, awesome! It's a New York Times and Indienet bestseller!" (It stayed on the NYT list for four weeks, so I got to say this a lot). And then, always: "So, how many copies does that come out to?" And my answer was always, "No one knows."

This is where the Internet People began to boggle. "No one knows?"

"Oh, there's some Nielsen reporting from the tills of participating booksellers; you can get that if you spend a fortune. But there are no realtime ebook numbers given to the publishers. We'll all find out exactly how the book performed in a couple of months."

And that's where they lost their minds. The irate squawks that emerged from their throats were audible for miles. "You mean Amazon, Apple and Google know exactly who comes to their stores, how they find their way to your books, where they're coming in from, how many devices they use and when, and they don't tell the publishers?"

Tangible assets

Big Data: A Revolution That Will Transform How We Live, Work, and Think


Big Data is a new book from Viktor Mayer-Schonberger, a respected Internet governance theorist; and Kenneth Cukier, a long-time technology journalist who's been on the Economist for many years. As the title and pedigree imply, this is a business-oriented book about "Big Data," a computational approach to business, regulation, science and entertainment that uses data-mining applied to massive, Internet-connected data-sets to learn things that previous generations weren't able to see because their data was too thin and diffuse.

Big Data is an eminently practical and sensible book, but it's also an exciting and excitable text, one that conveys enormous enthusiasm for the field and its fruits. The authors use well-chosen examples to show how everything from shipping logistics to video-game design to healthcare stands to benefit from studying the whole data-set, rather than random samples. They even pose this as a simple way of thinking of big data versus "small data." Small data relies on statistical sampling, and emphasises the reliability and accuracy of each measurement. With big data, you sample the entire pool of activities -- all the books sold, all the operations performed -- and worry less about inaccuracies and anomalies in individual measurements, because these are drowned out by the huge numbers of observations performed.

As you'd expect, Big Data is particularly fascinating when it explores the business implications of all this: the changing leverage between firms that own data versus the firms that know how to make sense of it, and why sometimes data is best processed by unaffiliated third parties who can examine data from rival firms and find out things from which all parties stand to benefit, but which none of them could have discovered on their own. They also cover some of the bigger Big Data business blunders through history -- companies whose culture blinkered them to the opportunities in their data, which were exploited by clever rivals.

The last fifth of the book is dedicated to issues of governance, regulation, and public policy. This is some of the most interesting material in the book and probably needs to be expanded into its own volume. As it is, there's a real sense that the authors are just scraping the surface. For example, many of the stories told in the book have deep privacy implications, and the authors make a point of touching on these, cabining them with phrases like "so long as the data is anonymized" or "adhering to privacy policy, of course." But in the final third, the authors examine the transcendental difficulty of real-world anonymization, and the titanic business blunders committed by firms that believed they'd stripped out the personal information from the data, only to have the data "de-anonymized" and their customers' privacy invaded in small and large ways. These two facts -- that many of the opportunities require effective anonymization and that no one knows how to do anonymization -- are a pretty big stumbling block in the world of Big Data, but the authors don't explicitly acknowledge the conundrum.

While Big Data is an excellent primer on the opportunities of the field, it's thin on the risks, overall. For example, Big Data is rightly fascinated with stories about how we can look at data sets and find predictors of consequential things: for example, when Google mined its query-history and compared it with CDC data on flu outbreaks, it found that it could predict flu outbreaks ahead of the CDC, which is amazingly useful. However, all those search-strings were entered by people who didn't expect to have them mined for subsequent action. If searching for "scratchy throat" and "runny nose" gets your neighborhood quarantined (or gets it extra healthcare dollars), you might get all your friends to search on those terms over and over -- or not at all. Google knows this -- or it should -- because when it started measuring the number of links between sites to define the latent authority of different parts of the Internet, it got great results, but immediately triggered a whole scummy ecosystem of linkfarms and other SEO tricks that create links whose purpose is to produce more of the indicators Google is searching for.

Another important subject is algorithmic prediction in domains where the outcome is punishment, instead of reward. British Airways may get great results from using an algorithm to pick out passengers for upgrades, trying to find potential frequent fliers. But we should be very cautious about applying the same algorithm to building the TSA's No-Fly list. If BA's algorithm fails 20% of the time, it just means that a few lucky people get to ride up at the front of the plane. If the TSA has a 20% failure rate, it means that one in five "potential terrorists" is an innocent whose fundamental travel rights have been compromised by a secretive and unaccountable algorithm.
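To make that arithmetic concrete, here is a minimal sketch. The count of 1,000 flagged people is made up for illustration, and the 20% figure is the hypothetical from the paragraph above, not a measured rate for either organisation.

```python
def wrongly_flagged(flagged: int, failure_rate: float) -> int:
    """Number of flagged people who were misclassified, given a failure rate."""
    return round(flagged * failure_rate)


# Hypothetical numbers: each algorithm picks out 1,000 people and is wrong 20% of the time.
FLAGGED = 1000
FAILURE_RATE = 0.20

upgrade_mistakes = wrongly_flagged(FLAGGED, FAILURE_RATE)
no_fly_mistakes = wrongly_flagged(FLAGGED, FAILURE_RATE)

# Same error count, very different consequences for the people involved.
print(f"Passengers upgraded by mistake: {upgrade_mistakes}")
print(f"Innocent travellers barred from flying: {no_fly_mistakes}")
```

The two numbers are identical by construction; what differs is the cost of each mistake, which is the point of the comparison.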

Secrecy and accountability are the third important area for examination in a Big Data world. Cukier and Mayer-Schonberger propose a kind of inspector-general for algorithms who'll make sure they're not corrupted to punish the undeserving or line someone's pockets unjustly. But they also talk about the fact that these algorithms are likely to be illegible -- the product of a continuously evolving machine-learning system -- and that no one will be able to tell you why a certain person was denied credit, refused insurance, kept out of a university, or blackballed for a choice job. And when we get into a world where we can't distinguish between an algorithm that gets it wrong because the math is unreliable (a "fair" wrong outcome) and an algorithm that gets it wrong because its creators set out to punish the innocent or enrich the undeserving, then we can't and won't have justice. We know that computers make mistakes, but when we combine the understandable enthusiasm for Big Data's remarkable, counterintuitive recommendations with the mysterious and oracular nature of the algorithms that produce those conclusions, then we're taking on a huge risk when we put these algorithms in charge of anything that matters.

Big Data: A Revolution That Will Transform How We Live, Work, and Think

Previously: Book about big data, predictive behavior, and decision making

Raytheon making social-network-mining software to help gov'ts spy on citizens

Raytheon's "RIOT" (Rapid Information Overlay Technology) is intended to help governments all over the world by providing a "Google for spies" that mines multiple online sources to build up detailed pictures of the personal activities of their citizens:

The sophisticated technology demonstrates how the same social networks that helped propel the Arab Spring revolutions can be transformed into a "Google for spies" and tapped as a means of monitoring and control.

Using Riot it is possible to gain an entire snapshot of a person's life – their friends, the places they visit charted on a map – in little more than a few clicks of a button.

In the video obtained by the Guardian, it is explained by Raytheon's "principal investigator" Brian Urch that photographs users post on social networks sometimes contain latitude and longitude details – automatically embedded by smartphones within so-called "exif header data."

Riot pulls out this information, showing not only the photographs posted onto social networks by individuals, but also the location at which the photographs were taken.

"We're going to track one of our own employees," Urch says in the video, before bringing up pictures of "Nick," a Raytheon staff member used as an example target. With information gathered from social networks, Riot quickly reveals Nick frequently visits Washington Nationals Park, where on one occasion he snapped a photograph of himself posing with a blonde haired woman.

"We know where Nick's going, we know what Nick looks like," Urch explains, "now we want to try to predict where he may be in the future."

Riot can display on a spider diagram the associations and relationships between individuals online by looking at who they have communicated with over Twitter. It can also mine data from Facebook and sift GPS location information from Foursquare, a mobile phone app used by more than 25 million people to alert friends of their whereabouts. The Foursquare data can be used to display, in graph form, the top 10 places visited by tracked individuals and the times at which they visited them.

The video shows that Nick, who posts his location regularly on Foursquare, visits a gym frequently at 6am early each week. Urch quips: "So if you ever did want to try to get hold of Nick, or maybe get hold of his laptop, you might want to visit the gym at 6am on a Monday."

The associated patent says that Raytheon believes that its software can judge whether its subjects constitute a "security risk."

Software that tracks people on social media created by defence firm [Guardian/Ryan Gallagher]

ISP blinkenlights synchronized to a sprightly piano

Here's a lovely video shot at the XS4ALL ISP data-center outside of Amsterdam, in which the many twinkling, blinking lights are synchronized to a sprightly piano score.

De achterkant van het Internet (Thanks, Neils!)

How Twitter figures out the world with machine intelligence and Mechanical Turks

On Twitter's engineering blog, a fascinating description of how Twitter uses a blend of machine intelligence and Mechanical Turk tasks to figure out, in real time, what is going on in the world:

Before we delve into the details, here's an overview of how the system works.

  1. First, we monitor for which search queries are currently popular.
    Behind the scenes: we run a Storm topology that tracks statistics on search queries.
    For example, the query [Big Bird] may suddenly see a spike in searches from the US.

  2. As soon as we discover a new popular search query, we send it to our human evaluators, who are asked a variety of questions about the query.
    Behind the scenes: when the Storm topology detects that a query has reached sufficient popularity, it connects to a Thrift API that dispatches the query to Amazon's Mechanical Turk service, and then polls Mechanical Turk for a response.
    For example: as soon as we notice "Big Bird" spiking, we may ask judges on Mechanical Turk to categorize the query, or provide other information (e.g., whether there are likely to be interesting pictures of the query, or whether the query is about a person or an event) that helps us serve relevant Tweets and ads.

  3. Finally, after a response from an evaluator is received, we push the information to our backend systems, so that the next time a user searches for a query, our machine learning models will make use of the additional information. For example, suppose our evaluators tell us that [Big Bird] is related to politics; the next time someone performs this search, we know to surface ads by @barackobama or @mittromney, not ads about Dora the Explorer.

Improving Twitter search with real-time human computation (via Waxy)
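The three steps quoted above can be sketched in a few lines of Python. This is only an illustration of the flow, not Twitter's code: the real system is described as a Storm topology talking to Mechanical Turk through a Thrift API, whereas every name here (`SPIKE_THRESHOLD`, `find_popular_queries`, `annotate_new_queries`, `fake_judge`) is hypothetical, and the human evaluator is replaced by a stand-in function.

```python
from collections import Counter
from typing import Callable, Dict, List

# Hypothetical threshold: how many searches in one window count as "popular".
SPIKE_THRESHOLD = 3


def find_popular_queries(window: List[str], threshold: int = SPIKE_THRESHOLD) -> List[str]:
    """Step 1: count queries in a window of recent searches and return the popular ones."""
    counts = Counter(q.lower() for q in window)
    return [q for q, n in counts.items() if n >= threshold]


def annotate_new_queries(
    popular: List[str],
    ask_human: Callable[[str], str],
    annotations: Dict[str, str],
) -> None:
    """Steps 2 and 3: send unseen popular queries to a judge and store the answer."""
    for query in popular:
        if query not in annotations:  # only ask for a judgment once per query
            annotations[query] = ask_human(query)


def fake_judge(query: str) -> str:
    """Stand-in for a human evaluator; a real system would call a crowdsourcing service."""
    return "politics" if query == "big bird" else "unknown"


annotations: Dict[str, str] = {}
recent = ["Big Bird", "big bird", "weather", "Big Bird", "dora"]
annotate_new_queries(find_popular_queries(recent), fake_judge, annotations)
print(annotations)  # {'big bird': 'politics'}
```

A later search for "big bird" could then look up `annotations` to decide which Tweets and ads are relevant, which is the role the quoted post gives to its backend systems.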

Probability theory for programmers


Jeremy Kun, a mathematics PhD student at the University of Illinois at Chicago, has posted a wonderful primer on probability theory for programmers on his blog. It's a subject vital to machine learning and data-mining, and it's at the heart of much of the stuff going on with Big Data. His primer is lucid and easy to follow, even for math ignoramuses like me.

For instance, suppose our probability space is $\Omega = \{1, 2, 3, 4, 5, 6\}$ and $f$ is defined by setting $f(x) = 1/6$ for all $x \in \Omega$ (here the “experiment” is rolling a single die). Then we are likely interested in more exquisite kinds of outcomes; instead of asking the probability that the outcome is 4, we might ask what is the probability that the outcome is even? This event would be the subset $\{2, 4, 6\}$, and if any of these are the outcome of the experiment, the event is said to occur. In this case we would expect the probability of the die roll being even to be 1/2 (but we have not yet formalized why this is the case).

As a quick exercise, the reader should formulate a two-dice experiment in terms of sets. What would the probability space consist of as a set? What would the probability mass function look like? What are some interesting events one might consider (if playing a game of craps)?

Probability Theory — A Primer
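For programmers, the excerpt translates almost directly into code. The sketch below is one possible reading, not taken from the primer: it assumes the usual definition that an event's probability is the sum of the masses of its outcomes (the excerpt notes this has not yet been formalized at that point), and the names `omega`, `f`, and `probability` are my own.

```python
from fractions import Fraction
from itertools import product

# Single die: the probability space and a uniform probability mass function.
omega = {1, 2, 3, 4, 5, 6}
f = {x: Fraction(1, 6) for x in omega}


def probability(event, pmf):
    """Probability of an event (a subset of outcomes): sum the mass of its outcomes."""
    return sum((pmf[x] for x in event), Fraction(0))


even = {2, 4, 6}
print(probability(even, f))  # 1/2

# Two dice: outcomes are ordered pairs, each with mass 1/36.
omega2 = set(product(range(1, 7), repeat=2))
f2 = {pair: Fraction(1, 36) for pair in omega2}

# One event of interest in craps: the two dice sum to 7.
sum_is_seven = {pair for pair in omega2 if sum(pair) == 7}
print(probability(sum_is_seven, f2))  # 1/6
```

Using `Fraction` keeps the results exact, so the output reads as 1/2 and 1/6 rather than floating-point approximations.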

(Image: Dice, a Creative Commons Attribution (2.0) image from artbystevejohnson's photostream)

Civil rights implications of Big Data

An excellent editorial by Alistair Croll on the civil rights implications of Big Data contains a number of points I hadn't considered before, as well as great analysis of the way that the Big Data situation arrived:

“Personalization” is another word for discrimination. We’re not discriminating if we tailor things to you based on what we know about you — right? That’s just better service.

In one case, American Express used purchase history to adjust credit limits based on where a customer shopped, despite his excellent credit limit:

Johnson says his jaw dropped when he read one of the reasons American Express gave for lowering his credit limit: “Other customers who have used their card at establishments where you recently shopped have a poor repayment history with American Express.”

We’re seeing the start of this slippery slope everywhere from tailored credit-card limits like this one to car insurance based on driver profiles. In this regard, big data is a civil rights issue, but it’s one that society in general is ill-equipped to deal with.

We’re great at using taste to predict things about people. OKcupid’s 2010 blog post “The Real Stuff White People Like” showed just how easily we can use information to guess at race. It’s a real eye-opener (and the guys who wrote it didn’t include everything they learned — some of it was a bit too controversial). They simply looked at the words one group used which others didn’t often use. The result was a list of “trigger” words for a particular race or gender.

Big data is our generation’s civil rights issue, and we don’t know it (Thanks, Bruce!)

Open-data Cities Conference in Brighton, England: turning municipal governments into open data collaborators

Adam sez, "The first Open-data Cities Conference takes place in Brighton, England next week. It's aimed at local councils and government agencies who want to open up more of their datasets, and giving them ideas and practical help on how to do it. There's some good speakers, including Tom Steinberg from MySociety and Rufus Pollock from the Open Knowledge Foundation."

The high-profile conference – the first of its kind in the United Kingdom – will focus on how publicly-funded organisations can engage with citizens to build more creative, prosperous and accountable communities.

It will be attended by more than 200 people who believe the value of public data is greatest when it is freely and openly shared. They will be leaders from the public sector, arts and cultural organisations, and creative and digital industries.

The focus will be on the opportunities to improve the lives of more than 10 million citizens in the UK’s biggest cities.

Open-data Cities Conference (Thanks, Adam!)

Bundled, Buried & Behind Closed Doors: documentary on the net's hidden physical infrastructure

Ben sez, "I want to share a short documentary that I recently produced about the hidden Infrastructure of the Internet called Bundled, Buried and Behind Closed Doors. The video is meant to remind viewers that the Internet is a physical, geographically anchored thing. It features a tour inside Telx's 9th floor Internet exchange at 60 Hudson Street in New York City, and explores how this building became one of the world's most concentrated hubs of Internet connectivity."

Lower Manhattan’s 60 Hudson Street is one of the world’s most concentrated hubs of Internet connectivity. This short documentary peeks inside, offering a glimpse of the massive material infrastructure that makes the Internet possible.

Featuring interviews with Stephen Graham, Saskia Sassen, Dave Timmes of Telx, Rich Miller of datacenterknowledge.com, Stephen Klenert of Atlantic Metro Communications, and Josh Wallace of the City of Palo Alto Utilities.

Bundled, Buried & Behind Closed Doors (Thanks, Ben!)

The Revolution Will Be Digitised: how Cablegate, Facebook, Google and regulation will shape the future

Heather Brooke is the American-trained "data journalist" who upended British politics when she moved to the UK and began to use the UK's Freedom of Information law to prise apart the dirty secrets of power and privilege, most notably by exposing expense cheating by Members of Parliament. Brooke's latest book is The Revolution Will Be Digitised: Dispatches from the Information War, a history of her involvement in the Wikileaks cable-dumps and a meditation on the meaning and role of data-driven journalism in the coming years, as governments ramp up their attempts to lock down the Internet, and journalists, hackers, and activists attempt to open things further.

Brooke is uniquely situated to produce this analysis as someone who was both part of the Cablegate dump and someone who reported on it. She documents her odd and sometimes unpleasant dealings with Assange as well, but the Assange story isn't the most important aspect of Cablegate or this book, and Brooke's focus is thankfully on the broader narrative. This isn't another book that treats the Wikileaks phenomenon as a cult-of-personality story revolving around one person.

Brooke journeys to the hacker scenes in Berlin, San Francisco and Boston, and the radicalized halls of power in Iceland, and spins a story that does a good job of explaining what, exactly, happened with Cablegate: how the cables got out, the intrigues and infighting amongst the players (media, hackers, activists) and the governmental spin in response.

Here is one place where Brooke really opened my eyes: there are many people who make blanket assertions about the US government's manipulation of the press. But Brooke has concrete details, and the surprising intelligence that while the US does not have a "public broadcaster" like the BBC or public newspaper subsidies like Norway, it outspends both of them in its formidable press-offices at every level of government and military. In other words, the US doesn't have public news media, but it spends an equivalent sum on spin-doctors whose job it is to control the narrative in the "free-enterprise" press.

Brooke finishes the book with a manifesto of sorts, a call to arms to press, politicos and public to confront the coming deluge of data and channel it for transparency and accountability, but away from surveillance and invasion of privacy (a delicate operation, to be sure!) and to resist using the net as an excuse for more intrusive information policy. The book's website has more on this.

Talk on the privacy bargain, big data, and human sensors versus human barcodes

Here's the video from the talk I gave last week at the O'Reilly Strata conference on "big data" in NYC. The talk is called "Designing for Human Sensors, Not Human Barcodes," and it examines the philosophy underpinning the "privacy bargain" we strike online when we trade personal information for access to services.

Big Data and privacy

Earlier this week, I gave a talk on the way that "Big Data" is underpinned with a kind of myth about how users trade privacy for services. Ciara Byrne from the NYT's VentureBeat interviewed me afterwards about it. I think she did a really good job of condensing a hard, nuanced question into a brief and informative article.

Cory coming to Toronto, Ann Arbor, Brooklyn and NYC

Hey, Torontonians, Ann Arborites, and New Yorkers!

I'll be giving a free talk at the Art Gallery of Ontario in Toronto called "Can creativity and freedom peacefully co-exist in the Internet age?" on Sept 14 at 7PM, where I'll be reprising my SIGGRAPH talk from August.

On Sept 15, I'll be in Ann Arbor, MI for the Penny Stamps Lecture Series, doing a panel called "On Futurology: Optimism And Failure" with Mark Stevenson and James King.

I head to New York next. First I'll be at the Brooklyn Book Festival on September 18, appearing on a 1200h panel called "Genres Crashers" with Jewell Parker Rhodes, Kelly Link and Stephanie Anderson.

Finally, I'm keynoting the O'Reilly Strata conference on September 20 at 1330h, with a talk called "Designing For Human Sensors, Not Human Barcodes."

Hope to see you there!