# Aaron Swartz's unfinished monograph on the "programmable Web"

Michael B. Morgan, CEO of Morgan & Claypool Publishers, writes:

In 2009, we invited Aaron Swartz to contribute a short work to our series on Web Engineering (now The Semantic Web: Theory and Technology). He produced a draft of about 40 pages -- a "first version" to be extended later -- but that extension unfortunately never happened.

After his death in January, we decided (with his family's blessing) that it would be a good idea to publish this work so people could read his ideas about programming the Web, his ambivalence about different aspects of Semantic Web technology, his thoughts on Openness, and more.

(Thanks, Michael!)

# Big Data: A Revolution That Will Transform How We Live, Work, and Think

Big Data is a new book from Viktor Mayer-Schonberger, a respected Internet governance theorist; and Kenneth Cukier, a long-time technology journalist who has been at the Economist for many years. As the title and pedigree imply, this is a business-oriented book about "Big Data," a computational approach to business, regulation, science and entertainment that applies data-mining to massive, Internet-connected data-sets to learn things that previous generations weren't able to see because their data was too thin and diffuse.

Big Data is an eminently practical and sensible book, but it's also an exciting and excitable text, one that conveys enormous enthusiasm for the field and its fruits. The authors use well-chosen examples to show how everything from shipping logistics to video-game design to healthcare stands to benefit from studying the whole data-set, rather than random samples. They even offer a simple way of thinking about big data versus "small data." Small data relies on statistical sampling, and emphasises the reliability and accuracy of each measurement. With big data, you sample the entire pool of activities -- all the books sold, all the operations performed -- and worry less about inaccuracies and anomalies in individual measurements, because these are drowned out by the huge numbers of observations performed.
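That "drowned out" claim is easy to act out in a few lines. Here's a hedged sketch with an invented population of a million transactions: a careful small sample and the full pool -- even after one percent of its records are badly corrupted -- both land on essentially the same answer.

```python
import random

random.seed(0)

# Hypothetical population: 1,000,000 measurements, true mean value 100.
population = [random.gauss(100, 15) for _ in range(1_000_000)]

# "Small data": a careful random sample of 500 measurements.
sample_mean = sum(random.sample(population, 500)) / 500

# "Big data": the entire pool, even after corrupting 1% of the records
# with wild errors -- the noise drowns in the sheer volume.
noisy = list(population)
for i in random.sample(range(len(noisy)), 10_000):
    noisy[i] += random.gauss(0, 200)
big_mean = sum(noisy) / len(noisy)

print(round(sample_mean, 1), round(big_mean, 1))  # both land near 100
```

The numbers here are made up, but the shape of the argument is the book's: with enough observations, individual errors stop mattering.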

As you'd expect, Big Data is particularly fascinating when it explores the business implications of all this: the changing leverage between firms that own data versus the firms that know how to make sense of it, and why sometimes data is best processed by unaffiliated third parties who can examine data from rival firms and find out things from which all parties stand to benefit, but which none of them could have discovered on their own. They also cover some of the bigger Big Data business blunders through history -- companies whose culture blinkered them to the opportunities in their data, which were exploited by clever rivals.

The last fifth of the book is dedicated to issues of governance, regulation, and public policy. This is some of the most interesting material in the book and probably needs to be expanded into its own volume. As it is, there's a real sense that the authors are just scraping the surface. For example, many of the stories told in the book have deep privacy implications, and the authors make a point of touching on these, cabining them with phrases like "so long as the data is anonymized" or "adhering to privacy policy, of course." But in this final section, the authors examine the transcendental difficulty of real-world anonymization, and the titanic business blunders committed by firms that believed they'd stripped out the personal information from the data, only to have the data "de-anonymized" and their customers' privacy invaded in small and large ways. These two facts -- that many of the opportunities require effective anonymization and that no one knows how to do anonymization -- are a pretty big stumbling block in the world of Big Data, but the authors don't explicitly acknowledge the conundrum.

While Big Data is an excellent primer on the opportunities of the field, it's thin on the risks, overall. For example, Big Data is rightly fascinated with stories about how we can look at data sets and find predictors of consequential things: for example, when Google mined its query-history and compared it with CDC data on flu outbreaks, it found that it could predict flu outbreaks ahead of the CDC, which is amazingly useful. However, all those search-strings were entered by people who didn't expect to have them mined for subsequent action. If searching for "scratchy throat" and "runny nose" gets your neighborhood quarantined (or gets it extra healthcare dollars), you might get all your friends to search on those terms over and over -- or not at all. Google knows this -- or it should -- because when it started measuring the number of links between sites to define the latent authority of different parts of the Internet, it got great results, but immediately triggered a whole scummy ecosystem of linkfarms and other SEO tricks that create links whose purpose is to produce more of the indicators Google is searching for.

Another important subject is looking at algorithmic prediction in domains where the outcome is punishment, instead of reward. British Airways may get great results from using an algorithm to pick out passengers for upgrades, trying to find potential frequent fliers. But we should be very cautious about applying the same algorithm to building the TSA's No-Fly list. If BA's algorithm fails 20% of the time, it just means that a few lucky people get to ride up front of the plane. If the TSA has a 20% failure rate, it means that one in five "potential terrorists" is an innocent whose fundamental travel rights have been compromised by a secretive and unaccountable algorithm.

Secrecy and accountability are the third important area for examination in a Big Data world. Cukier and Mayer-Schonberger propose a kind of inspector-general for algorithms who'll make sure they're not corrupted to punish the undeserving or line someone's pockets unjustly. But they also talk about the fact that these algorithms are likely to be illegible -- the product of a continuously evolving machine-learning system -- and that no one will be able to tell you why a certain person was denied credit, refused insurance, kept out of a university, or blackballed for a choice job. And when you get into a world where you can't distinguish between an algorithm that gets it wrong because the math is unreliable (a "fair" wrong outcome) from an algorithm that gets it wrong because its creators set out to punish the innocent or enrich the undeserving, then we can't and won't have justice. We know that computers make mistakes, but when we combine the understandable enthusiasm for Big Data's remarkable, counterintuitive recommendations with the mysterious and oracular nature of the algorithms that produce those conclusions, then we're taking on a huge risk when we put these algorithms in charge of anything that matters.

# How an algorithm came up with Amazon's KEEP CALM AND RAPE A LOT t-shirt

You may have heard that Amazon is selling a "KEEP CALM AND RAPE A LOT" t-shirt. How did such a thing come to pass? Well, as Pete Ashton explains, this is a weird outcome of an automated algorithm that just tries random variations on "KEEP CALM AND," offering them for sale in Amazon's third-party marketplace and printing them on demand if any of them manage to find a buyer.

The t-shirts are created by an algorithm. The word “algorithm” is a little scary to some people because they don’t know what it means. It’s basically a process automated by a computer programme, sometimes simple, sometimes complex as hell. Amazon’s recommendations are powered by an algorithm. They look at what you’ve been browsing and buying, find patterns in that behaviour and show you things the algorithm thinks you might like to buy. Amazon’s algorithms are very complex and powerful, which is why they work. The algorithm that creates these t-shirts is not complex or powerful. This is how I expect it works.

1) Start a sentence with the words KEEP CALM AND.

2) Pick a word from this long list of verbs. Any word will do. Don’t worry, I’m sure they’re all fine.

3) Finish the sentence with one of the following: OFF, THEM, IF, THEM or US.

4) Lay these words out in the classic Keep Calm style.

5) Create a mockup jpeg of a t-shirt.

6) Submit the design to Amazon using our boilerplate t-shirt description.

7) Go back to 1 and start again.

There are currently 529,493 Solid Gold Bomb clothing items on Amazon. Assuming they survive this and don’t get shitcanned by Amazon I wouldn’t be at all surprised if they top a million in a few months.

It costs nothing to create the design, nothing to submit it to Amazon and nothing for Amazon to host the product. If no-one buys it then the total cost of the experiment is effectively zero. But if the algorithm stumbles upon something special, something that is both unique and funny and actually sells, then everyone makes money.
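Ashton's seven steps are trivial to act out in code, which is rather the point. A minimal sketch -- the verb list and endings below are illustrative stand-ins, not Solid Gold Bomb's actual data:

```python
import random

# Hypothetical word lists; the real generator's lists are unknown.
VERBS = ["CARRY", "SHOUT", "DANCE", "WANDER", "TINKER"]
ENDINGS = ["OFF", "THEM", "IF", "THEM", "US"]

def keep_calm_slogan(rng=random):
    verb = rng.choice(VERBS)      # step 2: any verb will do
    ending = rng.choice(ENDINGS)  # step 3: pick an ending
    return f"KEEP CALM AND {verb} {ending}"  # step 1 plus the rest

# Steps 4-7 (lay out the image, mock up a shirt, submit, repeat) are
# just a loop around this string; each extra design costs nothing.
for _ in range(3):
    print(keep_calm_slogan())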

# Profane commit-messages from GitHub

Commit Logs From Last Night highlights funny, profane source-code commit-messages from GitHub, as bedraggled hackers find themselves leaving notes documenting their desperate situations. Some recent ones:

WHY THE GODDAMMIT WHY WHY WHY HAROGIHAROGIAHRGOIA FUCK ME

render testing I DREW SOME LINES! reverted render panel to grew (white looks shit)

Merge pull request #15 from ruvetia/font_awesome_is_fucking_awesome include font-awesome into the projcet

# Students get class-wide As by boycotting test, solving Prisoner's Dilemma

Johns Hopkins computer science prof Peter Fröhlich grades his students' tests on a curve -- the top-scoring student gets an A, and the rest of the students are graded relative to that brainiac. But last term, his students came up with an ingenious, cooperative solution to this system: they all boycotted the test, meaning that they all scored zero, and that zero was the top score, and so they all got As. The prof was surprisingly cool about it:

Fröhlich took a surprisingly philosophical view of his students' machinations, crediting their collaborative spirit. "The students learned that by coming together, they can achieve something that individually they could never have done," he said via e-mail. “At a school that is known (perhaps unjustly) for competitiveness I didn't expect that reaching such an agreement was possible.”

The story of the boycott is a sterling example of how computer networks solve collective action problems -- the students solved a prisoner's dilemma in a mutually optimal way without having to iterate, which is impressive:

“The students refused to come into the room and take the exam, so we sat there for a while: me on the inside, they on the outside,” Fröhlich said. “After about 20-30 minutes I would give up.... Then we all left.” The students waited outside the rooms to make sure that others honored the boycott, and were poised to go in if someone had. No one did, though.

Andrew Kelly, a student in Fröhlich’s Introduction to Programming class who was one of the boycott’s key organizers, explained the logic of the students' decision via e-mail: "Handing out 0's to your classmates will not improve your performance in this course," Kelly said.

"So if you can walk in with 100 percent confidence of answering every question correctly, then your payoff would be the same for either decision. Just consider the impact on your other exam performances if you studied for [the final] at the level required to guarantee yourself 100. Otherwise, it's best to work with your colleagues to ensure a 100 for all and a very pleasant start to the holidays."

Fröhlich has changed the grading system -- but he's also now offering the students a final project instead of a final exam, should they choose.

Dangerous Curves [Zack Budryk/Inside Higher Ed]

# Malware-Industrial Complex: how the trade in software bugs is weaponizing insecurity

Here's a must-read story from Tech Review about the thriving trade in "zero-day exploits" -- critical software bugs that are sold off to military contractors to be integrated into offensive malware, rather than reported to the manufacturer for repair. The stuff built with zero-days -- network appliances that can snoop on a whole country, even supposedly secure conversations; viruses that can hijack the camera and microphone on your phone or laptop; and more -- is the modern equivalent of landmines and cluster bombs: antipersonnel weapons that end up in the hands of criminals, thugs and dictators who use them to figure out whom to arrest, torture, and murder. The US government is encouraging this market by participating actively in it, even as it makes a lot of noise about "cyber-defense."

Exploits for mobile operating systems are particularly valued, says Soghoian, because unlike desktop computers, mobile systems are rarely updated. Apple sends updates to iPhone software a few times a year, meaning that a given flaw could be exploited for a long time. Sometimes the discoverer of a zero-day vulnerability receives a monthly payment as long as a flaw remains undiscovered. “As long as Apple or Microsoft has not fixed it you get paid,” says Soghoian.

No law directly regulates the sale of zero-days in the United States or elsewhere, so some traders pursue it quite openly. A Bangkok, Thailand-based security researcher who goes by the name “the Grugq” has spoken to the press about negotiating deals worth hundreds of thousands of dollars with government buyers from the United States and western Europe. In a discussion on Twitter last month, in which he was called an “arms dealer,” he tweeted that “exploits are not weapons,” and said that “an exploit is a component of a toolchain … the team that produces & maintains the toolchain is the weapon.”

The Grugq contacted MIT Technology Review to state that he has made no “public statement about exploit sales since the Forbes article.”

Some small companies are similarly up-front about their involvement in the trade. The French security company VUPEN states on its website that it “provides government-grade exploits specifically designed for the Intelligence community and national security agencies to help them achieve their offensive cyber security and lawful intercept missions.” Last year, employees of the company publicly demonstrated a zero-day flaw that compromised Google’s Chrome browser, but they turned down Google’s offer of a $60,000 reward if they would share how it worked. What happened to the exploit is unknown.

Welcome to the Malware-Industrial Complex [Tom Simonite/MIT Technology Review]

# Robots say the craziest things

This morning, while hurrying down the concourse at La Guardia Airport, I tried to dictate a text message to my Nexus 4 while wheeling my suitcase behind me. It got the dictation fine, but appended "kdkdkdkdkdkdkdkd" to the message -- this being its interpretation of the sound of my suitcase wheels on the tiles. Cory

# Regular expressions crossword

On Coinheist.com, a crossword puzzle you solve by interpreting regular expressions.
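The conceit is that every row and every column of the grid must fully match its regex clue. Here's a toy 2×2 version of the idea (the grid and clues below are my own invented example, not from the actual puzzle):

```python
import re

# A tiny 2x2 "regex crossword": each row and column must fully match
# its clue. These clues are a made-up illustration of the format.
row_clues = [r"HE|LL|O+", r"[PLEASE]+"]
col_clues = [r"[^SPEAK]+", r"EP|IP|EF"]

def check(grid, rows, cols):
    """Return True iff every row and column fully matches its clue."""
    for line, clue in zip(grid, rows):
        if not re.fullmatch(clue, line):
            return False
    for i, clue in enumerate(cols):
        column = "".join(line[i] for line in grid)
        if not re.fullmatch(clue, column):
            return False
    return True

print(check(["HE", "LP"], row_clues, col_clues))  # True
```

Solving one by hand means intersecting the constraints in your head; the checker just confirms your guess.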

# Casino panopticon: a look at the CCTV room in the Vegas Aria

A fascinating article in The Verge looks at the history of casino cheating and talks to Ted Whiting, director of surveillance at the Aria casino in Vegas, who specced out a huge, showy CCTV room with feeds from more than 1,100 cameras. They use a lot of machine intelligence to raise potential cheating to the attention of the operators.

Despite that, Whiting says facial recognition software hasn’t been of much use to him. It’s simply too unreliable when it comes to spotting people on the move, in crowds, and under variable lighting. Instead, he and his team rely on pictures shared from other casinos, as well as through the Biometrica and Griffin databases. (The Griffin database, which contains pictures and descriptions of various undesirables, used to go to subscribers as massive paper volumes.) But quite often, they’re not looking for specific people, but rather patterns of behavior. "Believe it or not, when you've done this long enough," he says, "you can tell when somebody's up to no good. It just doesn't feel right."

They keep a close eye on the tables, since that’s where cheating’s most likely to occur. With 1080p high-definition cameras, surveillance operators can read cards and count chips — a significant improvement over earlier cameras. And though facial recognition doesn’t yet work reliably enough to replace human operators, Whiting’s excited at the prospects of OCR. It’s already proven useful for identifying license plates. The next step, he says, is reading cards and automatically assessing a player’s strategy and skill level. In the future, maybe, the cameras will spot card counters and other advantage players without any operator intervention. (Whiting, a former advantage player himself, can often spot such players. Rather than kick them out, as some casinos did in the past, Aria simply limits their bets, making it economically disadvantageous to keep playing.)

With over a thousand cameras operating 24/7, the monitoring room creates tremendous amounts of data every day, most of which goes unseen. Six technicians watch about 40 monitors, but all the feeds are saved for later analysis. One day, as with OCR scanning, it might be possible to search all that data for suspicious activity. Say, a baccarat player who leaves his seat, disappears for a few minutes, and is replaced with another player who hits an impressive winning streak. An alert human might spot the collusion, but even better, video analytics might flag the scene for further review. The valuable trend in surveillance, Whiting says, is toward this data-driven analysis (even when much of the job still involves old-fashioned gumshoe work). "It's the data," he says, "And cameras now are data. So it's all data. It's just learning to understand that data is important."

One thing I wanted to see in this piece was some reflection on how the casino's level of surveillance, and the casino theory of justice (spy on everyone to catch the guilty few), have become the new normal across the world.

Not in my house: how Vegas casinos wage a war on cheating [Jesse Hicks/The Verge]

(via Kottke)

# Montreal comp sci student reports massive bug, is expelled and threatened with arrest for checking to see if it had been fixed

Ahmed Al-Khabaz was a 20-year-old computer science student at Dawson College in Montreal, until he discovered a big, glaring bug in Omnivox, software widely used by Quebec's junior college system. The bug exposed the personal information (social insurance number, home address, class schedule) of its users. When Al-Khabaz reported the bug to François Paradis, his college's Director of Information Services and Technology, he was congratulated. But when he checked a few days later to see if the bug had been fixed, he was threatened with arrest and made to sign a secret gag-order whose existence he wasn't allowed to disclose. Then, he was expelled:

“I was called into a meeting with the co–ordinator of my program, Ken Fogel, and the dean, Dianne Gauvin,” says Mr. Al-Khabaz. “They asked a lot of questions, mostly about who knew about the problems and who I had told. I got the sense that their primary concern was covering up the problem.”

Following this meeting, the fifteen professors in the computer science department were asked to vote on whether to expel Mr. Al-Khabaz, and fourteen voted in favour. Mr. Al-Khabaz argues that the process was flawed because he was never given a chance to explain his side of the story to the faculty. He appealed his expulsion to the academic dean and even director-general Richard Filion. Both denied the appeal, leaving him in academic limbo.

“I was acing all of my classes, but now I have zeros across the board. I can’t get into any other college because of these grades, and my permanent record shows that I was expelled for unprofessional conduct. I really want this degree, and now I won’t be able to get it. My academic career is completely ruined. In the wrong hands, this breach could have caused a disaster. Students could have been stalked, had their identities stolen, their lockers opened and who knows what else. I found a serious problem, and tried to help fix it. For that I was expelled.”

The thing that gets me, as a member of a computer science faculty, is how gutless his instructors were in their treatment of this promising student. They're sending a clear signal that you're better off publicly disclosing bugs without talking to faculty or IT than going through channels, because "responsible disclosure" means that bugs go unpatched, students go unprotected, and your own teachers will never, ever have your back.

Shame on them.

# How Twitter figures out the world with machine intelligence and Mechanical Turks

On Twitter's engineering blog, a fascinating description of how Twitter uses a blend of machine intelligence and Mechanical Turk tasks to figure out, in real time, what is going on in the world:

Before we delve into the details, here's an overview of how the system works.

1. First, we monitor for which search queries are currently popular.
Behind the scenes: we run a Storm topology that tracks statistics on search queries.
For example, the query [Big Bird] may suddenly see a spike in searches from the US.

2. As soon as we discover a new popular search query, we send it to our human evaluators, who are asked a variety of questions about the query.
Behind the scenes: when the Storm topology detects that a query has reached sufficient popularity, it connects to a Thrift API that dispatches the query to Amazon's Mechanical Turk service, and then polls Mechanical Turk for a response.
For example: as soon as we notice "Big Bird" spiking, we may ask judges on Mechanical Turk to categorize the query, or provide other information (e.g., whether there are likely to be interesting pictures of the query, or whether the query is about a person or an event) that helps us serve relevant Tweets and ads.

3. Finally, after a response from an evaluator is received, we push the information to our backend systems, so that the next time a user searches for a query, our machine learning models will make use of the additional information. For example, suppose our evaluators tell us that [Big Bird] is related to politics; the next time someone performs this search, we know to surface ads by @barackobama or @mittromney, not ads about Dora the Explorer.
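The spike-detection half of step 1 is the easiest part to picture. Here's a back-of-the-envelope sketch with plain Python standing in for the Storm topology; `send_to_turk` is a stand-in for the Thrift-to-Mechanical-Turk dispatch described above, not Twitter's actual API:

```python
from collections import Counter, deque

WINDOW = 1000          # how many recent queries to keep statistics on
SPIKE_THRESHOLD = 50   # count at which a query is deemed "popular"

recent = deque(maxlen=WINDOW)
counts = Counter()
dispatched = set()

def send_to_turk(query):
    # Placeholder for the real dispatch to human evaluators.
    print(f"asking human judges to categorize: {query}")

def observe(query):
    if len(recent) == WINDOW:   # slide the window: forget the oldest
        counts[recent[0]] -= 1
    recent.append(query)
    counts[query] += 1
    if counts[query] >= SPIKE_THRESHOLD and query not in dispatched:
        dispatched.add(query)   # step 2: ship it to evaluators, once
        send_to_turk(query)

for _ in range(60):
    observe("big bird")         # a sudden burst of identical searches
```

The real system tracks far subtler statistics than a raw count, but the shape -- sliding window, threshold, one-time dispatch to humans -- is the same.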

# Probability theory for programmers

Jeremy Kun, a mathematics PhD student at the University of Illinois in Chicago, has posted a wonderful primer on probability theory for programmers on his blog. It's a subject vital to machine learning and data-mining, and it's at the heart of much of the stuff going on with Big Data. His primer is lucid and easy to follow, even for math ignoramuses like me.

For instance, suppose our probability space is $\Omega = \left \{ 1, 2, 3, 4, 5, 6 \right \}$ and $f$ is defined by setting $f(x) = 1/6$ for all $x \in \Omega$ (here the “experiment” is rolling a single die). Then we are likely interested in more exquisite kinds of outcomes; instead of asking the probability that the outcome is 4, we might ask what is the probability that the outcome is even? This event would be the subset $\left \{ 2, 4, 6 \right \}$, and if any of these are the outcome of the experiment, the event is said to occur. In this case we would expect the probability of the die roll being even to be 1/2 (but we have not yet formalized why this is the case).

As a quick exercise, the reader should formulate a two-dice experiment in terms of sets. What would the probability space consist of as a set? What would the probability mass function look like? What are some interesting events one might consider (if playing a game of craps)?
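Kun's die example, and his two-dice exercise, translate directly into a few lines of Python -- the probability of an event is just the sum of the mass function over the outcomes in it. The craps event below is my own choice of "interesting event": winning on the first roll with a 7 or 11.

```python
from fractions import Fraction

# The single-die space from the excerpt: uniform mass 1/6 on {1..6}.
omega = range(1, 7)
f = {x: Fraction(1, 6) for x in omega}

def prob(event, mass):
    """P(event) = sum of the mass function over the event's outcomes."""
    return sum(mass[x] for x in event)

even = {2, 4, 6}
print(prob(even, f))  # 1/2

# Two dice: the space is all ordered pairs, each with mass 1/36.
pairs = [(a, b) for a in omega for b in omega]
f2 = {p: Fraction(1, 36) for p in pairs}
natural = {p for p in pairs if sum(p) in (7, 11)}  # first-roll craps win
print(prob(natural, f2))  # 2/9
```

Note that the "probability of even is 1/2" the excerpt promises to formalize falls straight out of summing the mass function -- exactly the formalization Kun builds.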

(Image: Dice, a Creative Commons Attribution (2.0) image from artbystevejohnson's photostream)

# Phrases used by corporate fraudsters

The FBI and Ernst & Young have released a list of top-ten phrases that indicate corporate fraud, based on data-mining evidence from real corporate fraud investigations.

In total more than 3,000 terms are logged by the technology, which monitors for conversations within the "fraud triangle", where pressure, rationalisation, and opportunity meet, said the FBI and Ernst & Young...

1. Cover up
2. Write off
3. Illegal
4. Failed investment
5. Nobody will find out
6. Grey area
7. They owe it to me
8. Do not volunteer information
9. Not ethical
10. Off the books
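The monitoring described is, at bottom, phrase-matching over e-mail at scale. A minimal sketch of that kind of flagging, using the ten phrases above -- the matching logic is my own guess, not Ernst & Young's actual technology:

```python
import re

PHRASES = [
    "cover up", "write off", "illegal", "failed investment",
    "nobody will find out", "grey area", "they owe it to me",
    "do not volunteer information", "not ethical", "off the books",
]

def flag_phrases(text):
    """Return the watched phrases that appear in a message body."""
    lowered = text.lower()
    return [p for p in PHRASES
            if re.search(r"\b" + re.escape(p) + r"\b", lowered)]

email = "Keep this off the books -- nobody will find out."
print(flag_phrases(email))  # ['nobody will find out', 'off the books']
```

The real system reportedly watches more than 3,000 terms and weighs where in the "fraud triangle" they fall, but a naive matcher like this shows why false positives are inevitable: "write off" and "grey area" appear in plenty of innocent mail.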

# Inception: a tool for compromising the slumber of computers with full-disk encryption

Inception is a tool for breaking into computers with full-disk encryption. It assumes that you have access to a suspended/screen-locked computer whose disk is encrypted. You access the machine over its FireWire interface (or, if it doesn't have FireWire, you plug a FireWire card into one of its slots, and the machine will automatically fetch, install and configure the drivers, even if it's asleep), and then use the FireWire drivers to directly access system memory, and from there, patch the password-checking routine and walk straight into the computer.

This (and its predecessors, like winlockpwn) is a substantial advance on previous attacks against sleeping full-disk encrypted systems, which involved things like plunging the RAM into a bath of liquid nitrogen. As the author, Carsten Maartmann-Moe, points out, this can't be easily remedied with a FireWire driver update, since FireWire requires direct memory access to effect high-speed transfers.

So, two things: First, shut down your computer when it's not in your possession; second, "Inception" is an inspired name for an attack that breaks into the dreams of a sleeping computer, directly accesses its memory, and causes it to spill its secrets.

Inception’s main mode works as follows: By presenting a Serial Bus Protocol 2 (SBP-2) unit directory to the victim machine over the IEEE1394 FireWire interface, the victim operating system thinks that a SBP-2 device has connected to the FireWire port. Since SBP-2 devices utilize Direct Memory Access (DMA) for fast, large bulk data transfers (e.g., FireWire hard drives and digital camcorders), the victim lowers its shields and enables DMA for the device. The tool now has full read/write access to the lower 4GB of RAM on the victim. Once DMA is granted, the tool proceeds to search through available memory pages for signatures at certain offsets in the operating system’s password authentication modules. Once found, the tool short circuits the code that is triggered if an incorrect password is entered.

An analogy for this operation is planting an idea into the memory of the machine; the idea that every password is correct. In other words, the nerdy equivalent of a memory inception.

After running the tool you should be able to log into the victim machine using any password.
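The search-and-patch core of the passage -- scan memory for a signature, then short-circuit it -- can be acted out on an ordinary byte buffer. This is not a working attack, just the idea; the buffer stands in for RAM reached over DMA, and both the "signature" and the patch bytes are made up:

```python
# Hypothetical auth-check opcodes and a NOP-style overwrite; the real
# signatures are OS- and version-specific.
SIGNATURE = bytes.fromhex("8345fc01")
PATCH     = bytes.fromhex("90909090")

def patch_memory(ram: bytearray) -> bool:
    """Find the signature and short-circuit it, as the tool does over FireWire."""
    offset = ram.find(SIGNATURE)
    if offset == -1:
        return False  # signature not present in this region
    ram[offset:offset + len(PATCH)] = PATCH
    return True

# Simulated 4 KB page with the signature buried in it.
ram = bytearray(4096)
ram[1024:1028] = SIGNATURE
print(patch_memory(ram), ram[1024:1028].hex())  # True 90909090
```

The real tool does this across the lower 4GB of physical memory, at known offsets inside the OS's password-authentication modules; the principle, though, is exactly this small.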

Inception (via JWZ)

# Community Memory: a social media terminal from 1973

Wired's gallery of the paleolithic antecedents of today's social media technologies is a bit mismatched (some really interesting insights into today's media lineage, but mixed with some silliness), but the lead item, the Community Memory terminal from 1973, is pure gold. I wrote half an unsuccessful novel about this thing when I was about 25, and it's never stopped haunting me.

Three decades before Yelp and Craigslist, there was the Community Memory Terminal.

In the early 1970s, Efrem Lipkin, Mark Szpakowski and Lee Felsenstein set up a series of these terminals around San Francisco and Berkeley, providing access to an electronic bulletin board housed by an XDS-940 mainframe computer.

This started out as a social experiment to see if people would be willing to share via computer -- a kind of "information flea market," a "communication system which allows people to make contact with each other on the basis of mutually expressed interest," according to a brochure from the time.

What evolved was a proto-Facebook-Twitter-Yelp-Craigslist-esque database filled with searchable roommate-wanted and for-sale items ads, restaurant recommendations, and, well, status updates, complete with graphics and social commentary.

"This was really one of the very first attempts to give access to computers to ordinary people," says Marc Weber, the founding curator of the Internet History Program at the Computer History Museum in Mountain View, California.

Holy shit, that is a thing of beauty.