# How Twitter figures out the world with machine intelligence and Mechanical Turks

On Twitter's engineering blog, a fascinating description of how Twitter uses a blend of machine intelligence and Mechanical Turk tasks to figure out, in real time, what is going on in the world:

Before we delve into the details, here's an overview of how the system works.

1. First, we monitor for which search queries are currently popular.
Behind the scenes: we run a Storm topology that tracks statistics on search queries.
For example, the query [Big Bird] may suddenly see a spike in searches from the US.

2. As soon as we discover a new popular search query, we send it to our human evaluators, who are asked a variety of questions about the query.
Behind the scenes: when the Storm topology detects that a query has reached sufficient popularity, it connects to a Thrift API that dispatches the query to Amazon's Mechanical Turk service, and then polls Mechanical Turk for a response.
For example: as soon as we notice "Big Bird" spiking, we may ask judges on Mechanical Turk to categorize the query, or provide other information (e.g., whether there are likely to be interesting pictures of the query, or whether the query is about a person or an event) that helps us serve relevant Tweets and ads.

3. Finally, after a response from an evaluator is received, we push the information to our backend systems, so that the next time a user searches for a query, our machine learning models will make use of the additional information. For example, suppose our evaluators tell us that [Big Bird] is related to politics; the next time someone performs this search, we know to surface ads by @barackobama or @mittromney, not ads about Dora the Explorer.
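The first step of the pipeline can be sketched in miniature. This toy spike detector flags a query when its count in the current window jumps well above its rolling baseline; the window size and thresholds are invented, and Twitter's actual Storm topology is of course far more sophisticated:

```python
from collections import defaultdict, deque

class SpikeDetector:
    """Toy sketch of step 1: flag queries whose current-window count
    far exceeds their rolling baseline. All thresholds are invented."""

    def __init__(self, window=5, spike_factor=3.0, min_count=10):
        self.spike_factor = spike_factor
        self.min_count = min_count
        # Per-query history of recent window counts.
        self.history = defaultdict(lambda: deque(maxlen=window))

    def observe(self, query, count):
        """Record this window's count; return True if the query is spiking."""
        hist = self.history[query]
        baseline = sum(hist) / len(hist) if hist else 0.0
        hist.append(count)
        return count >= self.min_count and count > self.spike_factor * max(baseline, 1.0)

detector = SpikeDetector()
for count in [2, 3, 2, 4, 3]:          # quiet baseline for [Big Bird]
    detector.observe("big bird", count)
print(detector.observe("big bird", 40))  # sudden jump → True
```

A spiking query would then be handed off to the Thrift/Mechanical Turk dispatch described in step 2.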

# Probability theory for programmers

Jeremy Kun, a mathematics PhD student at the University of Illinois at Chicago, has posted a wonderful primer on probability theory for programmers on his blog. It's a subject vital to machine learning and data-mining, and it's at the heart of much of the stuff going on with Big Data. His primer is lucid and easy to follow, even for math ignoramuses like me.

For instance, suppose our probability space is $\Omega = \left \{ 1, 2, 3, 4, 5, 6 \right \}$ and $f$ is defined by setting $f(x) = 1/6$ for all $x \in \Omega$ (here the “experiment” is rolling a single die). Then we are likely interested in more exquisite kinds of outcomes; instead of asking the probability that the outcome is 4, we might ask what is the probability that the outcome is even? This event would be the subset $\left \{ 2, 4, 6 \right \}$, and if any of these are the outcome of the experiment, the event is said to occur. In this case we would expect the probability of the die roll being even to be 1/2 (but we have not yet formalized why this is the case).

As a quick exercise, the reader should formulate a two-dice experiment in terms of sets. What would the probability space consist of as a set? What would the probability mass function look like? What are some interesting events one might consider (if playing a game of craps)?
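Kun's exercise is easy to brute-force in code. Here is one sketch of the two-dice space as sets, with a uniform mass function; the event chosen, "the sum is 7," is just one craps-relevant example:

```python
from fractions import Fraction
from itertools import product

# The probability space for two fair dice: all ordered pairs (a, b).
omega = list(product(range(1, 7), repeat=2))          # 36 outcomes
f = {outcome: Fraction(1, 36) for outcome in omega}   # uniform mass function

def prob(event):
    """Probability of an event (a subset of omega): sum the masses."""
    return sum(f[outcome] for outcome in event)

# An event craps players care about: the roll sums to 7.
sum_is_7 = {(a, b) for (a, b) in omega if a + b == 7}
print(prob(sum_is_7))     # 1/6
print(prob(set(omega)))   # 1 (the total mass, as required)
```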

(Image: Dice, a Creative Commons Attribution (2.0) image from artbystevejohnson's photostream)

# Phrases used by corporate fraudsters

The FBI and Ernst & Young have released a list of the top ten phrases that indicate corporate fraud, based on data-mining evidence from real corporate fraud investigations.

In total more than 3,000 terms are logged by the technology, which monitors for conversations within the "fraud triangle", where pressure, rationalisation, and opportunity meet, said the FBI and Ernst & Young...

1. Cover up
2. Write off
3. Illegal
4. Failed investment
5. Nobody will find out
6. Grey area
7. They owe it to me
8. Do not volunteer information
9. Not ethical
10. Off the books

# Inception: a tool for compromising the slumber of computers with full-disk encryption

Inception is a tool for breaking into computers with full-disk encryption. It assumes that you have access to a suspended/screen-locked computer whose disk is encrypted. You access the machine over its FireWire interface (or, if it doesn't have FireWire, you plug a FireWire card into one of its slots, and the machine will automatically fetch, install and configure the drivers, even if it's asleep), and then use the FireWire drivers to directly access system memory, and from there, patch the password-checking routine and walk straight into the computer.

This (and its predecessors, like winlockpwn) is a substantial advance on previous attacks against sleeping full-disk encrypted systems, which involved things like plunging the RAM into a bath of liquid nitrogen. As the author, Carsten Maartmann-Moe, points out, this can't be easily remedied with a FireWire driver update, since FireWire requires direct memory access to effect high-speed transfers.

So, two things: First, shut down your computer when it's not in your possession; second, "Inception" is an inspired name for an attack that breaks into the dreams of a sleeping computer, directly accesses its memory, and causes it to spill its secrets.

Inception’s main mode works as follows: By presenting a Serial Bus Protocol 2 (SBP-2) unit directory to the victim machine over the IEEE1394 FireWire interface, the victim operating system thinks that a SBP-2 device has connected to the FireWire port. Since SBP-2 devices utilize Direct Memory Access (DMA) for fast, large bulk data transfers (e.g., FireWire hard drives and digital camcorders), the victim lowers its shields and enables DMA for the device. The tool now has full read/write access to the lower 4GB of RAM on the victim. Once DMA is granted, the tool proceeds to search through available memory pages for signatures at certain offsets in the operating system’s password authentication modules. Once found, the tool short circuits the code that is triggered if an incorrect password is entered.

An analogy for this operation is planting an idea into the memory of the machine; the idea that every password is correct. In other words, the nerdy equivalent of a memory inception.

After running the tool you should be able to log into the victim machine using any password.
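The search-and-patch step described above can be sketched over a plain byte buffer standing in for DMA-readable RAM. The signature and patch bytes below are invented placeholders, not the real authentication code of any operating system:

```python
# Hypothetical signature: the conditional jump taken on a wrong password.
SIGNATURE = bytes.fromhex("75 08 b8 01")
# Hypothetical patch: NOP out the jump so the "success" path always runs.
PATCH = bytes.fromhex("90 90 b8 01")

def patch_auth(memory: bytearray) -> bool:
    """Scan the buffer for the signature and overwrite it in place,
    so that every password appears correct. Returns True on success."""
    offset = memory.find(SIGNATURE)
    if offset == -1:
        return False
    memory[offset:offset + len(PATCH)] = PATCH
    return True

# A stand-in for the victim's RAM, with the target code somewhere inside.
ram = bytearray(b"\x00" * 64 + SIGNATURE + b"\x00" * 64)
print(patch_auth(ram))                   # True
print(SIGNATURE in ram, PATCH in ram)    # False True
```

The real tool does this over FireWire DMA against the lower 4GB of physical memory, matching signatures at known offsets within the OS's authentication modules.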

Inception (via JWZ)

# Community Memory: a social media terminal from 1973

Wired's gallery of the paleolithic antecedents of today's social media technologies is a bit mismatched (some really interesting insights into today's media lineage, but mixed with some silliness), but the lead item, the Community Memory terminal from 1973, is pure gold. I wrote half an unsuccessful novel about this thing when I was about 25, and it's never stopped haunting me.

Three decades before Yelp and Craigslist, there was the Community Memory Terminal.

In the early 1970s, Efrem Lipkin, Mark Szpakowski and Lee Felsenstein set up a series of these terminals around San Francisco and Berkeley, providing access to an electronic bulletin board housed by an XDS-940 mainframe computer.

This started out as a social experiment to see if people would be willing to share via computer -- a kind of "information flea market," a "communication system which allows people to make contact with each other on the basis of mutually expressed interest," according to a brochure from the time.

What evolved was a proto-Facebook-Twitter-Yelp-Craigslist-esque database filled with searchable roommate-wanted and for-sale items ads, restaurant recommendations, and, well, status updates, complete with graphics and social commentary.

"This was really one of the very first attempts to give access to computers to ordinary people," says Marc Weber, the founding curator of the Internet History Program at the Computer History Museum in Mountain View, California.

Holy shit, that is a thing of beauty.

# Machine-learning algorithm develops heuristics for trustworthy tweets in time of emergency

In "Credibility ranking of tweets during high impact events," a paper published in the ACM's Proceedings of the 1st Workshop on Privacy and Security in Online Social Media, two Indraprastha Institute of Information Technology researchers describe the outcome of a machine-learning experiment that was asked to discover factors correlated with reliability in tweets during disasters and emergencies:

The number of unique characters present in tweet was positively correlated to credibility, this may be due to the fact that tweets with hashtags, @mentions and URLs contain more unique characters. Such tweets are also more informative and linked, and hence credible. Presence of swear words in tweets indicates that it contains the opinion / reaction of the user and would have less chances of providing information about the event. Tweets that contain information or are reporting facts about the event, are impersonal in nature, as a result we get a negative correlation of presence of pronouns in credible tweets. Low number of happy emoticons [:-), :)] and high number of sad emoticons [:-(, :(] act as strong predictors of credibility. Some of the other important features (p-value < 0.01) were inclusion of a URL in the tweet, number of followers of the user who tweeted and presence of negative emotion words. Inclusion of URL in a tweet showed a strong positive correlation with credibility, as most URLs refer to pictures, videos, resources related to the event or news articles about the event.
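The surface features the researchers describe are cheap to compute. This sketch extracts a few of them from a tweet; the word lists, regexes, and feature names are illustrative, not the paper's actual feature set or weights:

```python
import re

# Placeholder lexicons -- the paper's real lists are much larger.
SWEAR_WORDS = {"damn", "hell"}
PRONOUNS = {"i", "me", "my", "we", "you"}

def features(tweet: str) -> dict:
    """A handful of the credibility-correlated surface features
    described in the excerpt above."""
    words = re.findall(r"[a-z']+", tweet.lower())
    return {
        "unique_chars": len(set(tweet)),
        "has_url": bool(re.search(r"https?://\S+", tweet)),
        "n_hashtags": tweet.count("#"),
        "n_mentions": tweet.count("@"),
        "n_pronouns": sum(w in PRONOUNS for w in words),
        "n_swears": sum(w in SWEAR_WORDS for w in words),
        "happy_emoticons": len(re.findall(r":-?\)", tweet)),
        "sad_emoticons": len(re.findall(r":-?\(", tweet)),
    }

print(features("Bridge closed, detour map: http://example.com #flood :("))
```

A classifier trained on labeled tweets would then weight these features; per the excerpt, URLs and unique characters push toward "credible," pronouns and swearing push away.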

Of course, this is all non-adversarial: no one is trying to trick a filter into mis-assessing a false account as a true one. It's easy to imagine an adversarial tweet-generator that suggests rewrites to deliberately misleading tweets to make them more credible to a filter designed on these lines. This is actually the substance of one of the cleverest science fiction subplots I've read: in Peter Watts's Behemoth, a self-modifying computer virus randomly hits on the strategy of impersonating communications from patient zero in a world-killing pandemic, because all the filters allow these through. It's a premise that's never stopped haunting me: the co-evolution of a human virus and a computer virus.

# Radio Shack computer catalog from 1983

On the Internet Archive, a hi-rez scan of the 1983 Radio Shack computer catalog, which is a wonderland of jaw-dropping prices for prosumer equipment from my boyhood that doesn't even qualify as a toddler's toy today. I will always retain a fondness for acoustic couplers, though, as they were the way I first connected to a computer, running a screenless teletype terminal connected to my Dad's university PDP by means of one of these suction-cup wonders. There was something, I dunno, legible about being able to see how the Bell handset fit into that cradle, to hear the barely audible tinny whine of the characters crawling over the wire. It was like being able to watch nerve impulses travel from a brain to a distant limb.

But $995? Please.

# Flying malware: the Virus Copter

At the latest San Francisco Drone Olympics (now called DroneGames, thanks, no doubt, to awful bullying from the organized crime syndicate known as the International Olympic Committee), there were many fascinating entries, but the champion was James "substack" Halliday's Virus-Copter (github), which made wireless contact with its competitors, infected them with viruses that put them under its control, sent them off to infect the rest of the cohort, and then caused them to "run amok." Many people have written to point out that Virus-Copter shares some DNA with one of the plot elements in my novel Pirate Cinema, but I assure you the resemblance is entirely coincidental. Drones, after all, are stranger than technothrillers. Here's the $300 drone the competitors were flying.

* node cross-compiled for the ARM chips running on the drones
* felixge's ar-drone module
* some iwconfig/iwlist wrappers in lib/iw.js
* open wireless networks in nodes.json (gathered by the deployment computer)

# Crypto and Bletchley Park podcast from BBC's Infinite Monkey Cage

BBC Radio 4's great math and science show "The Infinite Monkey Cage" did a great (and very funny) episode on crypto and Bletchley Park, with Robin Ince, Brian Cox, Dave Gorman, Simon Singh and Dr Sue Black.

(via Schneier)

# Amazing, invisible work that goes on when you click an HTTPS link

Jeff Moser has a clear, fascinating enumeration of all the incredible math stuff that happens between a server and your browser when you click on an HTTPS link and open a secure connection to a remote end. It's one of the most important (and least understood) parts of the technical functioning of the Internet.

People sometimes wonder if math has any relevance to programming. Certificates give a very practical example of applied math. Amazon's certificate tells us that we should use the RSA algorithm to check the signature. RSA was created in the 1970s by MIT professors Ron *R*ivest, Adi *S*hamir, and Len *A*dleman, who found a clever way to combine ideas spanning 2000 years of math development to come up with a beautifully simple algorithm:

You pick two huge prime numbers "p" and "q." Multiply them to get "n = p*q." Next, you pick a small public exponent "e" which is the "encryption exponent" and a specially crafted inverse of "e" called "d" as the "decryption exponent." You then make "n" and "e" public and keep "d" as secret as you possibly can and then throw away "p" and "q" (or keep them as secret as "d"). It's really important to remember that "e" and "d" are inverses of each other.

Now, if you have some message, you just need to interpret its bytes as a number "M." If you want to "encrypt" a message to create a "ciphertext", you'd calculate:

C ≡ M^e (mod n)

This means that you multiply "M" by itself "e" times. The "mod n" means that we only take the remainder (e.g. "modulus") when dividing by "n." For example, 11 AM + 3 hours ≡ 2 (PM) (mod 12 hours). The recipient knows "d" which allows them to invert the message to recover the original message:

C^d ≡ (M^e)^d ≡ M^(e·d) ≡ M^1 ≡ M (mod n)
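Moser's recipe can be run end-to-end with toy numbers. Real keys use primes hundreds of digits long; the textbook primes 61 and 53 below are purely for illustration:

```python
# Toy RSA keypair following the recipe above.
p, q = 61, 53
n = p * q                  # 3233, made public
phi = (p - 1) * (q - 1)    # 3120, used to relate e and d
e = 17                     # small public "encryption exponent"
d = pow(e, -1, phi)        # private "decryption exponent": e*d ≡ 1 (mod phi)

M = 65                     # the message, interpreted as a number < n
C = pow(M, e, n)           # encrypt: C ≡ M^e (mod n)
assert pow(C, d, n) == M   # decrypt: C^d ≡ M (mod n)
print(n, e, d, C)
```

The three-argument `pow` does the modular exponentiation efficiently, and `pow(e, -1, phi)` (Python 3.8+) computes the modular inverse that makes "e" and "d" inverses of each other.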

# Cracking passwords with 25 GPUs

Security Ledger reports on a breakthrough in password-cracking, using 25 graphics cards in parallel to churn through astounding quantities of password possibilities in unheard-of timescales. It's truly the end of the line for passwords protected by older hashing algorithms, illustrates neatly how yesterday's "password that would take millions of years to break" is this year's "password broken in an afternoon," and has profound implications for the sort of password hash-dumps we've seen in the past two years.

A presentation at the Passwords^12 Conference in Oslo, Norway (slides available here), has moved the goalposts, again. Speaking on Monday, researcher Jeremi Gosney (a.k.a epixoip) demonstrated a rig that leveraged the Open Computing Language (OpenCL) framework and a technology known as Virtual Open Cluster (VCL) to run the HashCat password cracking program across a cluster of five 4U servers equipped with 25 AMD Radeon GPUs and communicating at 10 Gbps and 20 Gbps over Infiniband switched fabric.

Gosney’s system elevates password cracking to the next level, and effectively renders even the strongest passwords protected with weaker encryption algorithms, like Microsoft’s LM and NTLM, obsolete.

In a test, the researcher’s system was able to churn through 348 billion NTLM password hashes per second. That renders even the most secure password vulnerable to compute-intensive brute force and wordlist (or dictionary) attacks. A 14 character Windows XP password hashed using NTLM (NT Lan Manager), for example, would fall in just six minutes, said Per Thorsheim, organizer of the Passwords^12 Conference.
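The quoted rate makes for sobering arithmetic. This back-of-the-envelope sketch assumes a uniformly random password drawn from the 95 printable ASCII symbols and the article's figure of 348 billion hashes per second:

```python
RATE = 348e9  # NTLM hashes per second, from the article

def crack_time_seconds(symbols: int, length: int) -> float:
    """Worst-case exhaustive-search time for a random password."""
    return symbols ** length / RATE

# 95 printable ASCII symbols: 8 chars falls in about 5.3 hours,
# while 12 chars still takes on the order of 49,000 years.
for length in (8, 10, 12):
    hours = crack_time_seconds(95, length) / 3600
    print(f"{length} chars: {hours:,.1f} hours")
```

The point of the story, though, is that the "millions of years" column keeps shrinking as hardware improves, which is why fast unsalted hashes like NTLM are obsolete regardless of password length.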

# Computer classes should teach regular expressions to kids

My latest Guardian column is "Here's what ICT should really teach kids: how to do regular expressions," and it makes the case for including regular expressions in foundational IT and computer science courses. Regexps offer incredible power to normal people in their normal computing tasks, and we treat them as deep comp-sci, instead of something everyone should learn alongside typing.

I think that technical people underestimate how useful regexps are for "normal" people, whether a receptionist laboriously copy-pasting all the surnames from a word-processor document into a spreadsheet, a school administrator trying to import an old set of school records into a new system, or a mechanic hunting through a parts list for specific numbers.

The reason technical people forget this is that once you know regexps, they become second nature. Any search that involves more than a few criteria is almost certainly easier to put into a regexp, even if your recollection of the specifics is fuzzy enough that you need to quickly look up some syntax online.
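For instance, the receptionist's and mechanic's tasks from above each reduce to a one-liner; the data and the part-number pattern are illustrative:

```python
import re

# The receptionist's task: pull surnames out of a pasted name list.
names = """Watson, Mary-Anne
O'Brien, Patrick
de la Cruz, Juan"""

surnames = [re.match(r"([^,]+),", line).group(1) for line in names.splitlines()]
print(surnames)   # ['Watson', "O'Brien", 'de la Cruz']

# The mechanic's task: find part numbers shaped like AB-1234 in free text.
parts = re.findall(r"\b[A-Z]{2}-\d{4}\b",
                   "Replace gasket KT-9921; torque to spec, order AB-1234.")
print(parts)      # ['KT-9921', 'AB-1234']
```

Neither requires anything deeper than "everything before the comma" and "two capital letters, a dash, four digits," which is exactly the level of regexp fluency the column argues everyone should have.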

# Gigapixel images of Charles Babbage's Difference Engine #2

Greg sez, "This project is using a number of computational photography techniques to document Charles Babbage's 'Difference Engine No 2' for the Computer History Museum in Mountain View. There are interactive gigapixel images for the four cardinal views of the device available to view."

Babbage Difference Engine in Gigapixel (Thanks, Greg!)

# Orientation video for Bell Labs programmers, 1973

Here's a 1973 orientation video from Bell Labs' Holmdel Computer Center, to get new, budding Unix hackers acquainted with all the different apparatus available to them, and also to let them know which counter to visit to get a different tape loaded onto one of the IBM mainframes.

The Holmdel Computer Center, Part 1 (Thanks, Dan!)

# The many stages of writing a research paper

Timothy Weninger recently submitted a research paper to a computer science conference called World Wide Web. Along the way, he went through 463 drafts. Bear in mind, the paper has only been submitted, not yet accepted, so there are probably more edits still to come. Welcome to the life of a scientist.

In this video, Weninger created a timelapse showing all the different stages of his writing process, as he added graphs and went through cycles of expanding, contracting, and expanding the text. But mostly expanding. The paper grows from two pages to 10 by the end of the video.