Alex sez, "Algoraves are parties where people come together to dance to algorithms. It generally involves some live coding but any producers making music 'wholly or predominantly characterised by the emission of a succession of repetitive conditionals' are welcome. Generally some aspect of the algorithmic process is visible, but the focus is actually on the audience, and having serious fun. We've had a few parties across the UK and Germany, and are spreading further afield in Mexico and Australia. The concept is still developing though, and is being defined by whoever turns up."
Here's the video of "It's not a fax machine connected to a waffle iron," the talk I gave at the Re:publica conference in Berlin this week: "Lawmakers treat the Internet like it's Telephone 2.0, the Second Coming of Video on Demand, or the World's Number One Porn Distribution Service, but it's really the nervous system of the 21st Century. Unless we stop the trend toward depraved indifference in Internet law, making – and freedom – will die."
With Scratch, you can program your own interactive stories, games, and animations — and share your creations with others in the online community.
Scratch helps young people learn to think creatively, reason systematically, and work collaboratively — essential skills for life in the 21st century.
Wagner James Au sez, "OpenWorm, as the name suggests, is a collaborative open source project to computationally create a simple artificial life form -- an earth worm -- from the cellular level to a point where it's sophisticated enough to solve basic problems. They're still in early stages, with the latest demo, a developer on the project tells me, being 'a particle simulation of five connected muscle segments moving together through a body of water.'"
Given a standard Tetris engine (which drops pieces in a pseudorandom order, has previews, and allows holding), this method will allow you to play Tetris forever. As always, the most fascinating thing about this is the specialized vocabulary used to describe the method:
Worst case bag distributions such as H?XX?X? and H?XXX?? deserve a special mention. The first piece 'H' denotes a piece which must be placed in Hold in order to follow the STZ loop procedure. Pieces from the LJO loop are denoted by '?', and the remaining pieces are denoted by 'X'. Using 3 previews and Hold, it is only possible to see the first 4 pieces of the bag before the second piece enters the screen. This means you only see H?XX, and only know the first piece of the LJO loop. Because H must be put in Hold, you are forced to make a decision without knowing the order of the rest of the LJO loop. If the O comes first, you can follow the procedure above without problems. The rest of the time you will run into complications like this:
A 2010 post from Hal Pomeranz suggests a great way to teach TCP/IP header structure to students: he builds header diagrams out of Lego bricks, then mixes them up and has the students reconstruct them.
The use of color here really highlights certain portions of the packet header. For example, the source and destination addresses and ports really jump out. But there are some other, more subtle color patterns that I worked in here. For example, if you look closely you’ll see that I matched the color of the ACK bit with the blue in the ACK number field. Similarly the colors of the SYN bit and the sequence number match, as do the URG bit and urgent pointer field.
Actually I wish I had a couple more colors available. Yes, Lego comes in dozens of colors these days, but they only make 2×8 blocks (aka one “Lego Byte”) in six colors: White, Black, Red, Yellow, Blue, and Beige.
So while I tried to use Beige exclusively for size fields, Red for reserved bits, Yellow for checksums, and so on, I ultimately ended up having to use these colors for other fields as well– for example, the yellow sequence number fields in the TCP header. Maybe I should have just bought a bunch of “nibbles” (2×4 blocks) in other colors and not been so choosy about using full “Lego Bytes”.
Since 2010, the lego patent has expired and cheapish wire-extrusion 3D printing has become a reality -- and there's cool procedural models for generating arbitrary-sized bricks and labelling them with arbitrary type. Someone needs to make a printable TCP diagramming set on Thingiverse!
There's precious little info available about Mizirk's "Boob Tracker," a computer vision project (based on a Kinect?) that automatically detects boob-like objects and masks them with user-selectable bitmaps, following them as they move around the field of view. Mizirk's total delight in the performance of this little confection is what makes it.
(Thanks, Fipi Lele!)
Paul sez, "This past semester, three engineering grad students at the University of Toronto (myself and two others) created an Android app for a course project that allows for wireless and intuitive control of a robotic arm from an Android-powered smartphone. We're pretty proud of the results (the link is to a demo we put together) and have released the code open source."
Android Robotic Manipulator Demo (Thanks, Paul!)
Thearn released a free/open program for detecting and monitoring your pulse using your webcam. The code is on github for you to download, play with and modify. If this stuff takes your fancy, be sure and read Eulerian Video Magnification for Revealing Subtle Changes in the World, an inspiring paper describing the techniques Thearn uses in his code:
This application uses openCV (http://opencv.org/) to find the location of the user's face, then isolate the forehead region. Data is collected from this location over time to estimate the user's heartbeat frequency. This is done by measuring average optical intensity in the forehead location, in the subimage's green channel alone. Physiological data can be estimated this way thanks to the optical absorption characteristics of oxygenated hemoglobin.
With good lighting and minimal noise due to motion, a stable heartbeat should be isolated in about 15 seconds. Other physiological waveforms, such as Mayer waves (http://en.wikipedia.org/wiki/Mayer_waves), should also be visible in the raw data stream.
Once the user's pulse signal has been isolated, temporal phase variation associated with the detected heartbeat frequency is also computed. This allows for the heartbeat frequency to be exaggerated in the post-process frame rendering, causing the highlighted forehead location to pulse in sync with the user's own heartbeat (in real time).
Support for pulse-detection on multiple simultaneous people in a camera's image stream is definitely possible, but at the moment only the information from one face is extracted for cardiac analysis.
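The frequency-isolation step can be sketched without a camera at all: given a time series of mean green-channel intensities, the dominant peak of its spectrum within the plausible human heart-rate band gives the pulse estimate. This is a simplified, hypothetical sketch using only NumPy; the real project adds OpenCV face detection to locate the forehead region that produces these samples.

```python
import numpy as np

def estimate_bpm(green_means, fps):
    """Estimate heart rate from mean green-channel intensity over time.

    In the real application the samples come from the forehead region
    located by OpenCV; here the time series is taken as given.
    """
    samples = np.asarray(green_means, dtype=float)
    samples = samples - samples.mean()              # remove the DC component
    spectrum = np.abs(np.fft.rfft(samples))
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / fps)
    # Restrict to plausible human heart rates (40-180 bpm).
    band = (freqs >= 40 / 60.0) & (freqs <= 180 / 60.0)
    peak_hz = freqs[band][np.argmax(spectrum[band])]
    return peak_hz * 60.0                           # Hz -> beats per minute

# Synthetic check: a 1.2 Hz (72 bpm) oscillation plus a little noise.
rng = np.random.default_rng(0)
fps = 30.0
t = np.arange(0, 15, 1.0 / fps)                     # 15 seconds of "video"
green = np.sin(2 * np.pi * 1.2 * t) + 0.1 * rng.standard_normal(len(t))
bpm = estimate_bpm(green, fps)
```

With clean input like this the estimate lands on 72 bpm, which matches the README's claim that a stable heartbeat can be isolated in about 15 seconds of good footage.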
Dr. Tom Murphy VII gave a research paper called "The First Level of Super Mario Bros. is Easy with Lexicographic Orderings and Time Travel . . . after that it gets a little tricky," (PDF) (source code) at SIGBOVIK 2013, in which he sets out a computational method for solving classic NES games. He devised two libraries for this: learnfun (learning function) and playfun (playing function). In this accompanying video, he chronicles the steps and missteps he took getting to a pretty clever destination.
In Wired, Steven Levy has a long profile of the fascinating field of algorithmic news-story generation. Levy focuses on Narrative Science, and its competitor Automated Insights, and discusses how the companies can turn "data rich" streams into credible news-stories whose style can be presented as anything from sarcastic blogger to dry market analyst. Narrative Science's cofounder, Kristian Hammond, claims that 90 percent of all news will soon be algorithmically generated, but that this won't be due to computers stealing journalists' jobs -- rather, it will be because automation will enable the creation of whole classes of news stories that don't exist today, such as detailed, breezy accounts of every little league game in the country.
Narrative Science’s writing engine requires several steps. First, it must amass high-quality data. That’s why finance and sports are such natural subjects: Both involve the fluctuations of numbers—earnings per share, stock swings, ERAs, RBI. And stats geeks are always creating new data that can enrich a story. Baseball fans, for instance, have created models that calculate the odds of a team’s victory in every situation as the game progresses. So if something happens during one at-bat that suddenly changes the odds of victory from say, 40 percent to 60 percent, the algorithm can be programmed to highlight that pivotal play as the most dramatic moment of the game thus far. Then the algorithms must fit that data into some broader understanding of the subject matter. (For instance, they must know that the team with the highest number of “runs” is declared the winner of a baseball game.) So Narrative Science’s engineers program a set of rules that govern each subject, be it corporate earnings or a sporting event. But how to turn that analysis into prose? The company has hired a team of “meta-writers,” trained journalists who have built a set of templates. They work with the engineers to coach the computers to identify various “angles” from the data. Who won the game? Was it a come-from-behind victory or a blowout? Did one player have a fantastic day at the plate? The algorithm considers context and information from other databases as well: Did a losing streak end?
Then comes the structure. Most news stories, particularly about subjects like sports or finance, hew to a pretty predictable formula, and so it’s a relatively simple matter for the meta-writers to create a framework for the articles. To construct sentences, the algorithms use vocabulary compiled by the meta-writers. (For baseball, the meta-writers seem to have relied heavily on famed early-20th-century sports columnist Ring Lardner. People are always whacking home runs, swiping bags, tallying runs, and stepping up to the dish.) The company calls its finished product “the narrative.”
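The "most dramatic moment" heuristic described above is easy to sketch: scan the win-probability series for the largest single-play swing. This is a toy illustration of the idea, not Narrative Science's actual code.

```python
def most_dramatic_play(win_probs):
    """Return the index of the play with the largest swing in win probability.

    win_probs[i] is the home team's chance of victory after play i.
    """
    swings = [abs(after - before)
              for before, after in zip(win_probs, win_probs[1:])]
    biggest = max(range(len(swings)), key=swings.__getitem__)
    return biggest + 1  # index of the play that caused the swing

# A 40% -> 60% jump at play 3 dwarfs the other moves,
# so that's the at-bat the story would highlight.
probs = [0.50, 0.48, 0.40, 0.60, 0.58]
pivotal = most_dramatic_play(probs)
```

A real system would fold in context from other databases (streaks, standings, player histories) before deciding which angle to write around; this only finds the numeric pivot.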
Both companies claim that they'll be able to make sense of less-quantifiable subjects in the future, and will be able to generate stories about them, too.
This morning, I posted M Tang's funny experiment in feeding the Unix "yes" command to itself. Now, Seth David Schoen writes in to correct and expand upon the principles therein:
M. Tang's business about the Unix command
yes `yes no`
is based on a bit of a misconception. The problem is _not_ about combining one yes command with another yes command. Whenever you use the backtick syntax `, like in a hypothetical command

foo `bar`
the shell will first run the command bar (to completion) before it even tries to start foo. The shell will also save the complete output of bar in memory, and then present it as a set of command-line arguments to foo.
In this case, the shell is trying to run the command "yes no" to completion, saving its output in memory, before even starting the other yes command. Of course, "yes no" never finishes, but it does use up an arbitrarily large amount of memory.
To see that the problem is with the use of `yes` rather than with the combination of two yes commands, just try
echo `yes no`
true `yes no`
Both of these forms have exactly the same memory-consumption problem as the original command, and for exactly the same reason! So, Tang is wrong to think that he is somehow creating a problem by combining multiple yesses. The problem is in asking the shell to remember an infinite amount of output.
As other people have mentioned in comments, the ` syntax is also not piping. Piping is done with |, while ` refers to substitution. The distinction is whether the output of program A appears as input to program B (piping) or as command-line arguments to program B (substitution). For example,
echo foo bar | wc -w
outputs the number 2 (that's the total number of words in the text "foo bar"), while
wc -w `echo foo bar`
counts the number of words in the files foo and bar.
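The same distinction shows up with any command that terminates, which makes it safe to try at home (a bounded analogue of the `yes` example, using `seq` instead):

```shell
# Substitution: the shell runs `seq 3` to completion, stores its output,
# and hands it to echo as three command-line arguments.
echo `seq 3`      # prints: 1 2 3

# Piping: seq's output streams to wc as stdin, line by line,
# with nothing accumulated by the shell.
seq 3 | wc -l     # prints: 3
```

Swap `seq 3` for `yes no` and the substitution form never gets past step one: the shell waits forever, buffering output, which is exactly the memory blowup Schoen describes.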
Update: M Tang's explanation for this is wrong, but Seth Schoen sent us a great correction.
There's a GNU-coreutils program called yes whose function is to "output a string repeatedly until killed." M Tang tried piping the output of one yes command into another. It ended badly:
Taking a look at the source code for yes, it looks like the single argument is being stored in a char array; then, in a while(true) and for loop, each character is printed to stdout, followed by a newline (\n) character.
So when we use the output of one yes command as the argument for another, the outer yes command fills up the computer’s memory with the output of the inner yes command. Then I have to restart my computer and feel stupid.
Michael B. Morgan, CEO of Morgan & Claypool Publishers, writes:
In 2009, we invited Aaron Swartz to contribute a short work to our series on Web Engineering (now The Semantic Web: Theory and Technology). He produced a draft of about 40 pages -- a "first version" to be extended later -- which unfortunately never happened.
After his death in January, we decided (with his family's blessing) that it would be a good idea to publish this work so people could read his ideas about programming the Web, his ambivalence about different aspects of Semantic Web technology, his thoughts on Openness, and more.
As a tribute to Aaron, we have posted his work on our site as a free PDF download. It is licensed under a Creative Commons (CC-BY-SA-NC) license. The work stands as originally written, with only a few typographical errors corrected to improve readability.
Big Data is a new book from Viktor Mayer-Schonberger, a respected Internet governance theorist; and Kenneth Cukier, a long-time technology journalist who has been at the Economist for many years. As the title and pedigree imply, this is a business-oriented book about "Big Data," a computational approach to business, regulation, science and entertainment that uses data-mining applied to massive, Internet-connected data-sets to learn things that previous generations weren't able to see because their data was too thin and diffuse.
Big Data is an eminently practical and sensible book, but it's also an exciting and excitable text, one that conveys enormous enthusiasm for the field and its fruits. The authors use well-chosen examples to show how everything from shipping logistics to video-game design to healthcare stand to benefit from studying the whole data-set, rather than random samples. They even pose this as a simple way of thinking of big data versus "small data." Small data relies on statistical sampling, and emphasises the reliability and accuracy of each measurement. With big data, you sample the entire pool of activities -- all the books sold, all the operations performed -- and worry less about inaccuracies and anomalies in individual measurements, because these are drowned out by the huge numbers of observations performed.
As you'd expect, Big Data is particularly fascinating when it explores the business implications of all this: the changing leverage between firms that own data versus the firms that know how to make sense of it, and why sometimes data is best processed by unaffiliated third parties who can examine data from rival firms and find out things from which all parties stand to benefit, but which none of them could have discovered on their own. They also cover some of the bigger Big Data business blunders through history -- companies whose culture blinkered them to the opportunities in their data, which were exploited by clever rivals.
While Big Data is an excellent primer on the opportunities of the field, it's thin on the risks, overall. For example, Big Data is rightly fascinated with stories about how we can look at data sets and find predictors of consequential things: for example, when Google mined its query-history and compared it with CDC data on flu outbreaks, it found that it could predict flu outbreaks ahead of the CDC, which is amazingly useful. However, all those search-strings were entered by people who didn't expect to have them mined for subsequent action. If searching for "scratchy throat" and "runny nose" gets your neighborhood quarantined (or gets it extra healthcare dollars), you might get all your friends to search on those terms over and over -- or not at all. Google knows this -- or it should -- because when it started measuring the number of links between sites to define the latent authority of different parts of the Internet, it got great results, but immediately triggered a whole scummy ecosystem of linkfarms and other SEO tricks that create links whose purpose is to produce more of the indicators Google is searching for.
Another important subject is looking at algorithmic prediction in domains where the outcome is punishment, instead of reward. British Airways may get great results from using an algorithm to pick out passengers for upgrades, trying to find potential frequent fliers. But we should be very cautious about applying the same algorithm to building the TSA's No-Fly list. If BA's algorithm fails 20% of the time, it just means that a few lucky people get to ride up front of the plane. If the TSA has a 20% failure rate, it means that one in five "potential terrorists" is an innocent whose fundamental travel rights have been compromised by a secretive and unaccountable algorithm.
Secrecy and accountability are the third important area for examination in a Big Data world. Cukier and Mayer-Schonberger propose a kind of inspector-general for algorithms who'll make sure they're not corrupted to punish the undeserving or line someone's pockets unjustly. But they also talk about the fact that these algorithms are likely to be illegible -- the product of a continuously evolving machine-learning system -- and that no one will be able to tell you why a certain person was denied credit, refused insurance, kept out of a university, or blackballed for a choice job. And when you get into a world where you can't distinguish between an algorithm that gets it wrong because the math is unreliable (a "fair" wrong outcome) from an algorithm that gets it wrong because its creators set out to punish the innocent or enrich the undeserving, then we can't and won't have justice. We know that computers make mistakes, but when we combine the understandable enthusiasm for Big Data's remarkable, counterintuitive recommendations with the mysterious and oracular nature of the algorithms that produce those conclusions, then we're taking on a huge risk when we put these algorithms in charge of anything that matters.
You may have heard that Amazon is selling a "KEEP CALM AND RAPE A LOT" t-shirt. How did such a thing come to pass? Well, as Pete Ashton explains, this is a weird outcome of an automated algorithm that just tries random variations on "KEEP CALM AND," offering them for sale in Amazon's third-party marketplace and printing them on demand if any of them manage to find a buyer.
The t-shirts are created by an algorithm. The word “algorithm” is a little scary to some people because they don’t know what it means. It’s basically a process automated by a computer programme, sometimes simple, sometimes complex as hell. Amazon’s recommendations are powered by an algorithm. They look at what you’ve been browsing and buying, find patterns in that behaviour and show you things the algorithm thinks you might like to buy. Amazon’s algorithms are very complex and powerful, which is why they work. The algorithm that creates these t-shirts is not complex or powerful. This is how I expect it works:
1) Start a sentence with the words KEEP CALM AND.
2) Pick a word from this long list of verbs. Any word will do. Don’t worry, I’m sure they’re all fine.
3) Finish the sentence with one of the following: OFF, THEM, IF, THEM or US.
4) Lay these words out in the classic Keep Calm style.
5) Create a mockup jpeg of a t-shirt.
6) Submit the design to Amazon using our boilerplate t-shirt description.
7) Go back to 1 and start again.
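Minus the image rendering and the Amazon submission, Ashton's seven steps fit in a few lines. The verb and ending lists here are abbreviated stand-ins, not Solid Gold Bomb's actual word lists:

```python
import itertools

VERBS = ["CHOOSE", "GRAB", "KNIT", "PET"]    # stand-in for the long verb list
ENDINGS = ["OFF", "THEM", "IF", "US"]        # per Ashton's step 3

def keep_calm_slogans():
    """Enumerate every KEEP CALM AND <verb> <ending> combination."""
    for verb, ending in itertools.product(VERBS, ENDINGS):
        yield f"KEEP CALM AND {verb} {ending}"

slogans = list(keep_calm_slogans())
```

Four verbs and four endings already yield sixteen "designs"; with a dictionary-sized verb list the half-million Amazon listings follow immediately, no human ever reading a single one.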
There are currently 529,493 Solid Gold Bomb clothing items on Amazon. Assuming they survive this and don’t get shitcanned by Amazon I wouldn’t be at all surprised if they top a million in a few months.
It costs nothing to create the design, nothing to submit it to Amazon and nothing for Amazon to host the product. If no-one buys it then the total cost of the experiment is effectively zero. But if the algorithm stumbles upon something special, something that is both unique and funny and actually sells, then everyone makes money.
Commit Logs From Last Night: highlights funny, profane source-code commit-messages from GitHub, as bedraggled hackers find themselves leaving notes documenting their desperate situations. Some recent ones:
WHY THE GODDAMMIT WHY WHY WHY HAROGIHAROGIAHRGOIA FUCK ME
render testing I DREW SOME LINES! reverted render panel to grew (white looks shit)
Merge pull request #15 from ruvetia/font_awesome_is_fucking_awesome include font-awesome into the projcet
Johns Hopkins computer science prof Peter Fröhlich grades his students' tests on a curve -- the top-scoring student gets an A, and the rest of the students are graded relative to that brainiac. But last term, his students came up with an ingenious, cooperative solution to this system: they all boycotted the test, meaning that they all scored zero, and that zero was the top score, and so they all got As. The prof was surprisingly cool about it:
Fröhlich took a surprisingly philosophical view of his students' machinations, crediting their collaborative spirit. "The students learned that by coming together, they can achieve something that individually they could never have done," he said via e-mail. “At a school that is known (perhaps unjustly) for competitiveness I didn't expect that reaching such an agreement was possible.”
The story of the boycott is a sterling example of how computer networks solve collective action problems -- the students solved a prisoner's dilemma in a mutually optimal way without having to iterate, which is impressive:
“The students refused to come into the room and take the exam, so we sat there for a while: me on the inside, they on the outside,” Fröhlich said. “After about 20-30 minutes I would give up.... Then we all left.” The students waited outside the rooms to make sure that others honored the boycott, and were poised to go in if someone had. No one did, though.
Andrew Kelly, a student in Fröhlich’s Introduction to Programming class who was one of the boycott’s key organizers, explained the logic of the students' decision via e-mail: "Handing out 0's to your classmates will not improve your performance in this course," Kelly said.
"So if you can walk in with 100 percent confidence of answering every question correctly, then your payoff would be the same for either decision. Just consider the impact on your other exam performances if you studied for [the final] at the level required to guarantee yourself 100. Otherwise, it's best to work with your colleagues to ensure a 100 for all and a very pleasant start to the holidays."
Fröhlich has changed the grading system -- but he's also now offering the students a final project instead of a final exam, should they choose.
Dangerous Curves [Zack Budryk/Inside Higher Ed]
Here's a must-read story from Tech Review about the thriving trade in "zero-day exploits" -- critical software bugs that are sold off to military contractors to be integrated into offensive malware, rather than reported to the manufacturer for repair. The stuff built with zero-days -- network appliances that can snoop on a whole country, even supposedly secure conversations; viruses that can hijack the camera and microphone on your phone or laptop; and more -- are the modern equivalent of landmines and cluster bombs: antipersonnel weapons that end up in the hands of criminals, thugs and dictators who use them to figure out whom to arrest, torture, and murder. The US government is encouraging this market by participating actively in it, even as it makes a lot of noise about "cyber-defense."
Exploits for mobile operating systems are particularly valued, says Soghoian, because unlike desktop computers, mobile systems are rarely updated. Apple sends updates to iPhone software a few times a year, meaning that a given flaw could be exploited for a long time. Sometimes the discoverer of a zero-day vulnerability receives a monthly payment as long as a flaw remains undiscovered. “As long as Apple or Microsoft has not fixed it you get paid,” says Soghoian.
No law directly regulates the sale of zero-days in the United States or elsewhere, so some traders pursue it quite openly. A Bangkok, Thailand-based security researcher who goes by the name “the Grugq” has spoken to the press about negotiating deals worth hundreds of thousands of dollars with government buyers from the United States and western Europe. In a discussion on Twitter last month, in which he was called an “arms dealer,” he tweeted that “exploits are not weapons,” and said that “an exploit is a component of a toolchain … the team that produces & maintains the toolchain is the weapon.”
The Grugq contacted MIT Technology Review to state that he has made no “public statement about exploit sales since the Forbes article.”
Some small companies are similarly up-front about their involvement in the trade. The French security company VUPEN states on its website that it “provides government-grade exploits specifically designed for the Intelligence community and national security agencies to help them achieve their offensive cyber security and lawful intercept missions.” Last year, employees of the company publicly demonstrated a zero-day flaw that compromised Google’s Chrome browser, but they turned down Google’s offer of a $60,000 reward if they would share how it worked. What happened to the exploit is unknown.
Welcome to the Malware-Industrial Complex [Tom Simonite/MIT Technology Review]
(via O'Reilly Radar)
On Coinheist.com, a crossword puzzle you solve by interpreting regular expressions.
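Checking a candidate answer against a clue in a puzzle like this is just whole-string regex matching. The clue and row below are made up for illustration, not taken from the actual puzzle:

```python
import re

def satisfies(clue, row):
    """True if the entire grid row matches the regex clue."""
    return re.fullmatch(clue, row) is not None

# A hypothetical clue in the puzzle's spirit: one or more Hs, then exactly two Es.
clue = r"H+E{2}"
```

The trick of these puzzles is that each row and column has its own clue, so every cell must satisfy two regexes at once.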
A fascinating article in The Verge looks at the history of casino cheating and talks to Ted Whiting, director of surveillance at the Aria casino in Vegas, who specced out a huge, showy CCTV room with feeds from more than 1,100 cameras. They use a lot of machine intelligence to raise potential cheating to the attention of the operators.
Despite that, Whiting says facial recognition software hasn’t been of much use to him. It’s simply too unreliable when it comes to spotting people on the move, in crowds, and under variable lighting. Instead, he and his team rely on pictures shared from other casinos, as well as through the Biometrica and Griffin databases. (The Griffin database, which contains pictures and descriptions of various undesirables, used to go to subscribers as massive paper volumes.) But quite often, they’re not looking for specific people, but rather patterns of behavior. "Believe it or not, when you've done this long enough," he says, "you can tell when somebody's up to no good. It just doesn't feel right."
They keep a close eye on the tables, since that’s where cheating’s most likely to occur. With 1080p high-definition cameras, surveillance operators can read cards and count chips — a significant improvement over earlier cameras. And though facial recognition doesn’t yet work reliably enough to replace human operators, Whiting’s excited at the prospects of OCR. It’s already proven useful for identifying license plates. The next step, he says, is reading cards and automatically assessing a player’s strategy and skill level. In the future, maybe, the cameras will spot card counters and other advantage players without any operator intervention. (Whiting, a former advantage player himself, can often spot such players. Rather than kick them out, as some casinos did in the past, Aria simply limits their bets, making it economically disadvantageous to keep playing.)
With over a thousand cameras operating 24/7, the monitoring room creates tremendous amounts of data every day, most of which goes unseen. Six technicians watch about 40 monitors, but all the feeds are saved for later analysis. One day, as with OCR scanning, it might be possible to search all that data for suspicious activity. Say, a baccarat player who leaves his seat, disappears for a few minutes, and is replaced with another player who hits an impressive winning streak. An alert human might spot the collusion, but even better, video analytics might flag the scene for further review. The valuable trend in surveillance, Whiting says, is toward this data-driven analysis (even when much of the job still involves old-fashioned gumshoe work). "It's the data," he says, "And cameras now are data. So it's all data. It's just learning to understand that data is important."
One thing I wanted to see in this piece was some reflection on how the casino level of surveillance, and the casino theory of justice (spying on everyone to catch the guilty few), have become the new normal across the world.
Not in my house: how Vegas casinos wage a war on cheating [Jesse Hicks/The Verge]
Montreal comp sci student reports massive bug, is expelled and threatened with arrest for checking to see if it had been fixed
Ahmed Al-Khabaz was a 20-year-old computer science student at Dawson College in Montreal, until he discovered a big, glaring bug in Omnivox, software widely used by Quebec's junior college system. The bug exposed the personal information (social insurance number, home address, class schedule) of its users. When Al-Khabaz reported the bug to François Paradis, his college's Director of Information Services and Technology, he was congratulated. But when he checked a few days later to see if the bug had been fixed, he was threatened with arrest and made to sign a secret gag-order whose existence he wasn't allowed to disclose. Then, he was expelled:
“I was called into a meeting with the co–ordinator of my program, Ken Fogel, and the dean, Dianne Gauvin,” says Mr. Al-Khabaz. “They asked a lot of questions, mostly about who knew about the problems and who I had told. I got the sense that their primary concern was covering up the problem.”
Following this meeting, the fifteen professors in the computer science department were asked to vote on whether to expel Mr. Al-Khabaz, and fourteen voted in favour. Mr. Al-Khabaz argues that the process was flawed because he was never given a chance to explain his side of the story to the faculty. He appealed his expulsion to the academic dean and even director-general Richard Filion. Both denied the appeal, leaving him in academic limbo.
“I was acing all of my classes, but now I have zeros across the board. I can’t get into any other college because of these grades, and my permanent record shows that I was expelled for unprofessional conduct. I really want this degree, and now I won’t be able to get it. My academic career is completely ruined. In the wrong hands, this breach could have caused a disaster. Students could have been stalked, had their identities stolen, their lockers opened and who knows what else. I found a serious problem, and tried to help fix it. For that I was expelled.”
The thing that gets me, as a member of a computer science faculty, is how gutless his instructors were in their treatment of this promising student. They're sending a clear signal that you're better off publicly disclosing bugs without talking to faculty or IT than going through channels, because "responsible disclosure" means that bugs go unpatched, students go unprotected, and your own teachers will never, ever have your back.
Shame on them.
On Twitter's engineering blog, a fascinating description of how Twitter uses a blend of machine intelligence and Mechanical Turk tasks to figure out, in real time, what is going on in the world:
Before we delve into the details, here's an overview of how the system works.
- First, we monitor for which search queries are currently popular.
Behind the scenes: we run a Storm topology that tracks statistics on search queries.
For example, the query [Big Bird] may suddenly see a spike in searches from the US.
- As soon as we discover a new popular search query, we send it to our human evaluators, who are asked a variety of questions about the query.
Behind the scenes: when the Storm topology detects that a query has reached sufficient popularity, it connects to a Thrift API that dispatches the query to Amazon's Mechanical Turk service, and then polls Mechanical Turk for a response.
For example: as soon as we notice "Big Bird" spiking, we may ask judges on Mechanical Turk to categorize the query, or provide other information (e.g., whether there are likely to be interesting pictures of the query, or whether the query is about a person or an event) that helps us serve relevant Tweets and ads.
- Finally, after a response from an evaluator is received, we push the information to our backend systems, so that the next time a user searches for a query, our machine learning models will make use of the additional information. For example, suppose our evaluators tell us that [Big Bird] is related to politics; the next time someone performs this search, we know to surface ads by @barackobama or @mittromney, not ads about Dora the Explorer.
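The three steps above can be sketched in a few lines. This is my own toy illustration of the described pipeline, not Twitter's code: the names, the spike threshold, and the canned Mechanical Turk response are all invented for the example.

```python
from collections import Counter

SPIKE_THRESHOLD = 3.0  # call a query "popular" when its recent rate triples its baseline

def detect_spikes(baseline: Counter, recent: Counter) -> list[str]:
    """Return queries whose recent search count far exceeds their baseline rate."""
    spikes = []
    for query, count in recent.items():
        expected = baseline.get(query, 1)  # smooth unseen queries to 1
        if count / expected >= SPIKE_THRESHOLD:
            spikes.append(query)
    return spikes

def dispatch_to_evaluators(query: str) -> dict:
    """Stand-in for the Thrift/Mechanical Turk round trip: in the real
    system this would create a task for human judges and poll for the
    answer; here we just return a canned annotation."""
    return {"query": query, "category": "politics", "has_pictures": True}

baseline = Counter({"big bird": 2, "weather": 50})
recent = Counter({"big bird": 40, "weather": 55})

for q in detect_spikes(baseline, recent):
    annotation = dispatch_to_evaluators(q)
    # push the annotation to the backend so the next search for this
    # query can use the human-supplied category
    print(annotation)
```

The real system does the first step with streaming statistics in a Storm topology rather than batch counters, but the shape of the loop (detect, ask humans, feed the answer back to the models) is the same.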
Jeremy Kun, a mathematics PhD student at the University of Illinois in Chicago, has posted a wonderful primer on probability theory for programmers on his blog. It's a subject vital to machine learning and data-mining, and it's at the heart of much of the stuff going on with Big Data. His primer is lucid and easy to follow, even for math ignoramuses like me.
For instance, suppose our probability space is Ω = {1, 2, 3, 4, 5, 6} and f is defined by setting f(x) = 1/6 for all x in Ω (here the “experiment” is rolling a single die). Then we are likely interested in more exquisite kinds of outcomes; instead of asking the probability that the outcome is 4, we might ask what is the probability that the outcome is even? This event would be the subset {2, 4, 6}, and if any of these are the outcome of the experiment, the event is said to occur. In this case we would expect the probability of the die roll being even to be 1/2 (but we have not yet formalized why this is the case).
As a quick exercise, the reader should formulate a two-dice experiment in terms of sets. What would the probability space consist of as a set? What would the probability mass function look like? What are some interesting events one might consider (if playing a game of craps)?
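The definitions translate directly into code. This is my own illustration of the primer's setup, not Kun's: a probability space is just a finite set, a mass function assigns each outcome a probability, and an event's probability is the sum of the mass function over its members (including a crack at the two-dice exercise).

```python
from fractions import Fraction
from itertools import product

def prob(event, f):
    """P(E) = sum of f(x) over the outcomes x in the event E."""
    return sum(f[x] for x in event)

# Single die: Omega = {1, ..., 6}, with f(x) = 1/6 for every outcome.
omega = range(1, 7)
f = {x: Fraction(1, 6) for x in omega}

even = {x for x in omega if x % 2 == 0}  # the event "the roll is even"
print(prob(even, f))  # -> 1/2

# The exercise: two dice. The space is the set of ordered pairs, each
# with mass 1/36, and a craps "natural" (rolling 7 or 11) is an event.
omega2 = list(product(range(1, 7), repeat=2))
f2 = {pair: Fraction(1, 36) for pair in omega2}

craps_natural = {p for p in omega2 if sum(p) in (7, 11)}
print(prob(craps_natural, f2))  # -> 2/9
```

Using `Fraction` instead of floats keeps the answers exact, which matches the primer's spirit of treating probabilities as defined quantities rather than measurements.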
The FBI and Ernst and Young have released a list of top-ten phrases that indicate corporate fraud, based on data-mining evidence from real corporate fraud investigations.
In total more than 3,000 terms are logged by the technology, which monitors for conversations within the "fraud triangle", where pressure, rationalisation, and opportunity meet, said the FBI and Ernst & Young...
1. Cover up
2. Write off
4. Failed investment
5. Nobody will find out
6. Grey area
7. They owe it to me
8. Do not volunteer information
9. Not ethical
10. Off the books
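The monitoring idea is simple substring surveillance at scale. Here's a minimal sketch of that kind of phrase matcher, using only the phrases published above (the real system tracks over 3,000 terms); the code and function names are mine, not the FBI's or Ernst & Young's.

```python
import re

FRAUD_PHRASES = [
    "cover up", "write off", "failed investment", "nobody will find out",
    "grey area", "they owe it to me", "do not volunteer information",
    "not ethical", "off the books",
]

# One case-insensitive pattern per phrase, with word boundaries so that
# e.g. "coverage" doesn't trip the "cover up" rule.
PATTERNS = [re.compile(r"\b" + re.escape(p) + r"\b", re.IGNORECASE)
            for p in FRAUD_PHRASES]

def flag_message(text: str) -> list[str]:
    """Return every monitored phrase that appears in a message."""
    return [phrase for phrase, pat in zip(FRAUD_PHRASES, PATTERNS)
            if pat.search(text)]

email = "Let's keep this off the books -- nobody will find out."
print(flag_message(email))  # -> ['nobody will find out', 'off the books']
```

A production system would presumably add stemming, phrase variants, and the pressure/rationalisation/opportunity scoring the article alludes to, but the core is just this kind of dictionary match.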
Inception is a tool for breaking into computers with full-disk encryption. It assumes that you have access to a suspended/screen-locked computer whose disk is encrypted. You access the machine over its FireWire interface (or, if it doesn't have FireWire, you plug a FireWire card into one of its slots, and the machine will automatically fetch, install and configure the drivers, even if it's asleep), and then use the FireWire drivers to directly access system memory, and from there, patch the password-checking routine and walk straight into the computer.
This (and its predecessors, like winlockpwn) is a substantial advance on previous attacks against sleeping full-disk encrypted systems, which involved things like plunging the RAM into a bath of liquid nitrogen. As the author, Carsten Maartmann-Moe, points out, this can't be easily remedied with a FireWire driver update, since FireWire requires direct memory access to effect high-speed transfers.
So, two things: First, shut down your computer when it's not in your possession; second, "Inception" is an inspired name for an attack that breaks into the dreams of a sleeping computer, directly accesses its memory, and causes it to spill its secrets.
Inception’s main mode works as follows: By presenting a Serial Bus Protocol 2 (SBP-2) unit directory to the victim machine over the IEEE1394 FireWire interface, the victim operating system thinks that a SBP-2 device has connected to the FireWire port. Since SBP-2 devices utilize Direct Memory Access (DMA) for fast, large bulk data transfers (e.g., FireWire hard drives and digital camcorders), the victim lowers its shields and enables DMA for the device. The tool now has full read/write access to the lower 4GB of RAM on the victim. Once DMA is granted, the tool proceeds to search through available memory pages for signatures at certain offsets in the operating system’s password authentication modules. Once found, the tool short circuits the code that is triggered if an incorrect password is entered.
An analogy for this operation is planting an idea into the memory of the machine; the idea that every password is correct. In other words, the nerdy equivalent of a memory inception.
After running the tool you should be able to log into the victim machine using any password.
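The patching step can be modelled in a few lines: scan memory page by page for a known code signature, and overwrite the branch that punishes a wrong password. This is a toy simulation over a byte buffer, not Inception's actual code; the signature and patch bytes are invented for illustration (the real tool carries per-OS signatures and does the writes over FireWire DMA).

```python
PAGE_SIZE = 4096

# Hypothetical x86 signature for a password check, and a patch that
# turns its conditional jump into an unconditional one.
SIGNATURE = bytes.fromhex("85c0740c")   # test eax, eax ; je  +0x0c
PATCH     = bytes.fromhex("85c0eb0c")   # test eax, eax ; jmp +0x0c

def patch_memory(ram: bytearray) -> int:
    """Scan each page for the signature, patch every hit in place,
    and return the number of patches applied."""
    hits = 0
    for base in range(0, len(ram), PAGE_SIZE):
        page = ram[base:base + PAGE_SIZE]
        offset = page.find(SIGNATURE)
        while offset != -1:
            ram[base + offset:base + offset + len(PATCH)] = PATCH
            hits += 1
            offset = page.find(SIGNATURE, offset + 1)
    return hits

ram = bytearray(8 * PAGE_SIZE)   # stand-in for the victim's lower RAM
ram[100:104] = SIGNATURE         # pretend the auth routine lives here
print(patch_memory(ram))         # -> 1
```

The page-by-page loop mirrors how the attack walks the victim's memory; a signature that straddles a page boundary would need an overlapping scan, which this sketch omits.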
Wired's gallery of the paleolithic antecedents of today's social media technologies is a bit mismatched (some really interesting insights into today's media lineage, but mixed with some silliness), but the lead item, the Community Memory terminal from 1973, is pure gold. I wrote half an unsuccessful novel about this thing when I was about 25, and it's never stopped haunting me.
Three decades before Yelp and Craigslist, there was the Community Memory Terminal.
In the early 1970s, Efrem Lipkin, Mark Szpakowski and Lee Felsenstein set up a series of these terminals around San Francisco and Berkeley, providing access to an electronic bulletin board housed by an XDS-940 mainframe computer.
This started out as a social experiment to see if people would be willing to share via computer -- a kind of "information flea market," a "communication system which allows people to make contact with each other on the basis of mutually expressed interest," according to a brochure from the time.
What evolved was a proto-Facebook-Twitter-Yelp-Craigslist-esque database filled with searchable roommate-wanted and for-sale items ads, restaurant recommendations, and, well, status updates, complete with graphics and social commentary.
"This was really one of the very first attempts to give access to computers to ordinary people," says Marc Weber, the founding curator of the Internet History Program at the Computer History Museum in Mountain View, California.
Holy shit, that is a thing of beauty.
In "Credibility ranking of tweets during high impact events," a paper published in the ACM's Proceedings of the 1st Workshop on Privacy and Security in Online Social Media, two Indraprastha Institute of Information Technology researchers describe the outcome of a machine-learning experiment that was asked to discover factors correlated with reliability in tweets during disasters and emergencies:
The number of unique characters present in tweet was positively correlated to credibility, this may be due to the fact that tweets with hashtags, @mentions and URLs contain more unique characters. Such tweets are also more informative and linked, and hence credible. Presence of swear words in tweets indicates that it contains the opinion / reaction of the user and would have less chances of providing information about the event. Tweets that contain information or are reporting facts about the event, are impersonal in nature, as a result we get a negative correlation of presence of pronouns in credible tweets. Low number of happy emoticons [:-), :)] and high number of sad emoticons [:-(, :(] act as strong predictors of credibility. Some of the other important features (p-value < 0.01) were inclusion of a URL in the tweet, number of followers of the user who tweeted and presence of negative emotion words. Inclusion of URL in a tweet showed a strong positive correlation with credibility, as most URLs refer to pictures, videos, resources related to the event or news articles about the event.
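The features the paper names are cheap to compute. Here's a sketch of that kind of per-tweet feature extractor; it's my own approximation of the features described above, not the authors' code, and the pronoun list is deliberately tiny.

```python
import re

PRONOUNS = {"i", "me", "my", "you", "your", "we", "us", "he", "she"}
HAPPY = [":-)", ":)"]
SAD = [":-(", ":("]

def tweet_features(tweet: str) -> dict:
    """Extract the credibility-correlated features the paper describes:
    unique characters, URL presence, pronouns, and emoticon counts."""
    words = re.findall(r"[a-z']+", tweet.lower())
    return {
        "unique_chars": len(set(tweet)),
        "has_url": bool(re.search(r"https?://\S+", tweet)),
        "num_pronouns": sum(w in PRONOUNS for w in words),
        "happy_emoticons": sum(tweet.count(e) for e in HAPPY),
        "sad_emoticons": sum(tweet.count(e) for e in SAD),
    }

print(tweet_features("Power out downtown :( updates at http://example.com"))
```

In the paper these features feed a trained ranker; the extractor is the easy half, which is exactly why the adversarial worry in the next paragraph bites.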
Of course, this is all non-adversarial: no one is trying to trick a filter into mis-assessing a false account as a true one. It's easy to imagine an adversarial tweet-generator that suggests rewrites to deliberately misleading tweets to make them more credible to a filter designed on these lines. This is actually the substance of one of the cleverest science fiction subplots I've read: in Peter Watts's Behemoth, a self-modifying computer virus randomly hits on the strategy of impersonating communications from patient zero in a world-killing pandemic, because all the filters allow these through. It's a premise that's never stopped haunting me: the co-evolution of a human virus and a computer virus.
On the Internet Archive, a hi-rez scan of the 1983 Radio Shack computer catalog, which is a wonderland of jaw-dropping prices for prosumer equipment from my boyhood that doesn't even qualify as a toddler's toy today. I will always retain a fondness for acoustic couplers, though, as they were the way I first connected to a computer, running a screenless teletype terminal connected to my Dad's university PDP by means of one of these suction-cup wonders. There was something, I dunno, legible about being able to see how the Bell handset fit into that cradle, to hear the barely audible tinny whine of the characters crawling over the wire. It was like being able to watch nerve impulses travel from a brain to a distant limb.
But $995? Please.