Promise and peril of data-scraping

Josh McHugh's Wired feature, "Should Web Giants Let Startups Use the Information They Have About You?," is a meaty, thinky piece about the many risks of data-scraping. The piece investigates the risks to users (your data, slithering around the net), the risks to scrapers (your business entirely dependent on someone else's goodwill), and the risks to scrapees (bandwidth clobbering, your users get screwed and so on):
Giants like Yahoo and Google have thus far taken a mostly nonproprietary stance toward their data, typically letting outside developers access it in an attempt to curry favor with them and foster increased inbound Web traffic. Most of the largest Web companies position themselves as benign, bountiful data gardens, supplying the environment and raw materials to build inspired new products. After all, Google itself, that harbinger of the Web2.0 era, thrives on info that could be said to "belong" to others -- the links, keywords, and metadata that reside on other Web sites and that Google harvests and repositions into search results.

But beneath all the kumbayas, there's an awkward dance going on, an unregulated give-and-take of information for which the rules are still being worked out. And in many cases, some of the big guys that have been the source of that data are finding they can't -- or simply don't want to -- allow everyone to access their information, Web2.0 dogma be damned. The result: a generation of businesses that depend upon the continued good graces of a relatively small group of Internet powerhouses that philosophically agree information should be free -- until suddenly it isn't.



  1. This was a good article, and I wonder where the privacy issue will take us. Charles Stross has made an issue out of all the CCTVs in London, watching your every move. Which would be fine with me if that happens in all cities. Privacy is something I don’t think I’m going to have more of in the future, but less.


  2. Ah, the joys of “information wants to be free”. Just wait until some loser murders his wife in her domestic violence shelter whose location used to be shielded from public view. Google recently had ours on display in their directory, with a friggin’ photo of the front door. Getting it removed was extremely difficult, because mortals can’t contact Google; they contact you. Like the gods.

  3. Once upon a time, I scraped an external archive of craigslist and made some message frequency/time by category plots. The external archive contained data from 1998 through 2001, making for some very interesting pictures of a dot.boom and bust. Slashdot got wind of it and tens of thousands of folks came to look at the graphs.

    Later I wanted to legitimize it and plot data in realtime. (Given the flip-flopping of the dotconomy over the past five years, it would have been very interesting.)

    In a foolish choice to be a “good Internet citizen”, I e-mailed the craigslist folks to ask for permission. I figured “Hey. Any community-minded organization wouldn’t mind, especially an organization like craigslist.” That was true of Craig Newmark as he tried to help. However, their douchebag of a CEO wouldn’t budge, claiming that there was absolutely no way I could possibly collect the data without crashing their servers. (Even when I suggested well-spaced RSS pulls, he rudely said no.)

    It really made me realize that no matter how much these guys like to present themselves as “community minded,” at the end of the day, when it comes to their data, they’re obnoxious closed businessmen all the same.

  4. CORRECTION: I just looked at the old e-mail thread where this transpired. It wasn’t the CEO that put the kibosh on this, it was one of their techies. Nevertheless, he was on the thread and never intervened, which is almost but not quite as lame as I made him out to be in my prior comment.
