Robots (or spiders, or crawlers) are little computer programs that search engines use to scan and index websites. Robots.txt is a little file placed on web servers to tell search engines what they should and shouldn't index. The Internet Archive isn't a search engine, but it has historically obeyed exclusion requests from robots.txt files. Now it's changing its mind, because robots.txt is almost always crafted with search engines in mind and rarely reflects the intentions of domain owners when it comes to archiving.
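For anyone who hasn't looked inside one, here is a rough sketch of how a well-behaved crawler consults robots.txt before fetching a page, using Python's standard `urllib.robotparser`. The file contents, the `ExampleBot` name, and the example.com URLs are all made up for illustration; they are not taken from any real site or from the Internet Archive's crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt of the kind written with search engines in mind:
# keep crawlers out of private pages and large image files.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /images/full-size/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the rules in robots_txt let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "https://example.com/index.html",
        "https://example.com/private/notes.html",
        "https://example.com/images/full-size/cat.jpg",
    ):
        print(url, is_allowed(ROBOTS_TXT, "ExampleBot", url))
```

The point to notice is that the check is entirely voluntary: the crawler asks, and the crawler decides whether to honor the answer.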
Over time we have observed that the robots.txt files that are geared toward search engine crawlers do not necessarily serve our archival purposes. Internet Archive’s goal is to create complete “snapshots” of web pages, including the duplicate content and the large versions of files. We have also seen an upsurge of the use of robots.txt files to remove entire domains from search engines when they transition from a live web site into a parked domain, which has historically also removed the entire domain from view in the Wayback Machine. In other words, a site goes out of business and then the parked domain is “blocked” from search engines and no one can look at the history of that site in the Wayback Machine anymore. We receive inquiries and complaints on these “disappeared” sites almost daily.
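The parked-domain problem described above is easy to reproduce. The sketch below assumes a blanket robots.txt of the sort a domain parker might publish, and a hypothetical crawler named `ExampleArchiveBot` that honors it; the file contents and URLs are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt on a parked domain: every crawler is shut out
# of every path.
PARKED_ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def blocked_urls(robots_txt: str, user_agent: str, urls: list[str]) -> list[str]:
    """Return the subset of urls that the rules forbid user_agent to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


if __name__ == "__main__":
    old_pages = [
        "https://example.com/",
        "https://example.com/about.html",
        "https://example.com/archive/2009/post.html",
    ]
    # Every page of the old site comes back as blocked.
    print(blocked_urls(PARKED_ROBOTS_TXT, "ExampleArchiveBot", old_pages))
```

An archive that applies today's file to yesterday's snapshots ends up hiding the whole history of the site, which is the behavior the announcement is walking away from.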
A few months ago we stopped referring to robots.txt files on U.S. government and military web sites for both crawling and displaying web pages (though we respond to removal requests sent to email@example.com). As we have moved towards broader access it has not caused problems, which we take as a good sign. We are now looking to do this more broadly.
An excellent decision. To be clear, they're ignoring robots.txt even if you explicitly identify and disallow the Internet Archive. It's a splendid reminder that nothing published on the web is ever meaningfully private, and that all of it will go on your permanent record.