Robots (or spiders, or crawlers) are little computer programs that search engines use to scan and index websites. Robots.txt is a little file placed on webservers to tell search engines what they should and shouldn't index. The Internet Archive isn't a search engine, but has historically obeyed exclusion requests from robots.txt files. But it's changing its mind, because robots.txt is almost always crafted with search engines in mind and rarely reflects the intentions of domain owners when it comes to archiving.
Over time we have observed that the robots.txt files that are geared toward search engine crawlers do not necessarily serve our archival purposes. Internet Archive’s goal is to create complete “snapshots” of web pages, including the duplicate content and the large versions of files. We have also seen an upsurge of the use of robots.txt files to remove entire domains from search engines when they transition from a live web site into a parked domain, which has historically also removed the entire domain from view in the Wayback Machine. In other words, a site goes out of business and then the parked domain is “blocked” from search engines and no one can look at the history of that site in the Wayback Machine anymore. We receive inquiries and complaints on these “disappeared” sites almost daily.
A few months ago we stopped referring to robots.txt files on U.S. government and military web sites for both crawling and displaying web pages (though we respond to removal requests sent to email@example.com). As we have moved towards broader access it has not caused problems, which we take as a good sign. We are now looking to do this more broadly.
An excellent decision. To be clear, they're ignoring robots.txt even if you explicitly identify and disallow the Internet Archive. It's a splendid reminder that nothing published on the web is ever meaningfully private, and that everything will always go on your permanent record.
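For readers who have never looked inside one, here is a rough sketch of what "explicitly identify and disallow" means in practice, using Python's standard `urllib.robotparser`. The sample file, the `example.com` URLs, the `ExampleSearchBot` name and the `is_allowed` helper are all made up for illustration, and `ia_archiver` is assumed here to be the user-agent token site owners have typically used to address the Archive's crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut out the Archive's crawler entirely,
# and keep every other robot out of /private/ only.
# "ia_archiver" is an assumed user-agent token, not taken from the post.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a crawler that honors robots.txt may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    checks = [
        ("ia_archiver", "https://example.com/page.html"),
        ("ExampleSearchBot", "https://example.com/page.html"),
        ("ExampleSearchBot", "https://example.com/private/notes.html"),
    ]
    for agent, url in checks:
        print(agent, url, is_allowed(ROBOTS_TXT, agent, url))
```

Note that this only computes what a well-behaved robot would conclude. Nothing in the file enforces anything: the crawler decides whether to run a check like this at all, which is exactly why the Archive can simply stop doing so.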