Robots (or spiders, or crawlers) are little computer programs that search engines use to scan and index websites. Robots.txt is a little file placed on webservers to tell search engines what they should and shouldn't index. The Internet Archive isn't a search engine, but has historically obeyed exclusion requests from robots.txt files. But it's changing its mind, because robots.txt is almost always crafted with search engines in mind and rarely reflects the intentions of domain owners when it comes to archiving.
Over time we have observed that the robots.txt files that are geared toward search engine crawlers do not necessarily serve our archival purposes. Internet Archive’s goal is to create complete “snapshots” of web pages, including the duplicate content and the large versions of files. We have also seen an upsurge of the use of robots.txt files to remove entire domains from search engines when they transition from a live web site into a parked domain, which has historically also removed the entire domain from view in the Wayback Machine. In other words, a site goes out of business and then the parked domain is “blocked” from search engines and no one can look at the history of that site in the Wayback Machine anymore. We receive inquiries and complaints on these “disappeared” sites almost daily.
A few months ago we stopped referring to robots.txt files on U.S. government and military web sites for both crawling and displaying web pages (though we respond to removal requests sent to firstname.lastname@example.org). As we have moved towards broader access it has not caused problems, which we take as a good sign. We are now looking to do this more broadly.
An excellent decision. To be clear, they're ignoring robots.txt even if you explicitly identify and disallow the Internet Archive. It's a splendid reminder that nothing published on the web is ever meaningfully private, and that all of it goes on your permanent record.
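For anyone who hasn't looked inside one, here is a rough sketch of what "explicitly identify and disallow" means in practice. It uses Python's standard-library `urllib.robotparser` to read a made-up robots.txt. The `ia_archiver` user-agent token, the `example.com` URLs, and the `allowed` helper are assumptions for illustration only, not anything taken from the Internet Archive's announcement.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one rule group singles out an archive crawler
# (the "ia_archiver" token is an assumption), another covers everyone else.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /private/
"""


def allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # The archive crawler is shut out of the whole site...
    print(allowed("ia_archiver", "https://example.com/page"))
    # ...while other crawlers are only kept out of /private/.
    print(allowed("Googlebot", "https://example.com/page"))
    print(allowed("Googlebot", "https://example.com/private/notes.html"))
```

Note that nothing here enforces anything. The file is a request, and it only has an effect if the crawler chooses to run a check like this before fetching, which is exactly the step the Archive is proposing to skip.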