Robots (or spiders, or crawlers) are little computer programs that search engines use to scan and index websites. Robots.txt is a little file placed on webservers to tell search engines what they should and shouldn't index. The Internet Archive isn't a search engine, but has historically obeyed exclusion requests from robots.txt files. But it's changing its mind, because robots.txt is almost always crafted with search engines in mind and rarely reflects the intentions of domain owners when it comes to archiving.
Over time we have observed that the robots.txt files that are geared toward search engine crawlers do not necessarily serve our archival purposes. Internet Archive’s goal is to create complete “snapshots” of web pages, including the duplicate content and the large versions of files. We have also seen an upsurge of the use of robots.txt files to remove entire domains from search engines when they transition from a live web site into a parked domain, which has historically also removed the entire domain from view in the Wayback Machine. In other words, a site goes out of business and then the parked domain is “blocked” from search engines and no one can look at the history of that site in the Wayback Machine anymore. We receive inquiries and complaints on these “disappeared” sites almost daily.
A few months ago we stopped referring to robots.txt files on U.S. government and military web sites for both crawling and displaying web pages (though we respond to removal requests sent to email@example.com). As we have moved towards broader access it has not caused problems, which we take as a good sign. We are now looking to do this more broadly.
An excellent decision. To be clear, they're ignoring robots.txt even if you explicitly identify and disallow the Internet Archive. It's a splendid reminder that nothing published on the web is ever meaningfully private, and that all of it will go on your permanent record.
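For readers who haven't looked inside one, here is a rough sketch of what "explicitly identify and disallow" means in practice, using Python's standard `urllib.robotparser`. The robots.txt contents, the `example.com` URLs, and the helper `is_allowed` are made up for illustration, and `ia_archiver` is assumed here to be the user-agent name a site owner would use to target the Archive.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks one named crawler entirely and asks
# everyone else to stay out of /private/. "ia_archiver" is an assumed
# user-agent name for the Internet Archive, not taken from the post.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("ia_archiver", "https://example.com/index.html"),
        ("Googlebot", "https://example.com/index.html"),
        ("Googlebot", "https://example.com/private/notes.html"),
    ]:
        print(agent, url, is_allowed(ROBOTS_TXT, agent, url))
```

The file is only a request. A crawler has to choose to run a check like this and honor the answer, which is exactly the step the Archive is dropping.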