The internet's other archive

The website archive.today (also accessible at archive.is) has become a backbone of the internet, providing on-demand archives of specific pages and access to paywalled sites that play the SEO game of revealing the full text of articles to non-human viewers. It's now a common Wikipedia citation! Jani Patokallio wrote up a history of the site and its unique position in the web's ecology.

The archive runs Apache Hadoop and Apache Accumulo. All data is stored on HDFS; textual content is replicated 3 times across servers in 2 datacenters, and images are replicated 2 times. Both datacenters are in Europe, with OVH hosting at least one of them.

In 2012, the site already had 10 TB of archives and cost ~300 euros/mo to run, escalating to 2000 euros by 2014 and $4000 by 2016. As of 2021, they have archived on the order of 500 million pages, and with the average size of a webpage clocking in at well over 2 MB these days, that's a cool 1,000 TB to deal with. (For comparison, the Internet Archive is around 40,000 TB.)

The less discussed but more controversial half of the site is scraping, the process of vacuuming up live webpages. Since 2021, this uses a modified version of the Chrome browser, and the blog readily admits that the availability of computing power to run these automated browsers is now the main bottleneck to expanding the site. To avoid detection, archive.today runs via a botnet that cycles through countless IP addresses, making it quite difficult for grumpy webmasters to stop their sites getting scraped. Access to paywalled sites is through logins secured via unclear means, which need to be replenished constantly: here's the creator asking for Instagram credentials.
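The storage estimate quoted earlier (roughly 500 million pages at about 2 MB each) is easy to sanity-check. The sketch below just redoes that arithmetic; it assumes decimal units (1 TB = 1,000,000 MB) and treats the "well over 2 MB" average as a flat 2 MB, so the result is a rough floor rather than a measured figure. The constant and function names are made up for illustration.

```python
# Figures quoted in the post; the 2 MB average is a simplification
# of "well over 2 MB", so treat the result as a rough lower bound.
PAGES_ARCHIVED = 500_000_000
AVG_PAGE_MB = 2
INTERNET_ARCHIVE_TB = 40_000

# Assumption: decimal units, 1 TB = 1,000,000 MB.
MB_PER_TB = 1_000_000


def estimate_tb(pages: int, avg_mb: float) -> float:
    """Return the total size in TB for `pages` pages of `avg_mb` MB each."""
    return pages * avg_mb / MB_PER_TB


archive_today_tb = estimate_tb(PAGES_ARCHIVED, AVG_PAGE_MB)
share_of_internet_archive = archive_today_tb / INTERNET_ARCHIVE_TB

print(f"archive.today estimate: {archive_today_tb:,.0f} TB")
print(f"share of Internet Archive size: {share_of_internet_archive:.1%}")
```

On these numbers the estimate comes out to 1,000 TB, or about 2.5% of the 40,000 TB quoted for the Internet Archive. It ignores the replication described above, so the raw disk footprint would be larger.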