Tahoe-LAFS: a P2P filesystem that lets you use the cloud without trusting it

Zooko sez,
Tahoe-LAFS is a p2p filesystem. You pool your spare hard drive space together with that of your friends. This forms a distributed filesystem which endures even if some of your friends' computers are unreachable. Everything is automatically encrypted, so backing up your files onto the distributed filesystem doesn't necessarily mean sharing the files with your friends. But, it is easy to share specific files or directories with specific friends.

It comes with a command-line interface and a web interface. If you choose, you can allow remote HTTP clients to connect to the web interface. We've configured our test grid to do that so that you can take Tahoe-LAFS for a test drive just by clicking here.

Please try it out and contribute bug reports! We are an all-volunteer project of Free Software hackers in the public interest. We need encouragement, love, and bug reports.

This looks like some exciting stuff! From the announcement:
In addition to the core storage system itself, volunteers have developed related projects to integrate it with other tools. These include frontends for Windows, Macintosh, JavaScript, and iPhone, and plugins for Hadoop, bzr, duplicity, TiddlyWiki, and more. As of this release, contributors have added an Android frontend and a working read-only FUSE frontend. See the Related Projects page on the wiki [3].

We believe that the combination of erasure coding, strong encryption, Free/Open Source Software and careful engineering makes Tahoe-LAFS safer than RAID, removable drive, tape, on-line backup or other Cloud storage systems.

ANNOUNCING Tahoe, the Least-Authority File System, v1.6 (Thanks, Zooko!)

(Image: King Cloud, a Creative Commons Attribution ShareAlike photo from akakumo's photostream)


    1. It literally IS mojonation, just a few versions newer. (ok ok, there may have been a rewrite but it’s the same thing)

    1. With traditional backups, you store a copy of your own files on external media. The problem is that this media could get lost, stolen or destroyed (your house could burn down). You could keep another copy at the office or a friend’s house, but it could still get lost or stolen. Also, you would have to trust your friend or coworkers not to take a peek at your data.

      There are so-called “cloud-based” backup solutions, such as Amazon’s S3. They provide backups over the Internet, and have multiple geographically distributed backups. This is a great solution, but there is still the problem that you must trust Amazon not to take a peek at your data, or to let others (such as the feds) do so.

      What Tahoe-LAFS does is implement the same type of distributed remote backup technology that other cloud backup services do, but it also has a layer of encryption on your data designed so that you are the only person who can view that data. That way you get the privacy of personal at-home backups, but the reliability and redundancy of professional remote backup services.

  1. I currently accomplish “safety” in the cloud by using the duplicity tool to store data at a standard provider (in my case, rsync.net). The only problem with this is that I am effectively locked into this particular provider, as very few providers offer plain old SFTP as a transport.

    I’d love to use Tahoe-LAFS with an established, fixed infrastructure. rsync.net has done some clueful FOSS-related things in the past, and I wonder if they would implement this…

  2. zog: Tahoe is designed by some of the same people who developed MojoNation and Mnet (including Zooko). In fact it’s the second or third rewrite of the MojoNation code base, after Mnet and “Mountain View”. This article is slightly dated but gives a good summary of the design.

    Anonymous: Tahoe does support SFTP. Currently this is a bit difficult to set up and may have some bugs, but I think getting SFTP working properly is likely to be a priority for the next version. allmydata.com is a commercial backup provider that uses Tahoe-LAFS.

    1. Not at all like the LOCKSS system, which is a tool allowing libraries to collaborate to preserve published, copyrighted material for the long term (with permission from the copyright holder), not a backup system. The LOCKSS system does not use either erasure coding or encryption, both of which are dangerous for long-term digital preservation.

      1. I’m curious, how is erasure coding dangerous for long-term digital preservation?

        Is it just that a large number of full copies is preferable?

        My understanding of the purpose of erasure coding is that it allows a flexible (at coding time) trade-off to be made between redundancy and storage size. I.e. without erasure coding you can have redundancy in whole number increments at a storage cost of the size of a full copy on each node. With erasure coding you can choose to require only n out of m sources at a storage cost less than n (or is it m?) full copies.

        I can see how in the very long term, where it’s possible that fewer than n nodes survive and the encoding scheme may no longer be in common use, the material could be lost where even one full (and not otherwise encrypted or even just anachronistically encoded) copy would preserve it…

      2. I’m one of the contributors to Tahoe-LAFS. I disagree that encryption and erasure-coding are dangerous for long-term digital preservation. Making sure you don’t lose all copies of the decryption key is easier than making sure that you don’t lose all copies of the file because the key is smaller. Engrave it on a steel plate.

        In fact, encryption can make long-term digital preservation safer, because it allows the set of people whom you can ask to store the ciphertext to be larger than the set of people who are allowed to read the files.

        Likewise with erasure coding: I’ve read papers by archivists arguing that fancy data formats are a potential problem, and I appreciate the argument, but the erasure coding used in Tahoe-LAFS is bog-standard Reed-Solomon, which was invented decades ago and has been implemented many times. I believe the added robustness of being able to lose most of your storage and still recover all the data is worth the cost of a layer of Reed-Solomon.

        Cyborg archaeologists digging through our rubble a hundred years from now are going to have no problem with the erasure coding. They might have a problem with the encryption, so make a couple of duplicates of that steel plate.

  3. This reminds me of the Freenet project. It’s not exactly the same, but it has a lot of similarities. It’s a P2P system that uses distributed hosting of files to ensure high reliability, but instead of adding encryption, they add anonymity. The idea is that people could post dissenting or otherwise questionable content without fear of censorship or retribution.

  4. Concerns:

    1. Could you be held liable for unknowingly hosting material on your system — e.g., someone else’s child pr0n?

    2. Could your computer be confiscated by authorities investigating a crime committed by someone else, whose info might have been stored on it?

    1. 1. If you’re not knowingly distributing any illegal material, then you can’t get in trouble for doing so.

      2. Sure you could, if they needed your logs for network analysis. They might return it though. Maybe even in one piece. They might not even install any back doors on it.

      Freenet is a bit like this, but includes anonymity as well as encryption, so you would be unlikely to get your machine seized.

    2. There is no requirement that you host any files for anyone to use this. Unlike, say, BitTorrent, you are welcome to be purely a client and not also a provider in a Tahoe-LAFS system.

      If you want to contribute storage in a reciprocal manner but are concerned about such things then, as suggested in the article, consider setting up a “friends net” where you and some friends or family supply reciprocal backup storage. That way you can be reasonably certain that your buddy so-and-so or your uncle what’s-his-name is not going to expose you to such liability (because you know and trust their character), but also be assured that your data is safe from snooping, not so much by your friends and relatives as by anyone who may have compromised their computers. I certainly trust my friends’ and relatives’ moral judgment much further than I trust their IT security prowess!

  5. Encryption of secrets is really good for preserving both the secrecy and the data!

    Encryption of public data is counterproductive and even destructive.

    So say, a library digitising its collection would be required to encrypt those parts not in the public domain, if it’s allowed to even digitise in the first place.

    But the PD parts would be best preserved by publishing publicly.

    There’s nothing at all wrong, for secrets or no, with Reed-Solomon. If the future people can’t interpret R-S, then they’ll never even understand CDs!

  6. Have you considered more modern erasure coding systems? Particularly fountain or LDPC codes? Fountain codes seem particularly well suited to this type of problem.

  7. Yes! An ancestor of Tahoe-LAFS (the first version of allmydata.com, which was never open sourced) used Digital Fountain’s proprietary erasure codes. However it turns out that this provided no actual advantage over Reed-Solomon in the Tahoe-LAFS architecture. Reed-Solomon (with the zfec implementation that we use) is sufficiently fast that it is hard to even measure the time spent doing erasure coding on a live server.

  8. If security and privacy of data is a concern, and it should be, companies of all sizes should familiarize themselves with Next Generation Backup. If you run a few servers I assume you are virtualized already. Online services have a big bottleneck when backing up increasing numbers of VMs.

    If you add a need for replication to the equation, then the case for backup to disk is more compelling.

    If you add data deduplication on top of that, it is actually easy to prove that TCO is lower, complexity is less and automation is easier when you use a Next Generation Backup approach.
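
On the "n out of m" question raised in the erasure-coding thread above (is the storage cost n copies or m copies?): with a Reed-Solomon style code where any n of m shares rebuild the file, each share is roughly 1/n of the file, so the total stored is about m/n full copies. The sketch below works through that arithmetic and compares it with plain replication. The function names are made up for illustration, the 3-of-10 setting is used because it is, to my knowledge, the Tahoe-LAFS default (treat that as an assumption), and the survival estimate assumes each share's server fails independently, which real friends' computers may not.

```python
from math import comb


def expansion_factor(needed: int, total: int) -> float:
    """Storage used, in multiples of the original file size.

    Each of the `total` shares is about 1/`needed` of the file.
    """
    return total / needed


def tolerated_losses(needed: int, total: int) -> int:
    """How many shares can vanish before the file is unrecoverable."""
    return total - needed


def survival_probability(needed: int, total: int, p_up: float) -> float:
    """Chance that at least `needed` of `total` shares are still reachable.

    Assumes every share survives independently with probability `p_up`
    (a simplifying assumption, not a claim about any real grid).
    """
    return sum(
        comb(total, i) * p_up**i * (1 - p_up) ** (total - i)
        for i in range(needed, total + 1)
    )


if __name__ == "__main__":
    # Plain replication is the special case needed=1: three full copies.
    # 3-of-10 costs only slightly more space but tolerates far more losses.
    for needed, total in [(1, 3), (3, 10)]:
        print(
            f"{needed}-of-{total}: "
            f"{expansion_factor(needed, total):.2f}x storage, "
            f"survives {tolerated_losses(needed, total)} lost shares, "
            f"P(recoverable) = {survival_probability(needed, total, 0.9):.9f}"
        )
```

So three full copies cost 3x the space and survive two losses, while 3-of-10 costs about 3.33x and survives seven. That is the "lose most of your storage and still recover all the data" property mentioned in the thread.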

Comments are closed.