Procod

Professional Software Development

Peer to Peer Backup

We present the outline of a method allowing reliable data storage (used for backup) over a network of untrusted and unreliable peers.

Every participating peer wants to store some amount of own data on other (remote) computers. He wants to be able to reliably recover this data at any moment. In exchange for this, every peer agrees to offer some amount of space on it's own HDD where data belonging to other peers will be stored. Normally the amount of own disk-space offered is significantly larger than the amount of own data stored in the network.

But we don't require reliability of the peers. A peer can dissapear from the network at any moment, and for any amount of time, even forever. Also a peer can intentionally delete the (other peers') data that is stored on his computer. Of course, any computer participating in the network can crash at any time, leading to complete loss of data.

Yet the data will be reliably stored in the system, withstanding the unreliability of the peers.

The data that is to be stored is divided in fixed-size blocks (e.g. 1MB blocks). To remotely store a block, the local peer finds another peer who also wants to remotely store a block. The two peers exchange their blocks.
The received block is stored the on the local disk, together with information of its origin and information about the (local) block that was given in exchange.

Every local block will be stored remotely multiple times, on different computers. For every remote instance of a block, another block is stored, in exchange, locally. Thus, if we want a "replication factor" of 10 (i.e., every local block is stored in 10 locations), we'll have to store locally 10 remote blocks. So the local disk space that we make available is 10x the size of our data that we store remotely.

The local peer will regularilly check the availability of its own blocks on remote computers. He will send a "query" message to the remote peer, asking for a specific small region (e.g. 16bytes of data from some offset) from the block. The local peer will verify the corectness of the answer by comparing it with its local version of the block. The remote peer will be able to correcty answer such queries only if it really stores the block on its disk.

The remote peer won't answer the query, or will answer wrong, in two cases:

Both cases are equally considered "negative" answers.

When receiving a negative answer, the local peer will take these actions:

The "Effective Replication Factor" means the number of copies which are normally available during a check. The local peer will create the number of replicas that is necessary to reach a target ERF. For example, if the peer computers are turned on, in average, 12h/24h, than in order to have an ERF of 10, every block will have to be stored on 20 computers (i.e. to have 20 replicas).

But for each replica, we also have to store on the local disk a corresponding remote block (taken in exchange for our block). So, in order to reduce the local disk usage, we have the interest to store our blocks on more available remote computers (e.g. which are available 24/24), because in this case we reach the target ERF with less replicas.

This has the effect that a computer with lower availability (e.g. which only participates in the network 2h/24h) will see his remote copies dropped more often, so he will need to create more replicas to insure a target ERF. For this, he will have to offer more local disk space. This creates a natural treadoff between availability and disk-space.

Reliability

By doing periodic checks, each computer insures that its blocks really have a target ERF (with some variation). This means that at any moment, each block can be found in a given number of replicas (e.g. between 8 and 12, for an ERF of 10). This way we insure that, at the moment when a block is required for backup-recovery, the probability of finding no copies of it available is extremelly small. Moreover, we can tune this probability by modifying the target ERF (higher the ERF, exponentially lower the probability of no-availability).

Attacks

The data sink
This is a peer which continously replicates its blocks, and never stores any of the blocks that he takes in exchange. The other computers will store his blocks for a little time, but because the attacker can't answer any query correctly, his blocks will be dropped. The net effect is a little increase of the disk usage of the other peers, with no impact on the availability. This attack requires a lot of bandwidth usage on the part of the attacker, and its effects are mild. But if many peers participate in such an attack, it can have effects similar to a DDOS attack.
The bad peer
This is a peer which reliably stores the data, and answers check queries correctly; but when asked to return data for backup-recovery, he will, exactly at that critical moment, refuse to return any data. I.e., this peer intentionally fails when he is most needed. To protect against such an attack, the requests for data (intended for recovery) must be indistinquishable from the check queries, so that the "bad peer" is unable to determinate if it's a regular check or a recovery.
The global conspiracy
When many peers, which form a conspiracy, agree to fail all at the same specific moment (supposedly when the victim is in need for recovery). Because the replicas are somewhat randomly distributed to the peers, to make the victim's data unavailable would require a large part of the peers to participate in the conspiracy (e.g. more than a half); so this attack is costly (unpractical) to mount.