Ken (Chanoch) Bloom's Blog

19th July 2009

Backup requirements and backup strategies

A Slashdot article asked what the best backup strategy is for home users, now that hard drive capacities and people's media collections far outpace the removable media used for backups just a few years ago. I commented there, and then decided to turn my comment into a blog post, because the general principles are important.

To decide how best to back up, we must lay out the kinds of failures that can occur and the goals of a backup. (I keep my documents in a constellation of git repositories, so many of my backup needs are covered by replicating the repositories to several places, and some of my examples are based on my experience with git.)

  1. We would like to protect against mechanical drive failure. This can be done with a RAID.

  2. We may also want to protect against the failure of other components of the computer. My primary computer (the one that holds my master git repositories, is the center of the star for replicating the live copies of my data, and takes care of downloading and filtering my email) recently died when its motherboard failed. The hard drive was totally intact, but it took about two weeks to get a new computer, and in the meantime I still needed something to perform that computer's functions without losing productivity. When the new computer came, it had a new generation of most of the technologies on the motherboard, switching from x86 to AMD64 and from IDE to SATA. It took me an additional week to borrow an appropriate adapter, and only then could I restore anything I wanted from the old hard drive.

  3. We would like to protect against accidental deletion of files, file corruption, or edits to a file that we have now reconsidered. This can be done with snapshotting. In source code, reconsidering an edit to a file is fairly common, and is the reason why most programming projects use revision control systems. Other options like nilfs or ZFS snapshots can also fill this goal. This goal is accomplished more easily if the backups are automatic and the backup device is live on the system.

    Depending on your needs, this goal may be counterbalanced by a need to not retain the history of files for legal or other reasons, and this should inform your choice of backup strategy.

  4. We would like to protect against filesystem corruption, whether caused by an OS bug or by accidentally running cat /dev/random > /dev/hda. This can be done by having an extra drive of some sort that isn't normally hooked up to the computer. Tape drives, CDs, and DVDs have traditionally fulfilled this purpose, and this is the role for which additional hard drives are now being suggested. Remote backups, via rsync or git, can also accomplish this. When deciding whether to do this remotely or locally, consider the amount of data you're backing up, the size of the incremental changes, the size of the initial upload, and whether you have a one-off way of getting more bandwidth for the initial upload.

  5. We would like to protect against natural disasters. For someone living in New Orleans, it would be nice to have a backup somewhere outside the path of Hurricane Katrina. Remote backups may be pretty much the only way to accomplish this, unless you're a frequent traveler and can hand-deliver backup media to remote locations.

  6. In addition to any of the above, the code you use to create the backup may be buggy, or may become buggy, misconfigured, or obsolete over time. Checking the integrity and restorability of your backups after creating them, and keeping several (independent) previous versions of a backup, at least for a short time, may help here.
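
As a rough illustration of the integrity check in point 6, here is a minimal Python sketch. It assumes the backup is a plain directory copy of the source (as rsync would produce); the function names are my own invention for this example and not part of any backup tool. It hashes every file on both sides and reports what is missing from the backup and what differs.

```python
import hashlib
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each regular file under root (by relative path) to its digest."""
    root = Path(root)
    return {
        str(path.relative_to(root)): file_sha256(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def compare_manifests(source, backup):
    """Return (missing, changed): source paths absent from, or differing in, the backup."""
    missing = sorted(p for p in source if p not in backup)
    changed = sorted(p for p in source if p in backup and source[p] != backup[p])
    return missing, changed
```

You would call it as compare_manifests(build_manifest(src), build_manifest(dst)), where src and dst are whatever directories you are comparing. Matching checksums only show that the copy is bit-for-bit identical; they say nothing about whether you can actually restore from it, which still has to be tested by doing a restore.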

You may not be concerned with the various modes of failure described here occurring simultaneously. For example, it may be unlikely that you need to deal with filesystem corruption at the same time that you regret one of the edits you made to a file. In that case, your offline backup device doesn't need to hold all of your snapshots.
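
If the offline device only needs to hold a few recent snapshots, the pruning decision can be sketched as below. This is only an illustration under my own assumptions: snapshots are identified by their dates, and the function name and the default of three retained snapshots are made up for the example.

```python
from datetime import date


def snapshots_to_delete(snapshot_dates, keep=3):
    """Return the dates to delete, keeping the `keep` most recent snapshots."""
    if keep < 0:
        raise ValueError("keep must be non-negative")
    # Newest first, with duplicate dates collapsed.
    ordered = sorted(set(snapshot_dates), reverse=True)
    # Everything past the first `keep` entries is old enough to prune.
    return sorted(ordered[keep:])
```

The function only decides what to prune; actually deleting anything is left to whatever tool created the snapshots.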

Also, consider the importance of the data you are backing up, and your ability to regenerate it as needed. For example, I use Debian Linux. Pretty much any software I need to restore is available from Debian's mirrors (for free), so there's no need to back up the software I use or the operating system. I can content myself with backing up /etc and /home, knowing that anything else is out there in the cloud because hundreds of other people are using it.
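
Backing up just a couple of directories like this can be sketched with Python's standard tarfile module. This is one possible approach, not a description of my actual setup (which relies on git), and the function name is hypothetical.

```python
import tarfile
from pathlib import Path


def archive_paths(paths, dest):
    """Write a gzip-compressed tar archive containing the given paths."""
    with tarfile.open(dest, "w:gz") as archive:
        for path in paths:
            path = Path(path)
            # Store each tree under its own name (e.g. "etc"), not its absolute path.
            archive.add(path, arcname=path.name)
    return dest
```

Calling archive_paths(["/etc", "/home"], dest) with a destination of your choosing would cover the two directories mentioned above. Reading everything under /etc generally requires root privileges.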

After that, there's stuff that's just not that important. I'm more willing to permanently lose 2GB of photos than the few megabytes that are the core of my Ph.D. thesis research.

And there's also a diary and GPG keys that (though important) I'd rather lose permanently than have anywhere other than my one primary computer.

No backup strategy is perfect. There's a story about how a five-year-old password foiled one company's otherwise immaculate backup scheme.
