For years, I used the very handy rsnapshot script for backing up data. However, rsnapshot cannot handle huge files that change a little over time very well. It seems that rdiff-backup works quite similarly and handles the huge-file scenario better, although accessing older versions is complicated. However, adding archfs may be the solution, so this is my attempt to set up a viable backup solution with rdiff-backup in combination with rdiff-backup-fs.
Backing up data is a very important task in the everyday life of a system administrator. Some key aspects of backup solutions are:
- Reliability: Backups have to be reliable. The best backup system is of no value if it “forgets” to back up data or if the data cannot be read back.
- Ease of Use / Usability: A backup system should be easy to set up, fully automated, simple to maintain and, most importantly, it should make restoring data easy.
- No Proprietary Data Format: If a backup system stores data in a sophisticated, proprietary data format, this will lead to problems in case data has to be restored years later and the backup software is unavailable at that time.
- Differential Backup: To save space, backups should only store the differences between backups.
The above criteria have a huge impact on the success of a backup solution.
One good solution that fulfills them is rsnapshot. It works with Unix hard links and therefore simply stores the backups in the file system, where each tree represents a snapshot taken at a specific point in time:
horn:/backup # tree -L 1
.
|-- daily.0
|-- daily.1
|-- daily.2
|-- daily.3
|-- daily.4
|-- daily.5
|-- daily.6
|-- hourly.0
|-- hourly.1
|-- hourly.2
|-- hourly.3
|-- weekly.0
|-- weekly.1
|-- weekly.2
`-- weekly.3
Thanks to hard links, an unchanged file is stored only once in the file system, so further snapshots of it consume no extra space.
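The hard-link idea can be tried out in a few lines of Python. This is only a minimal sketch, not how rsnapshot itself is implemented (rsnapshot is a Perl script built around rsync); the directory and file names are made up for illustration. It shows that two snapshot directories can each contain an entry for a file while the data exists only once on disk:

```python
import os
import tempfile


def link_unchanged_file(previous_path, new_path):
    """Add a second directory entry for the same data; nothing is copied."""
    os.link(previous_path, new_path)


def same_inode(path_a, path_b):
    """Return True if both names point to the same inode on the same device."""
    a, b = os.stat(path_a), os.stat(path_b)
    return (a.st_dev, a.st_ino) == (b.st_dev, b.st_ino)


def demo():
    """Build two hypothetical snapshot directories sharing one unchanged file."""
    with tempfile.TemporaryDirectory() as tmp:
        older = os.path.join(tmp, "daily.1")
        newer = os.path.join(tmp, "daily.0")
        os.mkdir(older)
        os.mkdir(newer)
        original = os.path.join(older, "report.txt")
        with open(original, "w") as fh:
            fh.write("unchanged content\n")
        linked = os.path.join(newer, "report.txt")
        link_unchanged_file(original, linked)
        # Both snapshots list the file, but the link count shows a single inode.
        return same_inode(original, linked), os.stat(linked).st_nlink


if __name__ == "__main__":
    shared, link_count = demo()
    print("same inode:", shared, "link count:", link_count)
```

As soon as the file changes between two snapshots, the new version needs its own inode and its own full copy of the data, which is exactly where the trouble with huge files starts.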
A problem with this scheme emerges with huge files that change only a little over time. Examples of such files are:
- Database files, e.g. ZODB (Zope Object Database), PostgreSQL, MySQL, also database dumps that result in huge files
- Log files
- Huge documents that are changed regularly
- Outlook Mailboxes
I experienced a scenario where 10 users each had a 2 GB Outlook mailbox, 20 GB in total. Because every mailbox changes between backups, hard links do not help and each snapshot stores a full copy. With the 15 snapshots shown above, this results in up to 300 GB of backup space for the Outlook data of just 10 users.
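The arithmetic behind that number, and a rough idea of what a delta-based scheme could save, can be sketched as follows. The 1% change rate per backup is purely an assumed value for illustration, not a measurement, and the delta estimate ignores compression and metadata overhead:

```python
def snapshot_space_gb(users, mailbox_gb, snapshots):
    """Worst case for hard-link snapshots: every snapshot holds a full copy."""
    return users * mailbox_gb * snapshots


def delta_space_gb(users, mailbox_gb, snapshots, changed_fraction):
    """One full copy plus one delta per older snapshot.

    changed_fraction is a hypothetical share of each mailbox that differs
    between two consecutive backups.
    """
    full_copy = users * mailbox_gb
    return full_copy + (snapshots - 1) * full_copy * changed_fraction


if __name__ == "__main__":
    print(snapshot_space_gb(10, 2, 15))               # 10 * 2 * 15 = 300
    print(round(delta_space_gb(10, 2, 15, 0.01), 1))  # 20 + 14 * 0.2 = 22.8
```

Even if the real change rate is several times higher than assumed here, the gap between the two approaches stays large.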
To circumvent the problem, there are two possible approaches:
- Try to break up huge files into smaller ones: For instance, log files can be rotated on a regular basis, resulting in smaller log files and therefore minimizing the problem. For Outlook, the mailbox could be split into multiple ones (archives), so that the file that changes is kept small. However, this requires the cooperation of the user, which cannot always be counted on.
- Use a different backup strategy for huge files: Huge files could be excluded from the snapshot backup and backed up in a different fashion, e.g. by storing only one backup copy, or by using backup software that supports differential backups of files. The problem with this solution is that it increases the complexity of the overall backup strategy.
A Possible Solution: rdiff-backup
rdiff-backup, on the other hand, is very simple to use. For testing, I created a tree like this:
test/
|-- a
|   `-- test.txt
|-- b
|   `-- test1.txt
`-- c
    `-- test2.txt
A backup can be made simply like this:
mneme:~# mkdir backup
mneme:~# rdiff-backup test/ backup/
Now we can see that the backup command simply copied the original tree to the new directory:
backup
|-- a
|   `-- test.txt
|-- b
|   `-- test1.txt
|-- c
|   `-- test2.txt
`-- rdiff-backup-data
    |-- backup.log
    |-- chars_to_quote
    |-- current_mirror.2011-02-03T16:59:35+01:00.data
    |-- error_log.2011-02-03T16:59:35+01:00.data
    |-- extended_attributes.2011-02-03T16:59:35+01:00.snapshot
    |-- file_statistics.2011-02-03T16:59:35+01:00.data.gz
    |-- increments
    |   |-- a
    |   |-- b
    |   `-- c
    |-- mirror_metadata.2011-02-03T16:59:35+01:00.snapshot.gz
    `-- session_statistics.2011-02-03T16:59:35+01:00.data
In addition to the data itself, rdiff-backup adds its own bookkeeping data, which holds metadata such as modification times, permissions and the like. If we now change a file and back it up again, the tree changes like this:
mneme:~# echo "File changed" >> test/a/test.txt
mneme:~# rdiff-backup test/ backup/
mneme:~# tree backup
backup
|-- a
|   `-- test.txt
|-- b
|   `-- test1.txt
|-- c
|   `-- test2.txt
`-- rdiff-backup-data
    |-- backup.log
    |-- chars_to_quote
    |-- current_mirror.2011-02-03T17:01:29+01:00.data
    |-- error_log.2011-02-03T16:59:35+01:00.data
    |-- error_log.2011-02-03T17:01:29+01:00.data
    |-- extended_attributes.2011-02-03T16:59:35+01:00.snapshot
    |-- extended_attributes.2011-02-03T17:01:29+01:00.snapshot
    |-- file_statistics.2011-02-03T16:59:35+01:00.data.gz
    |-- file_statistics.2011-02-03T17:01:29+01:00.data.gz
    |-- increments
    |   |-- a
    |   |   `-- test.txt.2011-02-03T16:59:35+01:00.diff.gz
    |   |-- a.2011-02-03T16:59:35+01:00.dir
    |   |-- b
    |   `-- c
    |-- increments.2011-02-03T16:59:35+01:00.dir
    |-- mirror_metadata.2011-02-03T16:59:35+01:00.diff.gz
    |-- mirror_metadata.2011-02-03T17:01:29+01:00.snapshot.gz
    |-- session_statistics.2011-02-03T16:59:35+01:00.data
    `-- session_statistics.2011-02-03T17:01:29+01:00.data
One can see that a diff of the changed file has been added under increments. The backup tree still holds the current version of the data. The following command can be used to restore the original tree:
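The principle of keeping the current version in full and storing only a reverse diff for the older one can be illustrated with a small sketch. Note that this is not rdiff-backup's actual delta format or algorithm (rdiff-backup builds on the rsync algorithm); the sketch swaps in difflib.SequenceMatcher from the Python standard library, and the function names are made up for illustration:

```python
from difflib import SequenceMatcher


def make_reverse_delta(new, old):
    """Describe `old` as copy ranges taken from `new` plus literal bytes."""
    delta = []
    matcher = SequenceMatcher(None, new, old, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Identical in both versions: a reference into `new` is enough.
            delta.append(("copy", i1, i2))
        elif j2 > j1:
            # "replace" or "insert": bytes that only the old version has.
            delta.append(("data", old[j1:j2]))
        # "delete": bytes that exist only in `new`; nothing to store.
    return delta


def apply_reverse_delta(new, delta):
    """Rebuild the old version from the current version and the delta."""
    parts = []
    for op in delta:
        if op[0] == "copy":
            parts.append(new[op[1]:op[2]])
        else:
            parts.append(op[1])
    return b"".join(parts)


if __name__ == "__main__":
    old = b"This is a test\n"
    new = old + b"File changed\n"
    delta = make_reverse_delta(new, old)
    print(delta)
    print(apply_reverse_delta(new, delta) == old)
```

The design choice worth noting is the direction of the diff: the newest version is always available as a plain file, and only the way back to older versions has to be computed.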
mneme:~# mkdir test.old
mneme:~# rdiff-backup -r 1D backup/ test.old/
mneme:~# cat test.old/a/test.txt
This is a test
mneme:~#
Although this is great, restoring older files is not as simple as with rsnapshot, where one can easily copy data from an older snapshot tree.
To solve this problem, a package called rdiff-backup-fs was created. It uses the FUSE (Filesystem in Userspace) library and can present a tree similar to rsnapshot, where every backup is displayed as a single read-only tree. At the time of this writing, the package is relatively new, so it is unclear how stable this filesystem is. Unfortunately, I was unable to compile the package, as there seem to be problems in the configure script and self-references in the header files. This solution therefore probably has to be postponed until the package is out of beta stage.
The following is a simple comparison between the two backup systems:
Criterion | rsnapshot | rdiff-backup |
Speed | Fast, but rsync creates high CPU loads | Slower than rsnapshot |
Size | Optimized through hard links, but increasing size for huge files that change often | Very good by using compressed deltas |
Programming Language | Perl | Python |
Restore | Very simple, each snapshot is in a separate directory tree | Simple for the current backup (in a directory tree), must use tools to restore older files, rdiff-backup-fs not yet ready for production use |
Metadata | Stored on the file itself in the snapshot | Stored in a separate container |
Data Transfer | rsync, ssh | rsync algorithm (via the librsync library), ssh |
File Format | Plain file, simple copy operation | Plain file for current backup, compressed deltas for older versions |
Deletion of older Backups | Simple by deleting a snapshot tree | Only possible for backups older than a specific date, impossible to delete deltas in-between |
I personally would very much favour a combination of these two backup solutions, e.g. by using rsnapshot for every file except those matching specific criteria, like a name pattern, file size and so on. These excepted files would then be backed up in rdiff-backup style, while the snapshot trees would still look like those of rsnapshot. Combined with a stable filesystem on top, like rdiff-backup-fs, restoring would be very easy, too.
As something like this is currently not available, my personal conclusion is to use a hybrid solution:
- Use rsnapshot wherever possible, as it is easier to restore from and, due to the simple copy operation without compressed deltas, probably less error-prone.
- For huge files that change often, try to put them into a separate directory tree, exclude this tree from the rsnapshot backup and use rdiff-backup instead.
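To find out which files are candidates for the separate rdiff-backup tree, a small helper can sort a directory tree by file size. This is only a sketch under my own assumptions: the function name, the example file names and the threshold are hypothetical, and size alone says nothing about how often a file changes, so the resulting list still needs a human look:

```python
import os
import tempfile


def split_by_size(root, threshold_bytes):
    """Return (small, huge) lists of paths relative to root."""
    small, huge = [], []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if os.path.islink(full):
                continue  # skip symlinks, they hold no data of their own
            rel = os.path.relpath(full, root)
            if os.path.getsize(full) >= threshold_bytes:
                huge.append(rel)
            else:
                small.append(rel)
    return sorted(small), sorted(huge)


if __name__ == "__main__":
    # Tiny self-contained demo with a 1 KiB threshold.
    with tempfile.TemporaryDirectory() as root:
        with open(os.path.join(root, "notes.txt"), "wb") as fh:
            fh.write(b"small file")
        with open(os.path.join(root, "mailbox.pst"), "wb") as fh:
            fh.write(b"x" * 4096)
        small, huge = split_by_size(root, 1024)
        print("rsnapshot candidates:", small)
        print("rdiff-backup candidates:", huge)
```

The files in the second list would then be moved to their own directory tree, which is excluded from the rsnapshot run and handed to rdiff-backup.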