
A data backup system of one’s own

One of the biggest solo projects I’ve ever completed is Ef, my custom data backup system. I’ve written about data loss before, but that post glossed over the software I actually use to create backups. Up until a year ago, I used duplicity to back up my data. I still think duplicity is an excellent piece of software. Custom backup software had been on my to-do list for years, but I kept postponing the project because of how good my existing solution was already. I didn’t feel confident that I could produce something just as sophisticated and reliable on my own. But as time passed, I decided that until I dropped third-party solutions and wrote my own backup system, I’d never fully understand the inner workings of this critical piece of my personal infrastructure. So last June, I started writing a design doc, and a month later, I had a brand new data backup system that would replace duplicity for me entirely.

This remains one of the only times I took time off work to complete a side project. Most of my projects are done on nights and weekends. At the beginning, I made consistent progress by implementing just one feature per night. Every night, I’d make sure the build was healthy before going to bed. But it soon became clear that unless I set aside time dedicated to this project, I’d never achieve the uninterrupted focus needed to make Ef as robust and polished as I needed it to be [1]. Throughout the project, I relied heavily on my past experience with filesystems and I/O, as well as information security, concurrent programming, and crash-consistent databases. I also learned a handful of things from the process that I want to share here, in case you’re interested in doing the same thing!

Scoping requirements

I think it’s a bad idea to approach a big software project without first writing down what the requirements are (doubly so for any kind of data management software). I started out with a list of what it means to me for files to be fully restored from a backup. In my case, I settled on a fairly short list:

  • Directory structure and file contents are preserved, of course
  • Original modification times are replicated
  • Unix modes are intact
  • File names are case-sensitive

In particular, I didn’t want Ef to support symbolic links or file system attributes that aren’t listed here, like file owner, special modes, extended attributes, or creation time. Any of these features could easily be added later, but starting with a short list that’s tailored to my personal needs helped fight feature creep.

I also wrote a list of characteristics that I wanted Ef to have. For example, I wanted backups to be easily uploaded to Google Cloud Storage, which is where I keep the rest of my personal data repositories. But I also wanted fast offline restores. I wanted the ability to do dry run backups and the ability to verify the integrity of existing backups. I wanted to support efficient partial restores of individual files or directories, especially when the backup data needed to be downloaded from the cloud. Finally, I wanted fast incremental backups with point-in-time restore and I wanted all data to be securely encrypted and signed at rest.

Picking technologies

Do you want your backups to be accessible in 30 years? That’s a tall order for most backup systems. How many pieces of 30-year-old technology can you still use today? When building Ef, I wanted to use a platform that could withstand decades of abuse from the biting winds of obsolescence. I ended up picking the Go programming language and Google’s Protocol Buffers (disclaimer: I also work here) as the basis of Ef. These are the only two dependencies that could realistically threaten the project’s maintainability into the future, and both are mature, well-supported pieces of software. Compare that to the hot new programming language of the month or experimental new software frameworks, both of which are prone to rapid change or neglect.

I also picked Google Cloud Storage as my exclusive off-site backup destination. Given its importance to Google’s Cloud business, it’s unlikely that GCS will ever change in a way that affects Ef (plus, I sit just a few rows down from the folks that keep GCS online). Even so, it’s also easy enough to add additional remote backends to Ef if I ever need to.

For encryption, I went with GnuPG, which is already at the center of my other data management systems. I opted to use my home-grown GPG wrapper instead of existing Go PGP libraries, because I wanted integration with GPG’s user keychain and native pinentry interfaces. GPG also holds an irreplaceable position in the open source software world and is unlikely to change in an incompatible way.

Naming conventions

Ef backs up everything under a directory. All the backups for a given directory form a series. Each series contains one or more backup chains, which are made up of one full backup set followed by an arbitrary number of incremental backup sets. Each backup set contains one or more records, which are the files and directories that were backed up. For convenience, Ef identifies each backup set with an auto-generated sequence number composed of a literal “ef” followed by the current Unix timestamp. All the files in the backup set are named based on this sequence number.

These naming conventions are fairly inflexible, but this helps reduce ambiguity and simplify the system overall. For example, series names must start with an uppercase letter, and the name “Ef” is blacklisted. This means that sequence numbers and Ef’s backup directory itself cannot be mistaken for backup data sources. Putting all the naming logic in one place helps streamline interpretation of command-line arguments and validation of names found in data, which helps eliminate an entire class of bugs.
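
To make those rules concrete, here is a minimal sketch in Go of how sequence numbers and series names could be generated and validated. Only the “ef” + Unix timestamp format, the uppercase-first-letter rule, and the reserved name “Ef” come from the description above; the exact character set allowed in series names is my own assumption, not Ef’s actual implementation.

```go
// A rough sketch of the naming rules described above; the allowed character
// set for series names is an assumption, not Ef's actual implementation.
package ef

import (
	"fmt"
	"regexp"
	"time"
)

// NewSequence builds a sequence number such as "ef1562025600": a literal
// "ef" followed by the current Unix timestamp.
func NewSequence(now time.Time) string {
	return fmt.Sprintf("ef%d", now.Unix())
}

// seriesRE assumes letters and digits after the mandatory uppercase first letter.
var seriesRE = regexp.MustCompile(`^[A-Z][A-Za-z0-9]*$`)

// ValidSeries reports whether name is usable as a series name. Requiring an
// uppercase first letter and rejecting "Ef" keeps series names from ever
// being confused with sequence numbers or with Ef's own backup directory.
func ValidSeries(name string) bool {
	return name != "Ef" && seriesRE.MatchString(name)
}
```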

On-disk format

Ef creates just two types of backup files: metadata and archives. Each metadata file is a binary encoded protobuf that describes a backup set, its records, and its associated archives. If a record is a non-empty file, then its metadata entry will point to one or more contiguous ranges in the archives that constitute its contents. Individual archive ranges are sometimes compressed, depending on the record’s file type. Each archive is limited to approximately 200MB, so the contents of large records are potentially split among multiple archives. Finally, both the metadata and archives are encrypted and signed with GPG before being written to disk.

Incremental backup sets only contain the records that have changed since their parent backup set. If a file or directory was deleted, then the incremental backup set will contain a special deletion marker record to suppress the parent’s record upon restore. If only a record’s metadata (modification time or Unix mode) changes, then its incremental backup record will contain no archive pointers and a non-zero file size. This signals Ef to look up the file contents in the parent backup set.
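
For illustration, here is a guess at the shape of a record entry, written as plain Go structs. Ef’s real metadata is a binary-encoded protobuf whose schema isn’t shown in this post, so every field name below is hypothetical.

```go
// Hypothetical field layout for a backup record; Ef's real schema is a
// protobuf and almost certainly differs in detail.
package ef

import "time"

// ArchiveRange points at a contiguous span of bytes inside one archive file.
type ArchiveRange struct {
	Archive    string // name of the archive file holding this span
	Offset     int64  // start of the span within the archive
	Length     int64  // number of bytes in the span
	Compressed bool   // whether gzip was applied, based on the record's file type
}

// Record describes one file or directory in a backup set.
type Record struct {
	Path    string
	Mode    uint32    // Unix mode bits
	ModTime time.Time // compared at one-second precision
	Size    int64
	Deleted bool // deletion marker in an incremental backup set

	// A non-empty file with no Ranges but a non-zero Size is a
	// metadata-only change: the contents live in the parent backup set.
	Ranges []ArchiveRange
}
```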

Backup replication is as easy as copying these backup files somewhere else. Initially, backup files are created in a staging area in the user’s home directory. Ef has built-in support for uploading and downloading backup files to/from a Google Cloud Storage bucket. For faster restores, Ef can also extract individual records from a backup set stored on GCS without first writing the backup files to disk.

Metadata cache

In addition to the backup files, Ef maintains a cache of unencrypted backup set metadata in the user’s home directory in order to speed up read-only operations. This cache is a binary encoded protobuf containing all of the metadata from all known backup sets. Luckily, concatenating serialized protobuf messages merges them, and repeated fields merge by concatenation, so new backup sets can simply append their metadata to the metadata cache upon completion. Naturally, the metadata cache can also be rebuilt at any time using the original backup metadata files. This is helpful when bootstrapping brand new hosts and in case the metadata cache becomes corrupted.
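
The appending trick relies on a general property of the protobuf wire format. Ef’s own metadata messages aren’t published here, so this runnable sketch demonstrates the property with structpb.ListValue, which simply happens to have a repeated field; the “ef…” strings are made-up sequence numbers.

```go
// Demonstrates that concatenating serialized protobuf messages merges their
// repeated fields, which is what lets Ef append to its metadata cache.
package main

import (
	"fmt"

	"google.golang.org/protobuf/proto"
	"google.golang.org/protobuf/types/known/structpb"
)

func main() {
	// Pretend these are two cache fragments, each describing one backup set.
	a, _ := structpb.NewList([]interface{}{"ef1561939200"})
	b, _ := structpb.NewList([]interface{}{"ef1562544000"})

	abuf, _ := proto.Marshal(a)
	bbuf, _ := proto.Marshal(b)

	// Unmarshaling the concatenation yields a single message whose repeated
	// field contains both elements, as if they had been marshaled together.
	var merged structpb.ListValue
	if err := proto.Unmarshal(append(abuf, bbuf...), &merged); err != nil {
		panic(err)
	}
	fmt.Println(len(merged.Values)) // 2
}
```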

Archiving

The Ef archiver coordinates the creation of a backup set. Conceptually, the archiver maintains a pipeline of stream processors that transform raw files from the backup source into encrypted, compressed, and checksummed archive files. Ef feeds files and directories into the archiver, one at a time. As the archiver processes these inputs, it also accumulates metadata about the records and the generated archives. This metadata is written to the metadata file once the backup is complete. The archiver also features a dry run mode, which discards the generated archives instead of actually writing them to disk. To support incremental backups, the archiver optionally takes a list of preexisting records, which are generated by collapsing metadata from a chain of backup sets.

The components of the archiver pipeline include a GPG writer, a gzip writer, two accounting stream filters (one for the archive and one for the record range), and the file handle to the output file itself. Although the archiver itself processes files sequentially, Ef achieves some level of pipeline parallelism by running encryption in a separate process (i.e. GPG) from compression and bookkeeping, which are performed within Ef. The archiver’s destination is technically pluggable, but the only existing implementation saves archives to the user’s home directory.
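
As a rough illustration, here is how those stages might be chained together as io.Writers. The newGPGWriter parameter is a stand-in for the author’s GPG wrapper, and the exact placement of the two accounting filters is my guess at one plausible ordering, not necessarily Ef’s.

```go
// A sketch of chaining the archiver stages as io.Writers: record bytes flow
// through a per-record counter, gzip, GPG encryption, and a per-archive
// counter before landing in the output file.
package ef

import (
	"compress/gzip"
	"io"
	"os"
)

// countingWriter is an "accounting" filter: it forwards bytes unchanged and
// remembers how many it has seen.
type countingWriter struct {
	w io.Writer
	n int64
}

func (c *countingWriter) Write(p []byte) (int, error) {
	n, err := c.w.Write(p)
	c.n += int64(n)
	return n, err
}

// buildPipeline wires up one record's path to the archive file. Callers must
// close the returned closers (gzip first, then GPG) to flush the stages.
func buildPipeline(out *os.File, newGPGWriter func(io.Writer) io.WriteCloser) (*countingWriter, []io.Closer) {
	archiveBytes := &countingWriter{w: out} // total bytes written to the archive
	enc := newGPGWriter(archiveBytes)       // encrypt and sign in a gpg child process
	gz := gzip.NewWriter(enc)               // optional, depending on file type
	recordBytes := &countingWriter{w: gz}   // this record's range within the archive
	return recordBytes, []io.Closer{gz, enc}
}
```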

Slow mode

One of my original requirements for Ef was fast incremental backups. It’d be impossible to meet this requirement if an incremental backup needed to read every file in the source directory to detect changes. So, I borrowed an idea from rsync and assumed that if a file’s size and modification time didn’t change, then its contents probably didn’t either. For my own data, this is true in virtually all cases. But I also added a “slow” mode to the archiver that disables this performance enhancement.
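
In code, the fast path amounts to a comparison like the one below. This is only a sketch of the idea; the function name and signature are mine, and the second-granularity timestamp comparison anticipates the “Modification time precision” section later on.

```go
// A sketch of the rsync-style change check: if size and modification time
// (to the second) match the record from the parent chain, assume the
// contents are unchanged. Slow mode simply bypasses the shortcut.
package ef

import (
	"os"
	"time"
)

func probablyUnchanged(prevSize int64, prevMtime time.Time, info os.FileInfo, slow bool) bool {
	if slow {
		return false // slow mode: always reread and re-archive the file
	}
	return info.Size() == prevSize &&
		info.ModTime().Unix() == prevMtime.Unix()
}
```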

Unarchiving

As you might have guessed, the Ef unarchiver coordinates the extraction of files and directories from backups. The unarchiver contains mostly the same pipeline of stream processors as the archiver, but in reverse order. Because the records extracted by the unarchiver can come from multiple backup sets, the unarchiver is fed a list of records instead. All desired records must be provided upfront, so the unarchiver can sort them and ensure directories are processed before the files and subdirectories they contain. The unarchiver also features a dry run mode, which makes the unarchiver verify records instead of extracting them.

The unarchiver takes a pluggable source for archive data. Ef contains an implementation for reading from the user’s home directory and an implementation for reading directly from Google Cloud Storage. Since the unarchiver is only able to verify the integrity of records, Ef contains additional components to verify archives and metadata. Since records and archives are verified separately, a verification request for a full backup set involves two full reads of the archives: once to verify that each record’s data is intact and once to verify that the archives themselves are intact. Verification of an incremental backup set is only slightly faster: record verification will still require reading archives older than the backup set itself, but archive verification will only need to verify the archives generated as part of the incremental backup set. Usually, I run verification once in record-only mode and again in chain archive-only mode, which verifies the archives of all backup sets in a backup chain.

Google Cloud Storage

This isn’t my first Go program that uploads data to Google Cloud Storage, so I had some libraries written already. My asymmetric home internet connection uploads at a terribly slow 6Mbps, so as usual, I needed to pick a concurrency level that efficiently utilizes bandwidth but doesn’t make any one file upload too slowly, to avoid wasted work in case I need to interrupt the upload. In this case, I settled on 4 concurrent uploads, which means that each 200MB archive upload takes about 5 minutes.
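
A concurrency cap like that is easy to express with a buffered channel used as a semaphore. The sketch below is illustrative rather than Ef’s actual upload code; the bucket name and the use of the file’s base name as the object name are made up for the example.

```go
// A sketch of capping GCS uploads at four in flight using a buffered channel
// as a semaphore. The bucket and object names here are illustrative.
package main

import (
	"context"
	"io"
	"log"
	"os"
	"path/filepath"
	"sync"

	"cloud.google.com/go/storage"
)

func uploadAll(ctx context.Context, bucket *storage.BucketHandle, paths []string) {
	sem := make(chan struct{}, 4) // at most 4 concurrent uploads
	var wg sync.WaitGroup
	for _, p := range paths {
		wg.Add(1)
		sem <- struct{}{} // acquire a slot
		go func(p string) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			f, err := os.Open(p)
			if err != nil {
				log.Print(err)
				return
			}
			defer f.Close()
			w := bucket.Object(filepath.Base(p)).NewWriter(ctx)
			if _, err := io.Copy(w, f); err != nil {
				log.Print(err)
				return
			}
			if err := w.Close(); err != nil {
				log.Print(err)
			}
		}(p)
	}
	wg.Wait()
}

func main() {
	ctx := context.Background()
	client, err := storage.NewClient(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()
	uploadAll(ctx, client.Bucket("my-ef-backups"), os.Args[1:])
}
```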

I usually let large uploads run overnight from my headless Intel NUC, which has a wired connection to my router. That way, uploads don’t interfere with my network usage during the day, and I don’t have to leave my laptop open overnight to upload stuff. If you have large uploads to do, consider dedicating a separate machine to it.

GCS supports sending an optional CRC32C checksum for additional integrity on upload. It was easy enough to have Ef compute this and send it along with the request. Theoretically, this would detect memory corruption on whichever GCS server was receiving your upload, but I’ve never seen an upload fail due to a checksum error. I suppose it’s probably more likely for your own computer to fumble the bits while reading your backup archives from disk.
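
Attaching the checksum looks roughly like this with the Go GCS client library. The function is my own illustrative sketch (reading the whole archive into memory just keeps it short), but the CRC32C and SendCRC32C fields on storage.Writer are the library’s real mechanism for this.

```go
// A sketch of sending a CRC32C alongside an upload; GCS rejects the object
// if the checksum it computes server-side doesn't match.
package ef

import (
	"context"
	"hash/crc32"
	"os"

	"cloud.google.com/go/storage"
)

func uploadWithCRC(ctx context.Context, obj *storage.ObjectHandle, path string) error {
	data, err := os.ReadFile(path)
	if err != nil {
		return err
	}
	w := obj.NewWriter(ctx)
	w.CRC32C = crc32.Checksum(data, crc32.MakeTable(crc32.Castagnoli))
	w.SendCRC32C = true // tell the client library to include the checksum
	if _, err := w.Write(data); err != nil {
		w.Close()
		return err
	}
	return w.Close()
}
```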

GPG as a stream processor

Fortunately, I had already finished writing my GPG wrapper library before I started working on Ef. Otherwise, I’m certain I wouldn’t have finished this project on time. Essentially, my GPG library provides Go io.Reader and io.Writer interfaces for the four primary GPG functions: encryption, decryption, signing, and verification. In most cases, callers will invoke two of these functions together (e.g. encryption and signing). The library works by launching a GPG child process with some number of pre-attached input and output pipes, depending on which functions are required. The “enable-special-filenames” flag comes in handy here. With this flag, you can substitute file descriptor numbers in place of file paths in GPG’s command line arguments. Typically, file descriptors 3 through 5 are populated with pipes, along with standard output. But standard input is always connected to the parent process’s standard input, in order to support pinentry-curses.

Both the GPG reader and writer are designed to participate as stream processors in an I/O pipeline. So, the GPG reader requires another io.Reader as input, and the GPG writer must write to another io.Writer. If the GPG invocation requires auxiliary streams, then the GPG library launches goroutines to service them. For example, one goroutine might consume messages from the “status-fd” output in order to listen for successful signature verification. Another goroutine might simply stream data continuously from the source io.Reader to the input file handle of the GPG process.
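
Here is a stripped-down illustration of the plumbing: entries in cmd.ExtraFiles become descriptors 3, 4, … in the child, and with --enable-special-filenames an argument like “-&3” refers to descriptor 3. The author’s real wrapper wires up more streams (such as status-fd) and proper error handling; the flag set and encrypt helper below are just my sketch of the idea.

```go
// A stripped-down sketch of driving gpg through inherited pipes. With
// --enable-special-filenames, "-&3" names file descriptor 3, which is the
// first entry of cmd.ExtraFiles.
package ef

import (
	"io"
	"os"
	"os/exec"
)

// encrypt streams plaintext from src into gpg on fd 3 and writes the
// encrypted, signed output from gpg's stdout into dst.
func encrypt(recipient string, src io.Reader, dst io.Writer) error {
	pr, pw, err := os.Pipe()
	if err != nil {
		return err
	}
	cmd := exec.Command("gpg",
		"--enable-special-filenames",
		"--encrypt", "--sign",
		"--recipient", recipient,
		"--output", "-", // ciphertext on stdout
		"-&3", // plaintext read from fd 3
	)
	cmd.ExtraFiles = []*os.File{pr} // becomes fd 3 in the child
	cmd.Stdout = dst
	cmd.Stderr = os.Stderr
	cmd.Stdin = os.Stdin // keep the terminal attached for pinentry-curses
	if err := cmd.Start(); err != nil {
		return err
	}
	pr.Close() // the parent only needs the write end
	go func() {
		io.Copy(pw, src) // feed the plaintext, then signal EOF
		pw.Close()
	}()
	return cmd.Wait()
}
```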

Modification time precision

Modern filesystems, including ext4 and APFS, support nanosecond precision for file modification times. However, not all tools replicate this precision perfectly [2]. Rather than trying to fix these issues, I decided that Ef would ignore the subsecond component when comparing timestamps for equality. In practice, precision higher than one second is unlikely to matter to me, as long as the subsecond truncation is applied consistently.
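
A tiny illustration of the rule (the sameMtime helper exists only for this example): two timestamps that differ only below the second compare equal.

```go
// Illustrates the comparison rule: ignore everything below the second.
package main

import (
	"fmt"
	"time"
)

func sameMtime(a, b time.Time) bool {
	return a.Unix() == b.Unix() // compare whole seconds only
}

func main() {
	a := time.Date(2019, 7, 1, 12, 0, 0, 123456789, time.UTC)
	b := time.Date(2019, 7, 1, 12, 0, 0, 987654321, time.UTC)
	fmt.Println(sameMtime(a, b)) // true
}
```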

Command line interface

In case it wasn’t obvious, Ef runs entirely from the command line, with no server or user interface component. I’m using Go’s built-in command-line flag parser combined with the Google subcommands library (a minimal wiring sketch follows the list). The subcommands that Ef supports include:

  • cache: Rebuild the metadata cache
  • diff: Print delta from latest backup set
  • history: Print all known versions of a file
  • list: Print all backup sets, organized into chains
  • pull: Download backup files from GCS
  • purge: Delete local copies of old backup chains [3]
  • push: Upload backup files to GCS
  • restore: Extract a backup to the filesystem
  • save: Archive current directory to a new backup set
  • series: Print all known series
  • show: Print details of a specific backup set
  • verify: Verify backup sets
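
As promised above, here is a minimal sketch of how one of these subcommands might be wired up with the github.com/google/subcommands library. The flags and the body of the save command are illustrative guesses, not Ef’s actual interface; only the library’s registration pattern is real.

```go
// An illustrative "save" subcommand wired up with github.com/google/subcommands.
// The flag names and behavior here are guesses, not Ef's actual interface.
package main

import (
	"context"
	"flag"
	"fmt"
	"os"

	"github.com/google/subcommands"
)

type saveCmd struct {
	dryRun bool
	slow   bool
}

func (*saveCmd) Name() string     { return "save" }
func (*saveCmd) Synopsis() string { return "Archive current directory to a new backup set" }
func (*saveCmd) Usage() string    { return "save [-dry_run] [-slow]\n" }

func (c *saveCmd) SetFlags(f *flag.FlagSet) {
	f.BoolVar(&c.dryRun, "dry_run", false, "discard archives instead of writing them")
	f.BoolVar(&c.slow, "slow", false, "reread every file instead of trusting size and mtime")
}

func (c *saveCmd) Execute(ctx context.Context, f *flag.FlagSet, _ ...interface{}) subcommands.ExitStatus {
	wd, _ := os.Getwd()
	fmt.Printf("would back up %q (dry run: %v, slow: %v)\n", wd, c.dryRun, c.slow)
	return subcommands.ExitSuccess
}

func main() {
	subcommands.Register(subcommands.HelpCommand(), "")
	subcommands.Register(&saveCmd{}, "")
	flag.Parse()
	os.Exit(int(subcommands.Execute(context.Background())))
}
```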

Bootstrapping

Once the software was written, I faced a dilemma. Every 6 months, I reinstall macOS on my laptop [4]. When I still used duplicity, I could restore my data by simply installing duplicity and using it to download and extract the latest backup. Now that I had a custom backup system with a custom data format, how would I obtain a copy of the software to restore from backup?

My solution has two parts. Since I build Ef for both Linux and macOS, I can use my Intel NUC to restore data to my laptop and vice versa. It’s unlikely that I’ll ever reinstall both at the same time, so this works for the most part. But in case of an emergency, I stashed an encrypted copy of Ef for Linux on my Google Drive. Occasionally, I’ll rebuild a fresh copy of Ef and reupload it, so my emergency copy keeps up with the latest features and bug fixes. The source code doesn’t change very rapidly, so this actually isn’t much toil.

Comparison

Now that I’ve been using Ef for about 9 months, it’s worth reflecting on how it compares to duplicity, my previous backup system. At a glance, Ef is much faster. The built-in naming conventions make Ef’s command line interface much easier to use without complicated wrapper scripts. Since duplicity uses the boto library to access GCS, duplicity users are forced to use GCS’s XML interoperability API instead of the more natural JSON API. Ef uses the JSON API directly, which makes authentication easier.

I like that I can store backup files in a local staging directory before uploading them to GCS. Duplicity allows you to back up to a local directory, but you’ll need a separate solution to manage transfer to and from remote destinations. This is especially annoying if you’re trying to do a partial restore while managing your own remote destinations. I can’t be sure which backup files duplicity needs, so I usually download more than necessary.

There are a few minor disadvantages to Ef. Unlike duplicity, Ef only stores files in their entirety instead of attempting to calculate deltas between successive versions of the same file with librsync. This makes Ef theoretically less efficient with storage space, but in practice it makes almost no difference for me. Additionally, there are plenty of restrictions in Ef that make it unsuitable for use by more than one person. For example, file ownership is intentionally not recorded in the metadata, and timestamp-based sequence numbers are just assumed to be globally unique.

Overall, I’ve found that Ef fits really well into my suite of data management tools, all of which are now custom pieces of software. I hope this post gave you some ideas about your own data backup infrastructure.

  1. In total, Ef represents about 5000 lines of code, not including some personal software libraries I had already written for other projects. I wrote about 60% of that during the week of Independence Day last year. ↩︎
  2. In particular, rsync and stat on macOS don’t seem to consider nanoseconds at all. ↩︎
  3. There is intentionally no support for purging backup chains from GCS, because the cloud is infinite of course. ↩︎
  4. This helps me enforce good backup hygiene. ↩︎
