Undeletable data storage

I buy stuff when I’m bored. Sometimes, I buy random computer parts to spark project ideas. This usually works really well, but when I bought a 2TB external hard drive the week before thanksgiving, I wasn’t really sure what to do with it. At the time, I was in college, and two terabytes was way more storage than I had in my laptop, but I knew that hard drives weren’t the most reliable, so I wasn’t about to put anything important on it. There just wasn’t that much I could do with only one hard drive. I ended up writing a multithreaded file checksum program designed to max out the hard drive’s sequential throughput, and that was that. Years later, after having written a few custom file backup and archival tools, I decided it’d be nice to have an “air gapped” copy of my data on an external hard drive. Since an external hard drive could be disconnected most of the time, it would be protected from accidents, software bugs, hacking, lightning, etc. Furthermore, it didn’t matter that I only had a single hard drive. The data would be an exact replica of backups and archives that I had already stored elsewhere. So, I gave it a try. Once a month, I plugged in my external hard drive to sync a month’s worth of new data to it. But after a few years of doing this, I decided that I had had enough of bending down and fumbling with USB cables. It was time for a better solution.

Essentially, I wanted the safety guarantees of an air gapped backup without the hassle of physically connecting or disconnecting anything. I wanted it to use hard drives (preferably the 3.5” kind), so that I could increase my storage capacity inexpensively. The primary use case would be to store backup archives produced by Ef, my custom file backup software. I wanted it to be easily integrated directly into my existing custom backup software as a generic storage backend, unlike the rsync bash scripts I was using to sync my external hard drive. There were lots of sleek-looking home NAS units on Amazon, but they came bundled with a bunch of third-party software that I can’t easily overwrite or customize. Enterprise server hardware would have given me the most flexibility, but it tends to be noisy and too big to fit behind my TV where the rest of my equipment is. I also considered getting a USB hard drive enclosure with multiple bays and attaching it to a normal computer or Raspberry Pi, but that would make it awkward to add capacity, and the computer itself would be a single point of failure.

Luckily, I ended up finding a perfect fit in the ODROID-HC2 ($54), which is a single board computer with an integrated SATA interface and 3.5” drive bay. Each unit comes with a gigabit ethernet port, a USB port, microSD card slot, and not much else, not even a display adapter. The computer itself is nothing special, given how many ARM-based mini computers exist today, but the inclusion of a 3.5” drive bay was something I couldn’t find on any other product. This computer was made specifically to be a network storage server. It not only allowed a hard drive to be attached with no additional adapters, but also physically secured the drive in place with the included mounting screws and its sturdy aluminum enclosure. So, I bought two units and got started designing and coding a new network storage server. Going with my usual convention of single-letter codenames, I named my new storage service “D”. In case you’re interested in doing the same thing, I’d like to share some of the things I learned along the way.

Undeletable

On D, data can’t be deleted or overwritten. I used external hard drives for years because their data could only be modified while they were actively plugged in. In order to achieve the same guarantees with a network-connected storage device, I needed to isolate it from the rest of my digital self. This meant that I couldn’t even give myself SSH login access. So, D is designed to run fully autonomously and never require software updates, both for the D server and for the operating system. That way, I could confidently enforce certain guarantees in software, such as the inability to delete or overwrite data.

This complete isolation turned out to be more difficult than I expected. I decided that D should consist of a dead simple server component and a complex client library. By putting the majority of the code in the client, I could apply most kinds of bug fixes without needing to touch the server. There are also some practical problems with running a server without login access. I ended up adding basic system management capabilities (like rebooting and controlling hard drive spindown) into the D server itself. The two HC2 units took several weeks to ship to my apartment, so I had plenty of time to prototype the software on a Raspberry Pi. I worked out most of the bugs on the prototype and a few more bugs on the HC2 device itself. Each time I built a new vesion of the D server, I had to flash the server’s operating system image to the microSD card with the updated software¹. It’s now been more than two months since I deployed my two HC2 units, and I haven’t needed to update their software even once.

On-disk format

The D protocol allows reading, writing, listing, and stating of files (”chunks”) up to 8MiB in size. Each chunk needs to be written in its entirety, and chunks can’t be overwritten or deleted after they’re initially created². The D server allows chunks to be divided into alphanumeric directories, whose names are chosen by the client. A chunk’s directory and file name are restricted in both character set and length in order to avoid any potential problems with filesystem limits. My existing backup and archival programs produce files that are hundreds of megabytes or gigabytes in size, so the D client is responsible for splitting up these files into chunks before storing them on a D server. On the D server, the hard drive is mounted as /d_server, and all chunks are stored relative to that directory. There’s also some additional systemd prerequisites that prevent the D server from starting if the hard drive fails to mount.

For each D file, the D client stores a single metadata chunk and an arbitrary number of data chunks. Metadata chunks are named based on the D file’s name in such a way that allows prefix matching without needing to read the metadata of every D file, while still following the naming rules. Specifically, any special characters are first replaced with underscores, and then the file’s crc32 checksum is appended. For example, a file named Ef/L/ef1567972423.pb.gpg might store its metadata chunk as index/Ef_L_ef1567972423.pb.gpg-8212e470. Data chunks are named based on the file’s crc32, the chunk index (starting with 0), and the chunk’s crc32. For example, the first data chunk of the previously mentioned file might be stored as 82/8212e470-00000000-610c3877.

Wire protocol

The D client and server communicate using a custom protocol built on top of TCP. This protocol was designed to achieve high read/write throughput, possibly requiring multiple connections, and to provide encryption in transit. It’s similar to TLS, with the notable exception that it uses pre-shared keys instead of certificates. Since the D server can’t be updated, certificate management would just be an unnecessary hassle. Besides, all the files I planned to store on D would already be encrypted. In fact, I originally designed the D protocol with only integrity checking and no encryption. I later added encryption after realizing that in no other context would I transfer my (already encrypted) backups and archives over a network connection without additional encryption in transit. I didn’t want D to be the weakest link, so I added encryption even if it was technically unnecessary.

The D protocol follows a request-response model. Each message is type-length-value (TLV) encoded alongside a fixed-length encryption nonce. The message payload uses authenticated encryption, which ensures integrity in transit. Additionally, the encryption nonce is used as AEAD associated data and, once established, needs to be incremented and reflected on every request-response message. This prevents replay attacks, assuming sufficient randomness in choosing the initial nonce. The message itself contains a variable-length message header, which includes the method (e.g. Read, Write, NotifyError), a name (usually the chunk name), a timestamp, and the message payload’s crc32 checksum. The message header is encoded using a custom serialization format that adds type checking, length information, and checksums for each field. Finally, everything following the message header is the message payload, which is passed as raw bytes directly up to either the D client or server.

File metadata

Each D file has a single metadata chunk that describes its contents. The metadata chunk contains a binary-encoded protobuf wrapped in a custom serialization format (the same one that’s used for the message header). The protobuf includes the file’s original name, length, crc32 checksum, sha256 hash, and a list of chunk lengths and chunk crc32 checksums. I store the file’s sha256 hash in order to more easily integrate with my other backup and archival software, most of which already uses sha256 hashes to ensure file integrity. On the other hand, the file’s crc32 checksum is used as a component of chunk names, which makes its presence essential for finding the data chunks. Additionally, crc32 checksums can be combined, which is a very useful property that I’ll discuss in the next section.

Integrity checking at every step

Since D chunks can’t be deleted or overwritten, mistakes are permanent. Therefore, D computes and verifies crc32 checksums at every opportunity, so corruption is detected and blocked before it is committed to disk. For example, during a D chunk write RPC, the chunk’s crc32 checksum is included in the RPC’s message header. After the D server writes this chunk to disk, it reads the chunk back from disk³ to verify that the checksum matches. If a mismatch is found, the D server deletes the chunk and returns an error to the client. This mechanism is designed to detect hardware corruption on the D server’s processor or memory. In response, the D client will retry the write, which should succeed normally.

The D client also has includes built-in integrity checking. The D client’s file writer requires that its callers provide the crc32 checksum of the entire file before writing can begin. This is essential not only because chunk names are based on the file’s crc32 checksum, but also because the D client writer keeps a cumulative crc32 checksum of the data chunks it writes. This cumulative checksum is computed using the special crc32 combine algorithm, which takes as input, two crc32 checksums and the length of the second checksum’s input. If the cumulative crc32 checksum does not match the provided crc32 checksum at the time the file writer is closed, then the writer will refuse to write the metadata chunk and instead return an error. It’s critical that the metadata chunk is the last chunk to be written because until the file metadata is committed to the index, the file is considered incomplete.

Because the D client’s file writer requires a crc32 checksum upfront, most callers end up reading an uploaded file twice—once to compute the checksum and once to actually upload the file. The file writer’s checksum requirements are designed to detect hardware corruption on the D client’s processor or memory. This kind of corruption is recoverable. For example, if data inside a single chunk is corrupted before it reaches the D client, then the file writer can simply reupload it with its corrected checksum. Since the chunk checksum is included in the chunk name, the corrupt chunk does not need to be overwritten. Instead, the corrupt chunk will just exist indefinitely as garbage on the D server. Additionally, the D server ignores writes where the chunk name and payload exactly match an exsting chunk. This makes the file writer idempotent.

The D client’s file reader also maintains a cumulative crc32 checksum while reading chunks using the same crc32 combine algorithm used in the file writer. If there is a mismatch, then it will return an error upon close. The file reader can optionally check the sha256 checksum as well, but this behavior is disabled by default because of its CPU cost.

List and stat

In addition to reading and writing, the D server supports two auxiliary file RPCs: List and Stat. The List RPC returns a null-separated list of file names in a given directory. The D server only allows chunks to be stored one directory deep, so this RPC can be used as a way to iteratively list every chunk. The response of the List RPC is also constrained by the 8MiB message size limit, but this should be sufficient for hundreds of thousands of D files and hundreds of terabytes of data chunks, assuming a uniform distribution using the default 2-letter directory sharding scheme and a predominate 8MiB data chunk size. The D client library uses the List RPC to implement a Match method, which returns a list of file metadata protobufs, optionally filtered by an file name prefix. The Match method reads multiple metadata chunks in parallel to achieve a speed of thousands of D files per second on a low latency link.

The Stat RPC returns the length and crc32 checksum of a chunk. Despite its name, the Stat method is rarely ever used. Note that stating a D file is done using the Match method, which reads the file’s metadata chunk. In contrast, the Stat RPC is used to stat a D chunk. Its main purpose is to allow the efficient implementation of integrity scrubbing without needing to actually transfer the chunk data over the network. The D server reads the chunk from disk, computes the checksum, and returns only the checksum to the D client for verification.

Command line interface

The D client is primarily meant to be used as a client library, but I also wrote a command line tool that exposes most of the library’s functionality for my own debugging and maintenance purposes. The CLI is named dt (D tool), and it can upload and download files, match files based on prefix, and even replicate files from one D server to another. This replication is done chunk-by-chunk, which skips some of the overhead cost associated with using the D client’s file reader and file writer. Additionally, the CLI can check for on-disk corruption by performing integrity scrubs, either using the aforementioned Stat RPC or by slowly reading each chunk with the Read RPC. It can also find abandoned chunks, which could result from aborted uploads or chunk corruption. Finally, the CLI also performs some system management functions using special RPCs. This includes rebooting or shutting down the D server, spinning down its disk, and checking its free disk space.

Filesystem

The D server uses a GUID Partition Table and an ext4 filesystem on the data disk with a few custom mkfs flags I found after a few rounds of experimentation. I picked the “largefile” ext4 preset, since the disk mostly consists of 8MiB data chunks. Next, I lowered the reserved block percentage from 5% to 1%, since the disk isn’t a boot disk and serves only a single purpose. Finally, I disabled lazy initialization of the inode table and filesystem journal, since this initialization takes a long time on large disks and I wanted the disk to spin down as quickly as possible after initial installation.

Installation and bootstrapping

The D server is packaged for distribution as a Debian package, alongside a systemd service unit and some installation scripts. The microSD card containing the D server’s operating system is prepared using a custom bootstrapper program that I had originally built for my Raspberry Pi security cameras. It was fairly straightforward to extend it to support the HC2 as well. The bootstrapper first installs the vendor-provided operating system image (Ubuntu Linux in the case of the HC2) and then makes a number of customizations. For a D server, this includes disabling the SSH daemon, disabling automatic updates, configuring fstab to include the data disk, installing the D server debian package, formatting and mounting the data disk, setting the hostname, and disabling IPv6 privacy addresses. Some of these bootstrapping steps can be implemented as modifications to the operating system’s root filesystem, but other steps need to be performed upon first boot on the HC2 itself. For these steps, the bootstrapper creates a custom /etc/rc.local script containing all of the required commands.

Once the microSD card is prepared, I transfer it to the HC2, and the rest of the installation is autonomous. A few minutes later, I can connect to the HC2 using dt to confirm its health and available disk space. The D server also contains a client for my custom IPv6-only DDNS service, so I can connect to the D server using the hostname I chose during bootstrapping without any additional configuration⁴.

Encryption and decryption throughput

Before my HC2 units came in the mail, I tested the D server on an Intel NUC and a Raspberry Pi. I had noticed that the Raspberry Pi could only achieve around 8MB/s of throughput, but I blamed this on some combination of the Pi 3 Model B’s 100Mbps ethernet port and the fact that I was using an old USB flash drive as the chunk storage device. So when the HC2 units finally arrived, I was disappointed to discover that the HC2 too could only achieve around 20MB/s of D write throughput, even in highly parallelized tests. I eventually traced the problem to the in-transit encryption I used in the D wire protocol.

I had originally chosen AES-128-GCM for the D protocol’s encryption in transit. This worked perfectly fine on my Intel NUC, since Intel processors have had hardware accelerated AES encryption and decryption for a really long time. On the other hand, the HC2’s ARM cores were a few generations too old to support ARM’s AES-NI instructions, so the HC2 was doing all of its encryption and decryption using software implementations instead. I created a synthetic benchmark from the D protocol’s message encoding and decoding subroutines to test my theory. While my Intel-based prototype achieved speeds of more than a gigabyte per second, the HC2 could only achieve around 5MB/s in single-core and 20MB/s in multi-core tests. Naturally, this was a major concern. Both the hard drive and my gigabit ethernet home network should have supported speeds in excess of 100MB/s, so I couldn’t accept encryption being such severe bottleneck. I ended up switching from AES to ChaCha20-Poly1305, which as supposedly promoted specifically as a solution for smartphone ARM processors lacking AES acceleration. This change was far more effective than I expected. With the new encryption algorithm, I achieved write speeds of more than 60MB/s on the HC2, which is good enough for now.

Connection pools and retrys

The D client is designed to be robust against temporary network failures. All of the D server’s RPCs are idempotent, so basically all of the D client’s methods include automatic retrys with exponential backoff. This includes read-only operations like Match and one-time operations like committing a file’s metadata chunk. For performance, the D client maintains a pool of TCP connections to the D server. The D protocol only allows one active RPC per connection, so the pool creates and discards connections based on demand. The file reader and matcher use multiple threads to fetch multiple chunks in parallel. However, the file writer can only upload a single chunk at a time, so callers are expected to upload multiple files simultaneously in order to achieve maximum throughput. D client operations don’t currently support timeouts, but there is a broadly applied one minute socket timeout to avoid operations getting stuck forever on hung connections. In practice, connections are rarely interrupted since all of the transfers happen within my apartment’s LAN.

Hard drive and accessories

I bought two Seagate BarraCuda 8TB 5400RPM hard drives from two different online stores to put in my two HC2 units. These were around $155 each, which seemed to be the best deal on 3.5” hard drives in terms of bytes per dollar. I don’t mind the lower rotational speed, since they’re destined for backup and archival storage anyway. I also picked up two microSD cards and two 12V barrel power supplies, which are plugged into two different surge protectors. Overall, I tried to minimize the likelihood of a correlated failure between my two HC2 units, but since all of my D files would strictly be a second or third copy of data I had already stored elsewhere, I didn’t bother being too specific with this.

Pluggable storage backends

To make it easier to integrate D into my existing backup and archival programs, I defined a generic interface in Go named “rfile” for reading, writing, and listing files on a storage backend. I ported over my existing remote storage implementation based on Google Cloud Storage buckets, and I wrote brand new implementations for the local filesystem as well as the D client. Rfile users typically pick a storage backend using a URL. The scheme, host, and path are used to pick a storage particular storage backend and initialize it with parameters, such as the GCS bucket name, D server address, and an optional path prefix. For example, d://d1/Ef/ and file:///tmp/Ef/ might be storage backends used by Ef. I also wrote a generic class to mirror files from one storage backend to one or more other backends. Ef uses this mechanism to upload backups from its local filesystem store (using the file backend) to both GCS (using the GCS backend) and two D servers (using the D backend). I can also use this mechanism to replicate Ef backups from one remote backend to another. Luckily, Google Cloud Storage supports the same Castagnoli variant of crc32 that D uses, so this checksum can be used to ensure integrity during transfers between different backends.

Security isolation

Like I mentioned earlier, I can’t log in to my HC2 units since they don’t run SSH daemons. I’ve also configured my router’s firewall rules to explicitly drop all traffic destined to my D servers other than ICMPv6 and TCP on the D server’s port number⁵. So, the attack surface for the D servers is essentially just the kernel’s networking stack and the D server code itself. My router also ensures that only computers on my LAN are allowed to connect via the D server protocol, and the D protocol itself ensures maximum network distance by tightly limiting the maximum round trip time allowed while establishing a connection.

The D server is written in Go and is simple enough to mostly fit in a single file, so I’m not particularly concerned about hidden security vulnerabilities. Even if such a vulnerability existed, the chunk data is mostly useless without the application-layer encryption keys used to encrypt the backups and archives stored within them. Plus, any attacker would first need a foothold on my apartment LAN to even connect to the D servers. Admittedly, this last factor is probably not that hard given all of the random devices I have plugged in to my network.

Conclusion

Currently, I’ve integrated D into two of my existing backup and archival programs. They’ve stored a total of 900GiB on each of my two HC2 servers. It’ll be a long time before I come close to filling up the 8TB hard drives, but when the time comes, my options are to either buy bigger hard drives or to get a third HC2 unit. Let’s assume that I still want every file stored in D to be replicated on at least 2 D servers. If I get a third HC2 unit, then load balancing becomes awkward. Since D chunks can’t be deleted, my optimal strategy would be to store new files with one replica on the new empty D server and the other replica on either of the two existing D servers. Assuming a 50/50 split between the two existing D servers, this scheme would fill up the new empty D server twice as quickly as the two full ones. Maybe in the future, I’ll create additional software to automatically allocate files to different D servers depending on their fullness while also trying to colocate files belonging to a logical group (such as archives of the same Ef backup set).

Well, I eventually gave in and just added a flag to the bootstrapper that enabled SSH access, but I only used this during development. ↩︎
The O_EXCL flag to open can be used to guarantee this without races. ↩︎
Well, from the buffer cache probably. Not actually from disk. ↩︎
To be honest, the DDNS client is probably an unnecessary security risk, but it’s really convenient and I don’t have a good replacement right now. ↩︎
This probably broke the DDNS client, but my apartment’s delegated IPv6 prefix hasn’t changed in years, so I’ll just wait until the D servers stop resolving and then lift the firewall restriction. ↩︎

Post a Comment