My laptop’s hard drive crashed in 2012. I was on campus walking by Evans Hall when I took my recently purchased ThinkPad X230 out of my backpack to look up a map (I didn’t have a smartphone), only to realize it wouldn’t boot. This wasn’t a disaster by any means. It set me back $200 to rush-order a new 256GB Crucial M4 SSD. But since I regularly backed up my data to an old desktop running at my parents’ house, I was able to restore almost everything once I received it1.
I never figured out why my almost-new laptop’s hard drive stopped working out of the blue. The drive still spun up, yet the system didn’t detect it. But whether it was the connector or the circuit board, that isn’t the point. Hardware fails all the time for no reason2, and you should be prepared for when it happens.
Data management has changed a lot in the last ten years, primarily driven by the growing popularity of SaaS (“cloud”) storage and greatly improved network capacity. But one thing that hasn’t changed is that most people are still unprepared for hardware failure when it comes to their personal data. Humans start manufacturing data from the moment they’re born. Kids should really be taught data husbandry, just like they’re taught about taxes and college admissions and health stuff. But anyway, here are a few things I’ve learned about managing data that I want to share:
Identify what’s important
Data management doesn’t work if you don’t know what you’re managing. In other words, what data would make you sad if you lost access to it? Every day, your computer handles massive amounts of garbage data: website assets, Netflix videos, application logs, PDFs of academic research, etc. There’s also the data that you produce, but don’t intend to keep long-term: dash cam and surveillance footage (it’s too big), your computer settings (it’s easy to re-create), or your phone’s location history (it’s too much of a hassle to extract).
For most people, important data is the data that’s irreplaceable. It’s your photos, your notes and documents, your email, your tax forms, and (if you’re a programmer) your enormous collection of personal source code.
Consider the threats
It’s impossible to predict every possible bad thing that could happen to your data. But fortunately, you don’t have to! You can safely ignore all the potential data disasters that are significantly less likely to occur than your own untimely death3. That leaves behind a few possibilities, roughly in order of decreasing likelihood:
- Hardware failure
- Malicious data loss (somebody deletes your shit)
- Accidental data loss (you delete your shit)
- Data breach (somebody leaks your shit)
- Undetected data degradation
Hardware failures are the easiest to understand. Hard drives (external hard drives included), solid state drives, USB thumb drives, and memory cards all have an approximate “lifespan”, after which they tend to fail catastrophically4. The rule of thumb is 3 years for external hard drives, 5 years for internal hard drives, and perhaps 10 years for enterprise-grade hard drives.
Malicious data loss has become much more common these days, with the rise of a digital extortion scheme known as “ransomware”. Ransomware encrypts user files on an infected machine, usually using public-key cryptography in at least one of the steps. The encryption is designed so that the infected computer can encrypt files easily, but is unable to reverse the encryption without the attacker’s cooperation (which is usually offered in exchange for a fee). Fortunately, ransomware is easily detectable, because the infected computer prompts you for money once the data loss is complete.
On the other hand, accidental data loss can occur without anybody noticing. If you’ve ever accidentally overwritten or deleted a file, you’ve experienced accidental data loss. Because it can take months or years before accidental data loss is noticed, simple backups are sometimes ineffective against it.
Data breaches are a unique kind of data loss, because they don’t necessarily mean you’ve lost access to the data yourself. Some kinds of data (passwords, tax documents, government identification cards) lose their value when they become available to attackers. So, your data management strategy should also identify whether some of your data is confidential.
Undetected data degradation (or “bit rot”) occurs when your data becomes corrupted (either by software bugs or by forces of nature) without you noticing. Modern disk controllers and file systems can provide some defense against bit rot (for example, in the case of bad sectors on a hard disk). But the possibility remains, and any good backup strategy needs a way to detect errors in the data (and also to fix them).
Things you can’t back up
Backups and redundancy are generally the solutions to data loss. But you should be aware that there are some things you simply can’t back up. For example:
- Data you interact with, but can’t export. For example, your comments on social media would be difficult to back up.
- Data that’s useless (or less useful) outside of the context of a SaaS application. For example, you can export your Google Docs as PDFs or Microsoft Word files, but then they’re no longer Google Docs.
Redundancy vs backup
Redundancy is buying 2 external hard drives, then saving your data to both. If either hard drive experiences a mechanical failure, you’ll still have a 2nd copy. But this isn’t a backup.
If you mistakenly overwrite or delete an important file on one hard drive, you’ll probably do the same on the other hard drive. In a sense, backups require the extra dimension of time. There needs to be either a time delay in when your data propagates to the backup copy, or better yet, your backup needs to maintain multiple versions of your data over time.
RAID and erasure coding both offer redundancy, but neither counts as a backup.
Backups vs archives
Backups are easier if you have less data. You can create archives of old data (simple ZIP archives will do) and back them up separately from your “live” data. Archives make your daily backups faster and also make it easier to perform data scrubbing.
When you’re archiving data, you should pick an archive format that will still be readable in 30 to 50 years. Proprietary and non-standard archive tools might fall out of popularity and become totally unusable in just 10 or 15 years.
One way to protect against bit rot is to periodically check your data against known-good checksums. For example, if you store cryptographic checksums with your files (and also digitally sign the checksums), you can verify the checksums at any time and detect bit rot. Make sure you have redundant copies of your data, so that you can restore corrupted files if you detect errors.
I generate SHA1 checksums for my archives and sign the checksums with my GPG key.
If your backup solution is 2 copies on the same hard drive, or 2 hard drives in the same computer, or 2 computers in the same house, then you’re consolidating your failure domain. If your computer experiences an electrical fire or your house burns down, then you’ve just lost all copies of your data.
Onsite vs offsite backups
Most people keep all their data within a 20 meter radius of their primary desktop computer. If all of your backups are onsite (e.g. in your home), then a physical disaster could eliminate all of the copies. The solution is to use offsite backups, either by using cloud storage (easy) or by stashing your backups at a friend’s house (pain in the SaaS).
Online vs offline backups
If a malicious attacker gains access to your system, they can delete your data. But they can also delete any cloud backups5 and external hard drive backups that are accessible from your computer. It’s sometimes useful to keep backups of your data that aren’t immediately deletable, either because they’re powered off (like an unplugged external hard drive) or because they’re read-only media (like data backups on Blu-ray Discs).
You can reduce your risk of data leaks by applying encryption to your data. Good encryption schemes are automatic (you shouldn’t need to encrypt each file manually) and thoroughly audited by the infosec community. And while you’re at it, you should make use of your operating system’s full disk encryption capabilities (FileVault on macOS, BitLocker on Windows, and LUKS or whatever on Linux).
Encrypting your backups also means that you could lose access to them if you lose your encryption credentials. So, make sure you understand how to recover your encryption credentials, even if your computer is destroyed.
Online account security
If you’re considering cloud backups, you should also take steps to strengthen the security of your account:
- Use a long password, and don’t re-use a password you’ve used on a different website.
- Consider using a passphrase (a regular English sentence containing at least 4-5 uncommon words). Don’t use similar passphrases across multiple services (like “my facebook password”), because an attacker with access to the plaintext can easily guess the scheme.
- Turn on two-factor authentication. The most common 2FA scheme (TOTP) requires you to type in a 6-8 digit code whenever you log in. You should prefer to use a mobile app (I recommend Authy) to generate the code, rather than to receive the code via SMS. Don’t forget to generate backup codes and store them in a physically secure top-secret location (e.g. underneath the kitchen sink).
- If you’re asked to set security questions, don’t use real answers (they’re too easy to guess). Make up gibberish answers and write them down somewhere (preferably a password manager).
- If your account password can be recovered via email, make sure your email account is also secure.
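For the curious, the TOTP codes mentioned above aren’t magic: they’re an HMAC-SHA1 over the number of 30-second intervals since the Unix epoch, truncated to a few digits, as specified in RFC 6238. A minimal sketch (the function signature is my own; real authenticator apps also handle base32 secrets and clock skew):

```python
import hashlib
import hmac
import struct
import time

def totp(secret: bytes, for_time=None, digits: int = 6, period: int = 30) -> str:
    """Generate a TOTP code per RFC 6238: HMAC-SHA1 over the count of
    `period`-second intervals since the epoch, dynamically truncated."""
    if for_time is None:
        for_time = time.time()
    counter = struct.pack(">Q", int(for_time) // period)
    digest = hmac.new(secret, counter, hashlib.sha1).digest()
    # Dynamic truncation: the low nibble of the last byte picks an offset.
    offset = digest[-1] & 0x0F
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)

# RFC 6238's test vector: the ASCII secret "12345678901234567890" at
# time 59 yields the 8-digit code "94287082".
```

Because the code depends only on the shared secret and the clock, your phone can generate it offline, which is exactly why app-based TOTP beats SMS delivery.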
Capacity vs throughput
One major disadvantage of cloud backups is that transfer speeds are limited by your home internet connection, especially for large uploads. Backups are less useful when they take days or weeks to restore, so be aware of how your backup throughput affects your data management strategy.
This problem also applies to high-capacity microSD cards and hard drives. It can take several days to fully read or write a 10TB data archival hard drive. Sometimes, smaller but faster solid state drives are well worth the investment.
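Some back-of-envelope arithmetic makes the gap vivid (the throughput figures below are illustrative assumptions, not measurements of any particular drive or ISP):

```python
def transfer_hours(size_bytes: float, rate_bytes_per_sec: float) -> float:
    """Hours needed to move `size_bytes` at a sustained transfer rate."""
    return size_bytes / rate_bytes_per_sec / 3600

TB = 10 ** 12

# Reading a full 10TB drive at a sustained ~150 MB/s (optimistic for
# sequential I/O, far too optimistic for small-file workloads):
local = transfer_hours(10 * TB, 150 * 10 ** 6)      # ~18.5 hours

# Uploading the same 10TB over a 20 Mbit/s home uplink:
upload = transfer_hours(10 * TB, 20 * 10 ** 6 / 8)  # ~1111 hours (~46 days)
```

In other words, a full cloud restore can be slower than your drive by an order of magnitude or two, which is worth knowing before disaster strikes.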
File system features
Most people think of backups as “copies of their files”. But the precise definition of a “file” has evolved along with computers themselves. File systems have become very complex to meet the increasing demands of modern applications. But the truth remains that most programs (and most users) don’t care about most of those features.
For most people, “files” refers to (1) the directory-file tree and (2) the bytes contained in each file. Some people also care about file modification times. If you’re a computer programmer, you probably care about file permission bits (perhaps just the executable bit) and maybe symbolic links.
But consider this (non-exhaustive) list of filesystem features, and whether you think they need to be part of your data backups:
- Capitalization of file and directory names
- File owner (uid/gid) and permission bits, including SUID and sticky bits
- File ACLs, especially in an enterprise environment
- File access time, modification time, and creation time
- Extended attributes (web quarantine, Finder.app tags, “hidden”, and “locked”)
- Resource forks, on macOS computers
- Non-regular files (sockets, pipes, character/block devices)
- Hard links (also “aliases” or “junctions”)
- Executable capabilities (maybe just CAP_NET_BIND_SERVICE?)
If your answer is no, no, no, no, no, what?, no, no, and no, then great! The majority of cloud storage tools will work just fine for you. But the unfortunate truth is that most computer programmers are completely unaware of many of these file system features. So, they write software that completely ignores them.
Programs and settings
Programs and settings are often left out of backup schemes. Most people don’t have a problem reconfiguring their computer once in a while, because catastrophic failures are unlikely. If you’re interested in creating backups of your programs, consider finding a package manager for your preferred operating system. Computer settings can usually be backed up with a combination of group policy magic for Windows and config files or /usr/bin/defaults for macOS.
If you’re backing up data for an application that uses a database or a complex file-system hierarchy, then you might be better served by a backup system that’s designed specifically for that application. For example, RogerHub runs on a PostgreSQL database, which comes with its own backup tools. But RogerHub uses an application-specific backup scheme that’s tailored to its particular needs.
A backup isn’t a backup until you’ve tested the restoration process.
If you’ve just skipped to the end to read my recommendations, fantastic! You’re in great company. Here’s what I suggest for most people:
- Use cloud services instead of files, to whatever extent you feel comfortable with. It’s most likely not worth your time to back up email or photos, since you could use Google Inbox or Google Photos instead.
- Create backups of your files regularly, using the 3-2-1 rule: 3 copies of your data, on 2 different types of media, with at least 1 offsite backup. For example, keep your data on your computer. Then, back it up to an online cloud storage or cloud backup service. Finally, back up your data periodically to an external hard drive.
- Don’t trust physical hardware. It doesn’t matter how much you paid for it. It doesn’t matter if it’s brand new or if you got the most advanced model. Hardware breaks all the time in the most unpredictable ways.
- Don’t buy an external hard drive or a NAS as your primary backup destination. They’re probably no more reliable than your own computer.
- Make sure to use full-disk encryption and encrypted backups.
- Make sure nobody can maliciously (or accidentally) delete all of your backups, simply by compromising your primary computer.
- Consider making archives of data that you use infrequently and no longer intend to modify.
- Secure your online accounts (see section titled “Online account security”)
- Pat yourself on the back and take a break once in a while. Data management is hard stuff!
If you find any mistakes on this page, let me know. I want to keep it somewhat updated.
- My laptop contained the only copy of my finished yet unsubmitted class project. But technically I had a project partner. We didn’t actually work together on projects. We both finished each project independently, then just picked one version to submit. ↩︎
- About four and a half years later, that M4 stopped working and I ordered an MX300 to replace it. ↩︎
- That is, unless you’re interested in leaving behind a postmortem legacy. ↩︎
- There are other modes of failure other than total catastrophic failure. ↩︎
- Technically, most reputable cloud storage companies will keep your data for some time even after you delete it. If you really wanted to, you could explain the situation to your cloud provider, and they’ll probably be able to recover your cloud backups. ↩︎