Tragedy and loss

Among my favorite anime, there are a few that I never ever recommend to others. It’s not the weird ones. It’s not even just the ecchi. I’m talking about shows that are so precious to me that I’d probably get upset if my recommendation just fell flat. But can you really blame them? Every work of art deserves at least your undivided attention, and yet, I’m guilty too. I’ll gladly watch seasonal shows while also playing Mario Kart or looking at my phone. In a pinch, I’ll even watch them on my phone. There are so many new shows airing every week, and a lot of them are mediocre and uninspired. How do you expect anyone to keep up while also staying focused through all of them? But this seasonal grind just makes it even more special when a show goes above and beyond, grabs your attention, and forces you to watch even as you’re fighting sleep deprivation. That’s why it’s so frustrating. It’s easy to forget that every single show is the product of thousands of hours of hard work by hundreds of talented artists. People develop such deep emotional connections with their favorite anime, and when someone else just casually watches without the profound respect and appreciation commensurate with this level of accomplishment, it feels a little rude.

This May, I was fortunate enough to visit Uji, Kyoto during Golden Week, which is Japan’s annual week of national holidays. Aside from its well-known green tea, Uji is the home of Kyoto Animation. It also happens to be the inspiration for the setting of Hibike! Euphonium, one of my absolute favorite anime series. Of course, things in real life never look as good as they do in anime, but I was fascinated by the prospect of visiting. The town is centered around the Uji River, which flows south to north toward Kyoto. The Keihan Uji railway line follows this river as it passes by many of the stores, train stations, parks, and playgrounds featured in the show, as well as the studios of KyoAni themselves.

Uji-bashi bridge in Uji, Japan.

The staff at KyoAni are masters in nuanced storytelling and creating memorable settings with incredibly detailed background art. From its depiction in the show, Uji already felt like a familiar place to me before I had ever been there. I knew all the bridges, with their unique wood and stone construction. I knew the ins and outs of Rokujizo train station, with its enigmatic pair of pedestrian crossing buttons. I even knew the specific bench along the river bank where our main character likes to sit on the way home after school. These things are all a lot farther away from each other in reality than the show suggests, which is how I ended up accidentally walking 11km to complete my tour. But I enjoyed every moment of it. Unlike in other tourist destinations, I felt a personal connection to everything in Uji from the utility poles to the tiles on the ground. Eupho isn’t about heroes or prophecies. It’s a story about regular people living in a regular place in which nothing particularly unusual happens. Yet, KyoAni’s adaptation made these places special, not by exaggerating their natural qualities, but by framing them around unforgettable emotional scenes. In this light, even the smallest mundane things become precious. You can tell that the people of Uji really love their city, and now millions of fans across the world do too.

In fact, I encountered several other fans doing the exact same pilgrimage as me. They were all fairly easy to spot. In one case, I saw the same guy multiple times throughout the day, at the KyoAni store and near multiple bridges and shrines from the show. We talked a bit in broken Japanese and English. I also saw various people doing peculiar things, like photographing an empty bench. Or getting off the train, walking to a nearby bakery, inspecting the sign on the door, discovering it was closed for the holidays, and then heading back to the station. I’m sure they spotted me as well, as I left a train station to photograph a pedestrian crossing and then promptly headed back onto the train.

As the sun sank lower in the sky, I made my way up to the Daikichiyama observation deck to watch the sunset. On the way, I passed by two Shinto shrines, both of which appeared in the show and both of which were decorated with promotional materials about the Chikai no Finale sequel movie that was currently screening in local theaters (which I did watch that evening). The secluded mountain location and the iconic square gazebo made the deck feel like an actual sacred site for fans. I met another westerner at the top of the hike who was an exchange student doing physics research in Osaka. We chatted about the impact KyoAni had on our lives as we admired a familiar view of the city below.

The view from Daikichiyama observation deck.

For me, Eupho came a few months before I was scheduled to graduate from college, at a time when I was anxious about the future and about turning my hobby into a career. I was uncertain whether I’d continue to enjoy computer programming once it became work, and I questioned whether I was already stuck just mindlessly doing the one and only thing I was good at. This show could not have come at a better time for me. I watched as the main character, Kumiko, half-heartedly joins her school concert band, playing the same instrument she played throughout middle school, and as she gradually comes to understand the motivation of her peers in order to develop her own sense of purpose. With such a classic premise, I was entirely unprepared for the emotional impact this series would have. The scene in which Kumiko runs across Uji-bashi Bridge feeling a mix of frustration and renewed understanding brought me to tears. KyoAni brings together the perfect combination of incredible character animation, breathtaking background art, flawless compositing, compelling voice acting, and phenomenal writing to tell incredibly relatable and heartbreaking stories. There are no shortcuts and no cheap tricks here. KyoAni’s dedication to their craft is well known, and they deserve every bit of praise they get for it.

It’s hard to express just how beloved KyoAni is in the anime community. They aren’t yet a household name like Studio Ghibli, but they’re hardly just another anime studio. Their moe shows in the late 2000s influenced art styles throughout the entire industry, and their new releases are consistently ranked among the most anticipated series every season among fans. Bookstores in Kyoto are especially proud of their local animation shop. I saw KyoAni art books, character goods, and DVDs prominently displayed front and center in every bookstore I went to. This blog post has only been about my experience with one KyoAni show, but there are so many more I count among my favorites: Violet Evergarden, K-On!, Tamako Market, Dragon Maid, Kyoukai no Kanata, Amaburi, Haruhi, and Clannad. Plus, the studio’s first standalone feature-length film, Koe no Katachi, is my favorite movie. I’m so thankful KyoAni and their works exist in our world, and I’m glad to have been born in a time and place where people can so readily experience them.

Kumiko's bench in Uji, Japan (approximately 34.889612, 135.808309).

Given their massive influence, it’s easy to overlook how small a company Kyoto Animation really is, both in terms of people and revenue. I was surprised to see their iconic yellow studios so discreetly embedded in quiet suburban neighborhoods. Japan is one of the safest countries in the world, and Uji is the epicenter of so much love and goodwill from fans, so I just don’t understand. I don’t understand why I’m hearing what I’m hearing. This can’t possibly be the same Uji I loved visiting so much. It just doesn’t make any sense. It doesn’t. KyoAni is well known for caring so much about their employees, opting to pay salaries and to hire in-house in betweeners. I can’t imagine how they must feel right now. I can’t fathom the extent of what’s been lost in one senseless tragedy. I can’t understand how we’ll get by if this disappears from the world. Please, take care of yourselves.

A data backup system of one’s own

One of the biggest solo projects I’ve ever completed is Ef, my custom data backup system. I’ve written about data loss before, but that post glossed over the software I actually use to create backups. Up until a year ago, I used duplicity to back up my data. I still think duplicity is an excellent piece of software. Custom backup software had been on my to-do list for years, but I kept postponing the project because of how good my existing solution was already. I didn’t feel confident that I could produce something just as sophisticated and reliable on my own. But as time passed, I decided that until I dropped third-party solutions and wrote my own backup system, I’d never fully understand the inner workings of this critical piece of my personal infrastructure. So last June, I started writing a design doc, and a month later, I had a brand new data backup system that would replace duplicity for me entirely.

This remains one of the only times I took time off of work to complete a side project. Most of my projects are done on nights and weekends. At the beginning, I made consistent progress by implementing just one feature per night. Every night, I’d make sure the build was healthy before going to bed. But it soon became clear that unless I set aside time dedicated to this project, I’d never achieve the uninterrupted focus to make Ef as robust and polished as I needed it to be1. Throughout the project, I relied heavily on my past experience with filesystems and I/O, as well as information security, concurrent programming, and crash-consistent databases. I also learned a handful of things from the process that I want to share here, in case you’re interested in doing the same thing!

Scoping requirements

I think it’s a bad idea to approach a big software project without first writing down what the requirements are (doubly so for any kind of data management software). I started out with a list of what it means to me for files to be fully restored from a backup. In my case, I settled on a fairly short list:

  • Directory structure and file contents are preserved, of course
  • Original modification times are replicated
  • Unix modes are intact
  • Case-sensitive

In particular, I didn’t want Ef to support symbolic links or file system attributes that aren’t listed here, like file owner, special modes, extended attributes, or creation time. Any of these features could easily be added later, but starting with a short list that’s tailored to my personal needs helped fight feature creep.

I also wrote a list of characteristics that I wanted Ef to have. For example, I wanted backups to be easily uploaded to Google Cloud Storage, which is where I keep the rest of my personal data repositories. But I also wanted fast offline restores. I wanted the ability to do dry run backups and the ability to verify the integrity of existing backups. I wanted to support efficient partial restores of individual files or directories, especially when the backup data needed to be downloaded from the cloud. Finally, I wanted fast incremental backups with point-in-time restore and I wanted all data to be securely encrypted and signed at rest.

Picking technologies

Do you want your backups to be accessible in 30 years? That’s a tall order for most backup systems. How many pieces of 30-year-old technology can you still use today? When building Ef, I wanted to use a platform that could withstand decades of abuse from the biting winds of obsolescence. I ended up picking the Go programming language and Google’s Protocol Buffers (disclaimer: I also work here) as the basis of Ef. These are the only two dependencies that could realistically threaten the project’s maintainability into the future, and both are mature, well-supported pieces of software. Compare that to the hot new programming language of the month or experimental new software frameworks, both of which are prone to rapid change or neglect.

I also picked Google Cloud Storage as my exclusive off-site backup destination. Given its importance to Google’s Cloud business, it’s unlikely that GCS will ever change in a way that affects Ef (plus, I sit just a few rows down from the folks that keep GCS online). Even so, it’s also easy enough to add additional remote backends to Ef if I ever need to.

For encryption, I went with GnuPG, which is already at the center of my other data management systems. I opted to use my homegrown GPG wrapper instead of existing Go PGP libraries, because I wanted integration with GPG’s user keychain and native pinentry interfaces. GPG also holds an irreplaceable position in the open source software world and is unlikely to change in an incompatible way.

Naming conventions

Ef backs up everything under a directory. All the backups for a given directory form a series. Each series contains one or more backup chains, which are made up of one full backup set followed by an arbitrary number of incremental backup sets. Each backup set contains one or more records, which are the files and directories that were backed up. For convenience, Ef identifies each backup set with an auto-generated sequence number composed of a literal “ef” followed by the current unix timestamp. All the files in the backup set are named based on this sequence number.

These naming conventions are fairly inflexible, but this helps reduce ambiguity and simplify the system overall. For example, series names must start with an uppercase letter, and the name “Ef” is blacklisted. This means that sequence numbers and Ef’s backup directory itself cannot be mistaken for backup data sources. Putting all the naming logic in one place helps streamline interpretation of command-line arguments and validation of names found in data, which helps eliminate an entire class of bugs.

On-disk format

Ef creates just two types of backup files: metadata and archives. Each metadata file is a binary encoded protobuf that describes a backup set, its records, and its associated archives. If a record is a non-empty file, then its metadata entry will point to one or more contiguous ranges in the archives that constitute its contents. Individual archive ranges are sometimes compressed, depending on the record’s file type. Each archive is limited to approximately 200MB, so the contents of large records are potentially split among multiple archives. Finally, both the metadata and archives are encrypted and signed with GPG before being written to disk.

Incremental backup sets only contain the records that have changed since their parent backup set. If a file or directory was deleted, then the incremental backup set will contain a special deletion marker record to suppress the parent’s record upon restore. If only a record’s metadata (modification time or Unix mode) changes, then its incremental backup record will contain no archive pointers and a non-zero file size. This signals to Ef to look up the file contents in the parent backup set.

Backup replication is as easy as copying these backup files to somewhere else. Initially, backup files are created in a staging area in the user’s home directory. Ef has built-in support for uploading and downloading backup files to/from a Google Cloud Storage bucket. For faster restores, Ef can also extract individual records from a backup set stored on GCS without first writing the backup files to disk.

Metadata cache

In addition to the backup files, Ef maintains a cache of unencrypted backup set metadata in the user’s home directory in order to speed up read-only operations. This cache is a binary encoded protobuf containing all of the metadata from all known backup sets. Luckily, concatenating binary protobufs is a valid way to merge their repeated fields, so new backup sets can simply append their metadata to the metadata cache upon completion. Naturally, the metadata cache can also be rebuilt at any time using the original backup metadata files. This is helpful when bootstrapping brand new hosts and in case the metadata cache becomes corrupted.

Archiver

The Ef archiver coordinates the creation of a backup set. Conceptually, the archiver maintains a pipeline of stream processors that transform raw files from the backup source into encrypted, compressed, and checksummed archive files. Ef feeds files and directories into the archiver, one at a time. As the archiver processes these inputs, it also accumulates metadata about the records and the generated archives. This metadata is written to the metadata file once the backup is complete. The archiver also features a dry run mode, which discards the generated archives instead of actually writing them to disk. To support incremental backups, the archiver optionally takes a list of preexisting records, which are generated by collapsing metadata from a chain of backup sets.

The components of the archiver pipeline include a GPG writer, a gzip writer, two accounting stream filters (for the archive and for the record range), and the file handle to the output file itself. Although the archiver itself processes files sequentially, Ef achieves some level of pipeline parallelism by running encryption in a separate process (i.e. GPG) from compression and bookkeeping, which are performed within Ef. The archiver’s destination is technically pluggable. But the only existing implementation just saves archives to the user’s home directory.

Slow mode

One of my original requirements for Ef was fast incremental backups. It’d be impossible to meet this requirement if an incremental backup needed to read every file in the source directory to detect changes. So, I borrowed an idea from rsync and assumed that if a file’s size and modification time didn’t change, then its contents probably didn’t either. For my own data, this is true in virtually all cases. But I also added a “slow” mode to the archiver that disables this performance enhancement.

Unarchiver

As you guessed, the Ef unarchiver coordinates the extraction of files and directories from backups. The unarchiver contains mostly the same pipeline of stream processors as the archiver, but in reverse order. Because the records extracted by the unarchiver can come from multiple backup sets, the unarchiver is fed a list of records instead. All desired records must be provided upfront, so the unarchiver can sort them and ensure directories are processed before the files and subdirectories they contain. The unarchiver also features a dry run mode, which makes the unarchiver verify records instead of extracting them.

The unarchiver takes a pluggable source for archive data. Ef contains an implementation for reading from the user’s home directory and an implementation for reading directly from Google Cloud Storage. Since the unarchiver is only able to verify the integrity of records, Ef contains additional components to verify archives and metadata. Since records and archives are verified separately, a verification request for a full backup set involves two full reads of the archives: once to verify that each record’s data is intact and once to verify that the archives themselves are intact. Verification of an incremental backup set is only slightly faster: record verification will still require reading archives older than the backup set itself, but archive verification will only need to verify the archives generated as part of the incremental backup set. Usually, I run verification once in record-only mode and again in chain archive-only mode, which verifies the archives of all backup sets in a backup chain.

Google Cloud Storage

This isn’t my first Go program that uploads data to Google Cloud Storage, so I had some libraries written already. My asymmetric home internet connection uploads at a terribly slow 6Mbps, so as usual, I need to pick a concurrency level that efficiently utilizes bandwidth but doesn’t make any one file upload too slowly, to avoid wasted work in case I need to interrupt the upload. In this case, I settled on 4 concurrent uploads, which means that each 200MB archive upload takes about 5 minutes.

I usually let large uploads run overnight from my headless Intel NUC, which has a wired connection to my router. That way, uploads don’t interfere with my network usage during the day, and I don’t have to leave my laptop open overnight to upload stuff. If you have large uploads to do, consider dedicating a separate machine to it.

GCS supports sending an optional CRC32C for additional integrity on upload. It was easy enough to have Ef compute this and send it along with the request. Theoretically, this would detect memory corruption on whichever GCS server was receiving your upload, but I’ve never seen an upload fail due to a checksum error. I suppose it’s probably more likely for your own computer to fumble the bits while reading your backup archives from disk.
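One gotcha worth noting: GCS wants CRC32C (the Castagnoli polynomial), not the IEEE polynomial that Go’s hash/crc32 uses by default. Computing it is a one-liner:

```go
package main

import (
	"fmt"
	"hash/crc32"
)

// GCS expects CRC32C (Castagnoli), not the default IEEE CRC32.
var castagnoli = crc32.MakeTable(crc32.Castagnoli)

func archiveChecksum(data []byte) uint32 {
	return crc32.Checksum(data, castagnoli)
}

func main() {
	// The standard CRC-32C check value for "123456789" is e3069283.
	fmt.Printf("%08x\n", archiveChecksum([]byte("123456789")))
}
```

The GCS Go client library can then attach this checksum to the upload request so the server rejects corrupted writes.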

GPG as a stream processor

Fortunately, I had already finished writing my GPG wrapper library before I started working on Ef. Otherwise, I’m certain I wouldn’t have finished this project on time. Essentially, my GPG library provides a Go io.Reader and io.Writer interface for the 4 primary GPG functions: encryption, decryption, signing, and verification. In most cases, callers will invoke two of these functions together (e.g. encryption and signing). The library works by launching a GPG child process with some number of pre-attached input and output pipes, depending on which functions are required. The “enable-special-filenames” flag comes in handy here. With this flag, you can substitute file descriptor numbers in place of file paths in GPG’s command line arguments. Typically, file descriptors 3 to 5 are populated with pipes, along with standard output. But standard input is always connected to the parent process’s standard input, in order to support pinentry-curses.

Both the GPG reader and writer are designed to participate as stream processors in an I/O pipeline. So, the GPG reader requires another io.Reader as input, and the GPG writer must write to another io.Writer. If the GPG invocation requires auxiliary streams, then the GPG library launches goroutines to service them. For example, one goroutine might consume messages from the “status-fd” output in order to listen for successful signature verification. Another goroutine might simply stream data continuously from the source io.Reader to the input file handle of the GPG process.

Modification time precision

Modern filesystems, including ext4 and APFS, support nanosecond precision for file modification times. However, not all tools replicate this precision perfectly2. Rather than trying to fix these issues, I decided that Ef would ignore the subsecond component when comparing timestamps for equality. In practice, precision higher than one second is unlikely to matter to me, as long as the subsecond truncation is applied consistently.

Command line interface

In case it wasn’t obvious, Ef runs entirely from the command line with no server or user interface component. I’m using Go’s built-in command line flag parser combined with the Google subcommands library. The subcommands that Ef supports include:

  • cache: Rebuild the metadata cache
  • diff: Print delta from latest backup set
  • history: Print all known versions of a file
  • list: Print all backup sets, organized into chains
  • pull: Download backup files from GCS
  • purge: Delete local copy of old backup chains3
  • push: Upload backup files to GCS
  • restore: Extract a backup to the filesystem
  • save: Archive current directory to a new backup set
  • series: Print all known series
  • show: Print details of a specific backup set
  • verify: Verify backup sets
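Ef’s actual dispatch goes through the google/subcommands library, but the overall shape is roughly this stdlib-only sketch (the command bodies here are placeholders):

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// commands maps each subcommand name to its implementation. The real
// program registers these with the google/subcommands library, which
// also supplies per-command flag sets and help text.
var commands = map[string]func(args []string) error{
	"list": func(args []string) error {
		fmt.Println("ef1561000000 (full)") // placeholder output
		return nil
	},
	"save": func(args []string) error {
		fmt.Println("archiving current directory...")
		return nil
	},
	// ... cache, diff, history, pull, purge, push, restore, series, show, verify
}

func main() {
	flag.Parse()
	name := flag.Arg(0)
	if name == "" {
		fmt.Println("usage: ef <subcommand>")
		return
	}
	cmd, ok := commands[name]
	if !ok {
		fmt.Fprintln(os.Stderr, "unknown subcommand:", name)
		os.Exit(2)
	}
	if err := cmd(flag.Args()[1:]); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```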

Bootstrapping

Once the software was written, I faced a dilemma. Every 6 months, I reinstall macOS on my laptop4. When I still used duplicity, I could restore my data by simply installing duplicity and using it to download and extract the latest backup. Now that I had a custom backup system with a custom data format, how would I obtain a copy of the software to restore from backup?

My solution has two parts. Since I build Ef for both Linux and macOS, I can use my Intel NUC to restore data to my laptop and vice versa. It’s unlikely that I’ll ever reinstall both at the same time, so this works for the most part. But in case of an emergency, I stashed an encrypted copy of Ef for Linux on my Google Drive. Occasionally, I’ll rebuild a fresh copy of Ef and reupload it, so my emergency copy keeps up with the latest features and bug fixes. The source code doesn’t change very rapidly, so this actually isn’t much toil.

Retrospective

Now that I’ve been using Ef for about 9 months, it’s worth reflecting on how it compares to duplicity, my previous backup system. At a glance, Ef is much faster. The built-in naming conventions make Ef’s command line interface much easier to use without complicated wrapper scripts. Since duplicity uses the boto library to access GCS, duplicity users are forced to use GCS’s XML interoperability API instead of the more natural JSON API. This is no longer the case with Ef, which makes authentication easier.

I like that I can store backup files in a local staging directory before uploading them to GCS. Duplicity allows you to back up to a local directory, but you’ll need a separate solution to manage transfer to and from remote destinations. This is especially annoying if you’re trying to do a partial restore while managing your own remote destinations. I can’t be sure which backup files duplicity needs, so I usually download more than necessary.

There are a few minor disadvantages with Ef. Unlike duplicity, Ef only stores files in their entirety instead of attempting to calculate deltas between subsequent versions of the same file with librsync. This makes Ef theoretically less efficient with storage space, but this makes almost no difference for me in practice. Additionally, there are plenty of restrictions in Ef that make it unsuitable to be used by more than one person. For example, file ownership is intentionally not recorded in the metadata, and timestamp-based sequence numbers are just assumed to be globally unique.

Overall, I’ve found that Ef fits really well into my suite of data management tools, all of which are now custom pieces of software. I hope this post gave you some ideas about your own data backup infrastructure.

  1. In total, Ef represents about 5000 lines of code, not including some personal software libraries I had already written for other projects. I wrote about 60% of that during the week of Independence Day last year. ↩︎
  2. In particular, rsync and stat on macOS don’t seem to consider nanoseconds at all. ↩︎
  3. There is intentionally no support for purging backup chains from GCS, because the cloud is infinite of course. ↩︎
  4. This helps me enforce good backup hygiene. ↩︎

Catching up

Hey reader. Happy new year? It has been eight months since I last posted anything on this blog. Is that the longest hiatus I’ve ever taken? Maybe. Probably not. I enjoyed writing those longer posts about computers and generally being a massive edgelord, but it also takes me literally weeks to finish writing a single one of those. That’s time that I just haven’t been able to set aside recently. But I’m back now. I don’t really have anything in particular that I want to write about, so I guess I’ll just go over a few things that happened since the last time we spoke.

Side hustle

Traffic grew last year. I suppose that means the human population is still increasing. It usually does. I’m not sure what more to say about that, except that I’m glad my website infrastructure is built to easily take care of it. In the weeks leading up to Christmas last year, my website was getting upward of a hundred requests per second, for a total of more than 53 million requests that month (although, that includes all the static assets as well). A big part of that growth has come from Canada, which now makes up about one-third of my traffic in the month of January. I guess they take their final exams later?

In any case, I also added new features to the calculator, like the “Advanced Mode”, which answers like 80% of the questions I’ve been asked in comments over the last few years. I redesigned some parts of the user interface and adopted a new responsive ad format from Adsense1. On mobile, responsive ads can spill into the left and right margins to take up the full width of the screen. The creatives for responsive ads tend to be higher quality and higher paying (as I’ve unscientifically measured). But sometimes when the ad units fall back to fixed-size creatives, they get left aligned, and then the whole “full width” thing just looks silly.

Most of my work has actually been behind the scenes. I’ve switched to binary encoded protobufs for the backup system, which greatly improved performance and storage space efficiency. I’ve fixed dozens of little bugs. I’ve implemented and migrated to a new object relational mapping system based off the ideas in this post. I scheduled a new background task that does peer-to-peer data integrity checks between replicas (this one had actually been implemented for a while, but the original design consumed too much cross-region network bandwidth to justify the cost). More recently, I’ve been working on automating more of the provisioning and seeding of new replicas to eliminate some of the toil involved with maintaining them.

I’m satisfied with the progress I’ve made over these last few months, but it’s been a constant struggle to prioritize any of this work. After I get home from my regular job, I’m usually tired of doing computer stuff. Even on the weekends, I tend to focus more on shorter projects, like my personal data management infrastructure and my chrome extensions. Last July, I took a whole week off in order to work on Ef, a custom data backup program I made to replace duplicity (I should write more about this in a separate post), because I knew I’d never have enough time to finish it if I didn’t make it a priority. I’m thankful that my schedule’s flexible enough to take time off whenever I need to, but it’s hard to imagine website stuff demanding the same level of urgency.

Changing teams

After two years, I’ve transferred to a different team at Google. I work on Google Cloud Platform now, which is amusing, because I’ve been using Compute Engine to host this website since before I even graduated from college (I’ve written about this before). So, transferring to the Cloud org made a lot of sense. As an insider, I’ve gained even more appreciation for all the engineering that goes into producing a reliable product like Compute Engine. Plus, being both a customer and an engineer offers advantages on both sides. For example, I’m automatically my own 24/7 tech support, and my experience as a customer means I’m more familiar with the different product offerings than most of my peers. I also sit just a few rows down from the folks that run a lot of the systems I rely on for my personal data management infrastructure, which gives me way more confidence in the system’s reliability (perhaps unreasonably so).

As part of the transfer, I traveled to Los Angeles and New York last fall — my first business trip (shudders). In the past, I declined every opportunity for work-related travel since I never felt like the expensive plane tickets and hotels and the inconvenience of working exclusively on a laptop were justified just to meet people that I could already chat with online anyway. But I don’t regret those two trips, especially my two week trip to New York. In that time, I saw four of my high school friends, and I spent more time being a tourist on nights and weekends than I did when I actually visited as a tourist in 2009 (please don’t click). I also met, for the first time, a bunch of the New York based engineers that I had been working with online for several months (and whom I had referred to only by their usernames). Living in Manhattan for ten days also cleared up a lot of misconceptions I had about what the city was like. Now I know that it’s unlikely I’ll ever want to live there permanently.

After coming back from New York, I don’t think I ever got over the jet lag. I’ve been waking up and going to work more than an hour earlier than before. On the whole, this has been great. There’s less traffic early in the morning. There’s also more parking, and this new 8-to-4 schedule means I can leave work while the sun is still shining (yes, even in the winter). Since my new team is in a different office building, I’ve also started driving to work instead of taking the FREE CORPORATE SHUTTLE. It’s nice to have more control over my own schedule, and the whole trip only takes about 10 minutes from parking spot to parking spot. On the downside, the mileage on my grocery hauler has been increasing noticeably faster.

Fake news

I worked at my student newspaper for a few years (although, not as a writer), so I’m somewhat familiar with the art of saying things that are technically true, but totally misleading. There’s all kinds of tricks to it. Like, when you can’t be bothered to look up real statistics, so you say some (i.e. at least one) or many (i.e. more than one). Or when you correctly explain that Wi-Fi could cause cancer, even if there’s no real reason to believe it actually does. Or when people state two unrelated facts that, when presented together, suggest conclusions that are actually totally wrong. And how about that thing where people try to define compound words by what they sound like they should mean? That’s more annoying than misleading, I suppose.

That’s why I’m really glad the CrashCourse folks are publishing a weekly “Navigating Digital Information” series on YouTube. I’m convinced that communicating truth requires not only facts, but also honest intentions. In the last few months, I’ve seen so many of these slimy techniques used in reporting about technology news. Most of the time, it isn’t even intentional. I think a lot of tech journalists are (understandably) not engineers, but just enthusiastic consumers. Since news tends to oversimplify things, a lot of the nuance and context of the software development process tends to get lost. And that would be totally fine, if all journalists were only interested in conveying a fair and honest approximation of the truth. It’s too easy to just dismiss all reporting as clickbait. In reality, lots of honest journalists still want their reporting to be newsworthy, and so nobody really feels bad about crafting facts into a story, even when there really isn’t one. I think a lot of controversy over software bugs and design proposals tends to be unjustified. But until our culture decides that it’s actually unethical to report this way, I’m forced to close this as won’t fix.


It’s actually been more than a year since I started using Patreon (as a patron, not a creator). But it was only a few months ago that I really started ramping up how much I paid into it. The creators that I follow are a mix of musicians, YouTube channels2 (like CrashCourse), and computer programmers that I think are doing great work. The free market doesn’t always cooperate with people trying to make a career out of their creativity. But that doesn’t mean their work isn’t worth supporting. Of course, Patreon isn’t just donations either. There’s lots of creators that put out really great exclusive content for patrons, and I think contributing directly until they figure out a real path to profitability (or maybe just indefinitely — that’s okay too) goes a long way to supporting more good stuff in the world.

My strategy is to start with the lowest tier for anyone that I’m even a little interested in. If they publish something I really like, I’ll make sure to bump that up. I think it’s one of the few feedback mechanisms that isn’t influenced by things like advertiser friendliness. So if you’ve got disposable income and you’re interested at all in voting with your dollar, I’d encourage you to give it a try.


Just kidding. I stopped reading books almost entirely over the last few months, aside from mandatory book club reading. Instead, I’ve been watching lots and lots of Chinese cartoons. I think I spend about 5% of my waking hours watching anime, and another few hours reviewing memes or listening to podcasts afterward. In a typical season, I’ll usually follow between 10 and 15 weekly airing shows. But many of my favorites from the last few months have actually been older stuff: Mob Psycho 100, SHIROBAKO, and The Tatami Galaxy (and its superb movie spin-off).

Also in the last few months, I went to a con for the first time. It was Crunchyroll Expo, which for the last 2 years, has taken place down the street from where I live. I’ve always been frustrated with Crunchyroll as a technology company, but I think they did a great job attracting a lot of top talent to their convention. I’m still a little miffed that nobody asked the darlifra folks about the aliens though.

I’ve also been trying to read more manga. I signed up for Shonen Jump’s monthly subscription service (although I’ve only ever used it to reference old chapters of Naruto). Aside from big shonen titles like Shingeki no Kyojin and Boku no Hero Academia, I don’t actually read much manga.

In any case, it’s midnight at the end of the month, so I’m going to stop writing here. See you next month.

  1. Disclaimer: I work here, but not on ads. ↩︎
  2. I’ve spent more time recently watching super long-form content on YouTube. I think 1 to 2 hours is a nice sweet spot, where I can reasonably finish the video in a single sitting, but not so short that I need to constantly queue up more content. ↩︎

Pretty good computer security

I don’t cut my own hair. I can’t change my own engine oil. At the grocery store, I always pick Tide and Skippy, even though the generic brand is probably cheaper and just as good. “Well, that’s dumb,” you say. “Don’t you know you could save a lot of money?”, you say. But that isn’t the point. Sure, my haircut is simple. I could buy some hair clippers, and I’d love to save a trip to the dealership every year. But there are people who are professionals at cutting hair and fixing cars, and I wouldn’t trust them to program their own computers. So why should they trust me to evaluate different brands of peanut butter?

Fortunately, they’ve made the choice really simple, even for amateurs like me. In fact, everyone relies on modern convenience to some extent. Even if you cut your own hair and even if you’re not as much of a Skippy purist, you’d have to admit it. People today do all kinds of crazy complicated tasks, and none of us has time to become experts at all of them.

Take driving, for example. Cars are crazy complicated. Most people aren’t professional drivers. In fact, most people aren’t even halfway decent drivers. But thanks to car engineers, you can do a pretty good job at driving by remembering just a few simple rules1. There are lots of other things whose simplicity goes unappreciated: filing taxes and power outlets and irradiating food with freaking death rays (okay, yes, I’m just naming things I see in my immediate vicinity). We’re lucky that people long ago had the foresight to realize how important/dangerous these things could be, and so they’ve all been aggressively simplified to protect people from themselves.

Are we done then? Has everything been child proofed already? It’s easy to believe so. But if you’re unfortunate enough to still own a computer in 2018, then you’ll know that we still have a ways to go. Computers are deceptively simple. Lots of people use them: adults and little kids and old folks. But despite their wide adoption, it’s still far too easy to use a computer the wrong way and end up in a lot of trouble. Here are a few things I’ve learned about computer security that I want to share. I won’t claim that following these steps will make you invulnerable, because that’s an unattainable goal. But I think these form a pretty good baseline, with which you can at least say you’ve done your due diligence.

Stop using a computer

I’m serious. How about an iPad? Not an option? Keep reading.

Use multi-factor authentication

This is the number one easiest and most effective thing you can do to improve your security posture right away. Many websites offer a feature where you’re required to type in a special code, in addition to your username and password, before you can log in. This code can either be sent to you via text message, or it can be generated using an app on your smartphone2. Multi-factor authentication is a powerful defense against account hijacking, but make sure to keep emergency codes as a backup in case you lose your MFA device.
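For the curious, the codes those authenticator apps display aren’t magic: they’re just an HMAC of the current time, standardized as TOTP in RFC 6238. Here’s a minimal sketch using only Python’s standard library (the base32 secret below is a made-up example; real secrets come from the QR code a site shows you during setup):

```python
import base64
import hmac
import struct
import time

def totp(secret_b32: str, digits: int = 6, period: int = 30) -> str:
    """Generate a time-based one-time password (RFC 6238)."""
    key = base64.b32decode(secret_b32)
    counter = int(time.time()) // period           # 30-second time step
    msg = struct.pack(">Q", counter)               # 8-byte big-endian counter
    digest = hmac.new(key, msg, "sha1").digest()
    offset = digest[-1] & 0x0F                     # dynamic truncation (RFC 4226)
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)

# Hypothetical example secret -- never hardcode a real one.
print(totp("JBSWY3DPEHPK3PXP"))
```

Since both you and the server know the secret and the time, you both compute the same 6 digits, and an attacker who only stole your password can’t.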

Get a password manager

People are bad at memorizing passwords, and websites are bad at keeping your passwords safe. You should adopt some method that allows you to use a different, unpredictable password for each of your accounts, whether that’s a password manager or just a sheet of notebook paper with cryptic password hints (but see my post about data loss).
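Whatever method you pick, the property that actually matters is that each password is generated randomly, not invented by a human. A sketch using Python’s `secrets` module (the length and alphabet here are arbitrary illustrative choices, not recommendations from any particular password manager):

```python
import math
import secrets
import string

ALPHABET = string.ascii_letters + string.digits + string.punctuation

def random_password(length: int = 20) -> str:
    """Pick each character uniformly at random using a CSPRNG."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))

# 94 symbols ** 20 characters is about 131 bits of entropy,
# far beyond anything a human would invent and reuse.
print(random_password())
print(round(20 * math.log2(len(ALPHABET)), 1))
```

A different password per account also means that one website’s inevitable breach doesn’t hand an attacker the keys to all the others.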

Try post-PC devices

Mobile operating systems (iOS and Android) and Chromebooks are the quintessential examples of trusted platforms with aggressively proactive protection (hardware trust anchors, signed boot, sandboxing, verified binaries, granular permissions). Most people already know this. But perhaps it’s not so obvious how poorly some of these protections translate to traditional programs.

Every program on your computer interacts with the file system, and while most well-behaved programs keep to themselves, every single one technically has access to all of your files. This puts everything on your filesystem, like your documents and even your browser cookies3, at risk. Programs have evolved with this level of flexibility, so filesystem access isn’t something that can easily be taken away. There have been attempts to sandbox traditional computer programs, like those found on the Mac App Store and Windows Store. But these features are strictly opt-in, and most computer programs will likely never adopt them.

There’s plenty of stuff to steal in the filesystem, but that’s not all. Modern operating systems offer debugging interfaces, which allow you to read and write the memory of other programs. They also offer mechanisms for you to read the contents of the clipboard, or take screenshots of other programs, or even control the user’s input devices. These all sound like terrifying powers in the hands of a malicious program. But don’t forget that poorly-written and insecure programs can just as easily undermine your security, by allowing remote or web-based attackers to take control of them.

Because mobile and web-based platforms are newer, they’ve been able to lock down these interfaces. These new platforms don’t need to support a long legacy of older programs that expect such broad and unfettered access. Each mobile app typically gets an isolated view of the filesystem, where it can only access the files relevant to it. Debugging is disabled, and screenshots are a privilege reserved for system apps.

Don’t type passwords in public

If you film somebody typing their password, it’s surprisingly easy to play it in slow motion and see each key as it’s being pressed. This is especially true if you have a high frame rate camera, or if the victim is typing their password into a smartphone keyboard (or is just really slow at typing). Even if the attacker can’t see your fingers, there’s lots of new research in using acoustic or electromagnetic side channels to record typing from a distance. When you have to type your password, try covering your fingers by closing your laptop halfway, or go to the bathroom and do it there.

Don’t type passwords at all

Lots of people use a simple 6-digit PIN or a pattern to unlock their smartphone. At a glance, this seems safe, because the number of possible PINs and patterns greatly exceeds how many times the phone lets you attempt to unlock it. But 10-digit keypads and pattern grids are also easily discernible from a long distance. Plus, phones that use at-rest encryption typically derive encryption keys from the unlock passphrase. There isn’t enough entropy in your 6-digit PIN or pattern to properly encrypt your data.
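The entropy gap is easy to quantify with some back-of-envelope arithmetic (the 389,112 figure is the commonly cited count of valid 3x3 Android unlock patterns; the passphrase examples are illustrative):

```python
import math

def bits(choices: int) -> float:
    """Entropy in bits of a secret chosen uniformly from `choices` options."""
    return math.log2(choices)

print(round(bits(10 ** 6), 1))      # 6-digit PIN: ~19.9 bits
print(round(bits(389_112), 1))      # any valid 3x3 unlock pattern: ~18.6 bits
print(round(bits(26 ** 12), 1))     # 12 random lowercase letters: ~56.4 bits
print(round(bits(7776 ** 4), 1))    # 4 random diceware words: ~51.7 bits
```

Twenty bits is nothing against an offline brute force attack, which is exactly the kind of attack a thief with your powered-off phone gets to run.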

If you care about locking your phone securely, then you should use its fingerprint sensor and set up a long alphanumeric passphrase. You’ll rarely need to actually type your passphrase, as long as your fingerprint sensor works correctly. Thieves can’t steal your fingerprint (unless they’re really motivated), and you can rest assured that your strong password will protect your phone’s contents from brute force attacks.

Full disk encryption

If you have a computer password, but you don’t use full disk encryption, then your password is practically useless against a motivated thief. In most computers, the storage medium can be easily removed and transplanted into a different computer4. This technique allows an adversary to read all your files without needing your computer password.

To prevent this, you should enable full disk encryption on your computer (FileVault on macOS, or BitLocker on Windows). Lots of modern computers and smartphones enable full disk encryption by default. Typical FDE implementations use your login password and encrypt/decrypt files automatically as they’re read from and written to disk. Once the computer is shut down, the key material is forgotten and your data becomes inaccessible. Since the encryption keys are derived from your login password, you should make sure that it’s strong enough to stand up to brute force attacks.
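Under the hood, FDE schemes typically run your password through a deliberately slow key derivation function before using it as (or to unwrap) the disk key. A rough sketch of that step, using PBKDF2 from Python’s standard library (the password, salt handling, and iteration count are illustrative, not what any real FDE product uses):

```python
import hashlib
import os

password = b"correct horse battery staple"  # hypothetical login password
salt = os.urandom(16)                       # stored in the clear next to the ciphertext

# Derive a 256-bit key. The high iteration count means every brute-force
# guess costs an attacker 200,000 hashes instead of one.
key = hashlib.pbkdf2_hmac("sha256", password, salt, 200_000)
print(len(key))  # 32 bytes
```

This slows attackers down, but it can’t create entropy that isn’t there, which is why the login password itself still has to be strong.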

Don’t use antivirus software

What is malware? Antivirus software uses a combination of heuristics (patterns of behavior) and signatures (file checksums, code patterns, C&C servers, altogether known as indicators of compromise) to detect bad programs. It’s really effective at stopping known threats from spreading on vulnerable computer networks (picture an enterprise office network). But nowadays, all that antivirus software buys you is some peace of mind, delivered via a friendly green check mark.
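To make the signature half concrete: at its simplest, a signature check just hashes a file and looks the digest up in a database of known-bad checksums. A toy sketch (the lone “signature” below is the SHA-256 of an empty file, standing in for real indicators of compromise), which also shows why signatures fail against novel malware: flip one byte and the hash no longer matches.

```python
import hashlib

# Hypothetical signature database: checksums of known-bad files.
KNOWN_BAD_SHA256 = {
    "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
}

def looks_malicious(path: str) -> bool:
    """The simplest kind of signature check: hash the file, look it up."""
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    return digest in KNOWN_BAD_SHA256
```

Real products are more sophisticated than this, of course, but the fundamental limitation is the same: you can only recognize what somebody has already seen.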

A doctor in the United States wouldn’t recommend you buy a mosquito net. Similarly, antivirus software is a superfluous and unsuitable countermeasure for modern-day threats. It doesn’t protect against OAuth worms. It doesn’t protect against network-exploitable RCEs. It doesn’t protect against phishing (and when it tries to, it usually just makes everything worse). Since you’re reading this blog post, I assume you’re at least making an effort to improve your security posture. That fact alone puts you squarely in the group of people who don’t need antivirus software. So get rid of it.

Get a gaming computer

Games need to be fun. They don’t need to be secure. Given the amount of C++ code found in typical high-performance 3D games, I think it’s fairly likely that most of them have at least one or two undiscovered network-exploitable memory corruption vulnerabilities. Plus, lots of games come bundled with cheat detection software. These programs typically use your operating system’s debugging interfaces to peek into the other programs running on your computer. I like computer games, but I’d be uncomfortable running them alongside the programs that I use to check my email, fiddle with money, and run my servers. So if you play games, you should consider getting a separate computer that’s specifically dedicated to running games and other untrusted programs.

Get an iPhone

You need a phone that receives aggressive security updates. You need a phone that doesn’t include obscure unvetted apps, preloaded by the manufacturer. You need a phone with million dollar bounties for security bugs. You need a phone made by people who take pride in their work, people who love presenting about their data encryption schemes, people with the courage to stand up against unreasonable demands from their government.

Now, that phone doesn’t need to be an iPhone, per se. But that’s the phone I’d feel most comfortable recommending to others. And if that’s not an option, maybe a Google Pixel?

Research your software

Who makes your software? Do you know? Have you met them? (If you live in Silicon Valley, that’s not such a rhetorical question.) Because security incidents happen all the time, it’s easy to think that they’re random and unpredictable, like getting struck by lightning. But if you pay attention, you’ll notice that a lot of incidents can be attributed to negligence, laziness, or apathy. Software companies that prioritize security are quite rare, because security is a never ending quest, it costs lots of money, and in most cases, it produces no tangible results.

If you’re a computer programmer, you should get in the habit of reading the source code of programs that you use. And if that’s unavailable, then you can use introspection tools5 to reverse engineer them. After all, how can you trust a piece of software to be secure if you’ve never even tried to attack it?

You should also research the software publishers, and decide which ones you do and don’t trust. Do they have a security incident response team? Do they have a security bug bounty? A history of negligent security practices? Are they known for methodical engineering? Or do they prefer getting things done the quick and easy way? If you only run programs written by publishers that you trust, then you can greatly limit your exposure against poor engineering.

Use secure communication protocols

The gold standard in secure communication is end to end encryption with managed keys. This is where your messages are only readable by you and the recipient, but you rely on a trusted third party to establish the identity of the recipient in the first place. You don’t always need to use end to end encryption, since other methods might be more widely adopted or convenient.

Don’t use browser extensions

Browser extensions are so incredibly useful, but I don’t feel comfortable recommending any browser extension to anybody (except those that don’t require any special permissions). There have been so many recent incidents where popular browser extensions were purchased from the original developer, in order to force ads and spyware onto their large preexisting user base. Browser extensions represent one of the few security weaknesses of the web, as a platform. So, you should only trust browser extensions published by well-established tech companies or browser extensions that you’ve written (or audited) yourself.

Turn off JavaScript

I mentioned before that websites run in an isolated sandbox, away from the rest of your computer. That fact supposedly makes web applications more secure than traditional programs. But as a result, web browsers don’t hesitate to boldly execute whatever code they’re fed. To reduce your exposure, you should consider only allowing JavaScript for a whitelist of websites where you actually need it. For example, there are plenty of news websites that are perfectly legible without JavaScript. They’re also the sites that use sketchy ad networks full of drive-by malware, probing for vulnerabilities in your browser.

Browsers allow you to disable JavaScript by default and enable it on a site-by-site basis. When you notice that a site isn’t working without JavaScript, you can easily enable it and refresh the page.

Don’t get phished

The internet is full of liars. You can protect yourself by paying attention to privileged user interface elements (like the address bar), using bookmarks instead of typing URLs, and maintaining a healthy amount of skepticism.
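The address bar advice deserves one concrete detail: only the hostname matters, and phishers love burying a familiar name inside a lookalike. A sketch of the check your brain should be doing (the `example.com` and `.test` domains are placeholders):

```python
from urllib.parse import urlparse

def same_site(url: str, expected_host: str) -> bool:
    """True only if the URL's actual hostname is the expected site
    or a subdomain of it -- what appears in the path or in a
    confusing prefix doesn't count."""
    host = urlparse(url).hostname or ""
    return host == expected_host or host.endswith("." + expected_host)

print(same_site("https://accounts.example.com/login", "example.com"))   # True
print(same_site("https://example.com.evil.test/login", "example.com"))  # False
```

That second URL starts with the name you trust, which is exactly why it works on people who read addresses left to right and stop early.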

Get ready to wipe

If you suspect that malware has compromised your computer, then you should wipe and reinstall your operating system. Virus removal is a fool’s errand. As long as you have recent data backups (and you’re certain that your backups haven’t also been infected), then you should be able to wipe your computer without hesitation.

  1. One pedal goes vroom. The other brakes. Clockwise goes to your right, and counter-clockwise, to your other right. If the dashboard lights up orange, pull over and check your tires. These simple rules are all you need to know in order to drive a car. The other parts don’t really matter. We expect cars to Just Work™, and for the most part, they do. ↩︎
  2. Or a U2F security key, if you really love spending money. ↩︎
  3. Encryption using kernel-managed keys makes this more difficult. ↩︎
  4. Soldered components can make this harder. ↩︎
  5. Look at its open files, its network traffic, and its system calls. ↩︎

The rules are unfair

It’s 11pm. I should go to sleep. But sometimes, when it’s 11pm and I know I should go to sleep, I don’t. Instead, I stay up and watch dash cam videos on the internet. Nowadays, you can pop open your web browser and watch what are perhaps the worst moments of someone’s life, on repeat, in glorious high definition. It’s tragic, but also viscerally entertaining and conveniently packaged into small bite sized clips for easy consumption. Of course, not all the videos involve car accidents. Some of them just show bad drivers doing stupid things that happened to be caught on camera. In any case, I’ll probably read some comments, have a chuckle, and then eventually feel guilty enough to go to bed.

Dash cams are impartial observers. They give us rare glimpses into the real world, without the taint of unreliable human narrators1. Unlike most internet videos, there aren’t any special effects or editing. Dash cams don’t jump around from one interesting angle to the next, and they don’t come with cinematic, mood-setting background music (unless the cammer’s radio happens to be on). Instead, dash cams just look straight forward, paying as much attention to the boring as to the interesting, recording a small slice of history with the utmost dedication, until some impact inevitably knocks them off their windshield mount.

You can analyze dash cam footage, frame by frame, to see exactly why things played out the way they did. Most of us drive every day, and most of us know the risks, but these videos don’t bother us. We convince ourselves that “that could never happen to me, because I don’t follow trucks that closely” or “I always look both ways before entering a fresh green light”. Even when an accident or a near miss isn’t the cammer’s fault, we can still usually find something to blame them for. Somebody else may have caused the accident, but perhaps it was the cammer’s own poor choices that created the opportunity to begin with. Who knows? If the cammer had driven more defensively, then maybe nothing would have happened, and there would be one fewer video delaying my bedtime this evening. It’s only natural for us to demand such an explanation. We want evidence that the cammer deserved what happened to them. Because why would bad things just happen to not-bad people for no reason? That makes no sense. We’re not-so-bad people ourselves, after all. And if we can’t find fault with the cammer, then why aren’t all those same terrible things happening to us?

A few weeks ago, some Canadian kid downloaded a bunch of unlisted documents2 from a government website, and so the police went to his house and took away all his computers. “Bah, that’s absurd!”, you say. “Just another handful of bureaucrats that don’t understand technology”, you say. And as a fan of computers myself, I’d have to agree. It’s obviously the web developer’s fault that the documents were accessible without needing to log in, and it’s the web developer’s fault that even a teenager could trivially guess their URLs. But I also realize that, as computer programmers, it’s our natural instinct to focus only on the technical side of an issue. After all, how many countless months have we spent working on access control and security for computer systems? It’s easy for us to see this as a technical failure, to indemnify the kid, and to place the blame entirely on whoever created the website. But alas, the police don’t arrest bad programmers for making mistakes3.

Plenty of people scrape all kinds of data from websites, sometimes in ways that don’t align with the web developer’s intentions. Should they all expect the police to raid their houses too? Well, there are actually several legitimate reasons why this kind of activity could be considered immoral or even illegal (copyright infringement, violating terms of service, or consuming a disproportionate amount of computer resources). But a lot of laws and user agreements also use phrases like “unauthorized use of computers”, and with such vague language, even something as innocent as downloading publicly available files could be classified as a violation. I mean, sure. It’d be difficult to prove that the Canadian kid had any kind of criminal intent. But it’s pretty clear that nobody authorized him, explicitly or implicitly (by way of hyperlinks), to download all of those files. Isn’t that “unauthorized use”?

Laws are complicated. I’m no lawyer. I don’t even understand most of the regulations that apply to computers and the internet in the United States. But recently, I’ve come to realize that there’s a fundamental disconnect between laws and morality. When people don’t understand their laws, they tend to just assume that the law basically says “do the right thing”. And if you have even a shred of self respect, then you’ll probably reply “I’m doing that already!”. Naturally, we can’t be expected to know all the particulars of what laws do and don’t allow us to do. So, some amount of intuition is required. Fortunately, nobody bothers to prosecute minor infractions, and people tend to talk a lot about unintuitive or immoral laws (marriage, abortion, weed, etc), so they quickly become common knowledge. But I’d argue that laws aren’t actually an approximation of morality. It’d be more accurate to say that laws are an approximation of maximum utility (much like everything else, if you believe utilitarianism).

There are some laws that exist, not because they’re moral truths, but because people are just generally happier when everybody obeys them. Is it immoral to suggest that people are predisposed to certain careers on the basis of their protected classes? Hard to say, but that’s beside the point. We don’t permit those ideas in public discourse, because people are, as a whole, happier and more productive when we don’t talk about those things.

Laws also aren’t an approximation of fairness, but are only fair to the extent that perceived fairness contributes to overall utility. What makes a fair law? That it benefits everyone equally? Obviously, a fair healthcare system should benefit really sick people more than it benefits people who just want antibiotics for their cold. Maybe a fair law should apply equally to everyone, regardless of their protected classes. But under that definition, regressive income tax and compulsory sterilization would be classified as fair laws. Should wealth and fitness be protected classes? What about age? There are plenty of laws that apply only to children4, but then again, maybe that’s why they’re always complaining that the rules are unfair.

I’ve learned to reduce my trust in rules, and I’ve started to distinguish what’s right from what’s allowable. You could argue that, in some sense, utilitarianism is the ultimate form of morality. But there’s a bunch of pitfalls in that direction, so I’ll stop writing now and go to sleep.

The coast near Pacifica, California.

  1. Or so we’d like to think. I feel like most of the time, dash cam videos unfairly favor the cammer. Unlike humans, dash cams can’t turn their head, have a narrower field of vision, and usually only show what’s in front of the vehicle. Actual drivers are expected to be more aware of their surroundings than what a dash cam allows. ↩︎
  2. Even though the documents were unlisted, they had predictable numeric URLs, and so it was trivial to guess the document addresses and download them all. ↩︎
  3. They blame architects when bridges collapse. Should computer programs that are relevant to public safety be held to the same standard? ↩︎
  4. I think that it’s impossible to believe in fairness unless you also believe in the idea of the individual. Are children individuals? What about conjoined twins? ↩︎

Computer programming

I think I like being a computer programmer. Really. It’s not just something I say because it sounds good in job interviews. I’ve written computer programs almost every day for the last five years. I spend so much time writing programs that I can’t imagine myself not being a computer programmer. Like, could you enjoy food without knowing how to cook? Or go to a concert, having never seen a piece of sheet music? Yet plenty of people use computers while not also knowing how to reprogram them. That’s completely foreign to me and makes me feel uncomfortable. Fortunately, watching uncomfortable things also happens to be my favorite pastime.

In fact, I spend a lot of time just watching regular people use their computers. If we’re on the bus and I’m staring over your shoulder at your phone, I promise it’s not because I have any interest in reading your text messages. I actually just want to know what app you’re using to access your files at home. After all, you probably have just as many leasing contract PDFs and scanned receipts as I do. Yet, you somehow manage to remotely access yours without any servers or programmable networking equipment in your apartment.

My own digital footprint has gotten bigger and more complicated over the years. Right now, the most important bits are split between my computers at home and a small handful of cloud services1. At home, I actually use two separate computers: my MacBook Pro and a small black and silver Intel NUC that sits mounted on my white plastic Christmas tree. The NUC runs Ubuntu and is responsible for building all the personal software I write, including Hubnext, which runs this website. I also use it as a staging area for backups and large uploads to the cloud that run overnight (thanks, U.S. residential ISPs). I use the MacBook Pro for everything else: web browsing, media consumption, and programming. It’s become fairly easy to write code on one machine and run it on the other, thanks to a collection of programs I slowly built up. Maybe I’ll write about those one day.

At some point after starting my full time job, I noticed myself spending less and less time programming at home. It soon became apparent that I was never going to be able to spend as much time on my personal infrastructure as I wanted to. In college, I ran disaster recovery drills every few months or so to test my backup restore and bootstrapping procedures. I could work on my personal tools and processes for as long as I wanted, only stopping when I ran out of ideas. Unfortunately, I no longer have unlimited amounts of free time. I find myself actually estimating the required effort for different tasks (can you believe it?), in order to weigh them against other priorities. My to-do list for technical infrastructure continues to grow without bound, so realistically, I know that I’ll never be able to complete most of those tasks. That makes me sad. But perhaps what’s even worse is that I’ve lost track of what I liked about computer programming in the first place.

The shore near Pacifica Municipal Pier.

Lately, I’ve been spending an inordinate amount of time working on Chrome extensions. There are a handful that I use at home and others that I use at work. But unlike most extensions, I don’t publish these anywhere. They’re basically user scripts that I use to customize websites to my liking. I have one that adds keyboard shortcuts to the Spotify web client. Another one calculates the like to dislike ratio and displays it as a percentage on YouTube videos. Lots of them just hide annoying UI elements or improve the CSS of poorly designed websites. Web browsers introduced all sorts of amazing technologies into mainstream use: sandboxes for untrusted code, modern ubiquitous secure transport, cross-platform UI kits, etc. But perhaps the most overlooked is just how easy they’ve made it for anybody to reverse engineer websites through the DOM2. Traditional reverse engineering is a rarefied talent and increasingly irrelevant for anybody who isn’t a security researcher. But browser extensions are approachable, far more useful, and also completely free, unlike a lot of commercial RE tools.

When I work on browser extensions, I don’t need a to-do list or any particular goal. I usually write extensions just to scratch an itch. I’ll notice some UI quirk, open developer tools, hop over to Atom, fix it, and get right back to browsing with my new modification in place. Transitioning smoothly between web browsing and extension programming is one of the most pleasant experiences in computer programming: the development cycle is quick, the tools are first class, and the payoff is highly visible and immediate. It made me remember that computer programming could itself be enjoyed as a pastime, rather than a means to an end.

The bluffs at Mori Point.

I’m kind of tired of sitting down every couple months to write about some supposedly fresh idea or realization I’ve had. At some point, I’ll inevitably run out of things to say. Until that happens, I guess I’ll just keep rambling.

The other major recent development in my personal programming work is that I’ve started merging all my code into a monorepo. The repo is just named “R”, because one advantage of working for yourself is that you don’t have to worry about naming anything the right way3. It started out with just my website stuff, but since then I’ve added my security camera software, my home router configuration, my data backup scripts, a bunch of docs, lots of experimental stuff, and a collection of tools to manage the repo itself. Sharing a single repo for all my projects comes with the usual benefits: I can reuse code between projects. It’s easier to get started on something new. When I work on the repo, I feel like I’m working on it as a whole, rather than any individual project. Plus, code in the repo feels far more permanent and easy to manage than a bunch of isolated projects. It’s admittedly become a bit harder to release code to open source, but given all the troublesome corporate licensing nonsense involved, I’m probably not going to do that anyway.

I feel like I’m just now finishing a year-long transition process from spending most of my time at school to spending most of my time at work. It took a long time before I really understood what having a job would mean for my personal programming. My hobby had been stolen away by my job, and to combat that feeling, I dedicated lots of extra time to working on personal projects outside of work. But instead of finding a balance between hobbies and work, the extra work just left me burnt out. That situation has improved somewhat, partially because I’ve been focusing on projects that let me reduce how much time I spend maintaining personal infrastructure, but also because I’ve accomplished all the important tasks and I’ve learned to treat the remainder as a pastime instead of as chores. But there’s still room for improvement. So, if you’re wondering what I’m working on cooped up in my apartment every weekend, this is it.

At some point, I should probably write an actual guide about getting started with computer programming. People keep emailing me about how to get started, even though I’m not a teacher (at least not anymore) and I don’t even regularly interact with anyone who’s just starting to learn programming. And I should probably do it before I start to forget what it’s like to enjoy programming altogether.

  1. Things are supposedly set up with enough redundancy to lose any one piece, but that’s probably not entirely true. I’ve written about this before. In any case, one of the greatest perks of working at the big G is the premium customer support that’s implicitly offered to all employees, especially those in engineering. If things really went south, I guess I could rely on that. ↩︎
  2. I really appreciate knowing just enough JavaScript to customize web applications with extensions. It’s one of the reasons I gave up vim for Atom. ↩︎
  3. Lately, I’ve been fond of naming things after single letters or phonetic spellings of letters. I have other things named A, B, c2, Ef, gee, Jay, K, L, Am, and X. ↩︎

Net neutrality

I don’t know a single person who actually supports the FCC’s recent proposal to deregulate broadband internet access by reclassifying it as an information service. However, I also never did meet a single person who unironically supported last year’s President-elect, but that doesn’t seem to have made any difference. There were certainly a lot of memes in last year’s U.S. election cycle, but I remember first seeing memes about net neutrality when I was in high school. That was before all the Comcast-Verizon-YouTube-Netflix hubbub and before net neutrality was at the forefront of anyone’s mind. So naturally, net neutrality got tossed out and ignored along with other shocking but “purely theoretical” future problems like climate change and creepy invasive advertising1. But I’ve seen some of those exact same net neutrality infographics resurface in the last couple weeks, and in retrospect, I realize that many of them were clearly made by people who weren’t network engineers and weren’t at all familiar with how the business of being an ISP actually works. And so, the materials used in net neutrality activism were, and still are, sometimes inaccurate, misleading, or highly editorialized to scare their audience into paying attention2.

Now, just to be clear: Do I think the recent FCC proposals are in the best interest of the American public? Definitely not. And do I think that the general population is now merely an audience to political theater orchestrated by, or on behalf of, huge American residential ISPs? I think it’s probable. After all, why bother proposing something so overwhelmingly unpopular unless it’s already guaranteed to pass? Nevertheless, I feel like spreading misinformation about the realities of net neutrality is only hurting genuine efforts to preserve it.

Before I begin, I should mention that you can read the full 210-page proposal yourself if you want the full picture. I hate that news websites almost never link to primary sources for the events they’re covering, and yet nobody really holds them accountable to do so. Anyway, I’m no lawyer, but I feel like in general, the legal mumbo-jumbo behind controversial proposals is usually far more subtle and less controversial-sounding than the news would have you believe. That’s very much the case here. And for what it’s worth, I think this particular report is very approachable for normal folks and does do a good job of explaining context and justifying its ideas.

Brush plant in San Francisco.

If you haven’t read the proposal, the simple version is that in mid-December 2017, the FCC will vote on whether to undo the net neutrality rules they set in 2015 and replace them with the requirement that ISPs must publish websites that explain how their networks work. The proposal goes on to explain how the existing rules stifle innovation by discouraging investment in heavily-regulated network infrastructure. In practice, the proposed changes would eliminate the “Title II” regulations that prevent ISPs from throttling or blocking legal content and delegate the policing of ISPs to a mix of the FTC, antitrust laws, and market forces.

It’s hard to predict the practical implications of these changes, but based on past examples, many people point to throttling of internet-based services that compete with ISPs (like VoIP) and high-bandwidth applications (like video streaming and BitTorrent), along with increased investment in ISP-owned alternatives to internet services for things like video and music streaming. As a result, a lot of net neutrality activism revolves around purposefully slowing down websites or presenting hypothetical internet subscription plans that charge extra fees for access to different kinds of websites (news, gaming, social networking, video streaming) in order to illustrate the consequences of a world without net neutrality. But in reality, neither of these scenarios is very realistic, and yet both of them already exist in some form today.

You probably don’t believe me when I say that throttling is unrealistic, and I understand. After all, we’ve already seen ISPs do exactly that to Netflix and YouTube. But first, you should understand a few things about residential ISPs.

Like most networks, residential ISPs are oversubscribed, meaning that the individual endpoints of the network are capable of generating more traffic than the network itself can handle. They cope with oversubscription by selling subscription plans with maximum throughput rates, measured in megabits per second3. Upstream traffic (sent from your home to the internet) is usually throttled down to these maximum rates by configuration on your home modem. But downstream traffic (sent from the internet to your home) is usually throttled at the ISP itself by delaying or dropping traffic that exceeds this maximum configured limit. So you see, the very act of throttling or blocking traffic isn’t a concern for net neutrality. In fact, most net neutrality regulations have exemptions that allow this kind of throttling when it’s for purely technical reasons, because some amount of throttling is an essential part of running a healthy network.
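
The downstream throttling described above is commonly modeled as a token bucket: tokens accumulate at the plan’s maximum rate, and packets that find the bucket empty get delayed or dropped. Here’s a minimal Python sketch of the idea (the class and parameter names are my own, purely illustrative):

```python
import time

class TokenBucket:
    """Minimal token-bucket rate limiter (illustrative sketch only).

    rate_bps: sustained throughput limit, in bytes per second
    burst: maximum burst size, in bytes
    """
    def __init__(self, rate_bps, burst, now=time.monotonic):
        self.rate = rate_bps
        self.burst = burst
        self.tokens = burst  # start with a full bucket
        self.now = now
        self.last = now()

    def allow(self, packet_bytes):
        # Refill tokens for the elapsed time, capped at the burst size.
        t = self.now()
        self.tokens = min(self.burst, self.tokens + (t - self.last) * self.rate)
        self.last = t
        if self.tokens >= packet_bytes:
            self.tokens -= packet_bytes
            return True   # forward the packet
        return False      # delay or drop the packet
```

An ISP-grade implementation lives in hardware or the kernel’s queueing layer, of course; the point is just that rate enforcement is ordinary bookkeeping, not anything sinister.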

Furthermore, all ISPs already discriminate (i.e. apply different throttling or blocking rules) against certain types of traffic by way of classification. At the most basic level, certain types of packets (like zero-length ACKs) and connections (like “mice flows”) are given priority over others (like full sized data packets for video streaming) as part of a technique known as quality of service (QoS). Many ISPs also block or throttle traffic on certain well-known ports, such as port 25 for email and port 9100 for printing, because they’re commonly abused by malware and there’s usually no legitimate reason to route such traffic from residential networks onto the open internet. Finally, certain kinds of traffic can be delivered more quickly and reliably simply because of networking arrangements made between your ISP and content providers (like Netflix Open Connect). In other cases, your ISP may be stuck in a disadvantageous peering agreement, in which it has to pay another ISP extra money to send or receive traffic on certain network links, in addition to just the costs of maintaining the network itself.
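
To make classification concrete, here’s a toy Python version of the kind of rules described above. The priority levels, size threshold, and port list are illustrative guesses rather than any real ISP’s policy:

```python
# Hypothetical priority classes, highest priority first.
PRIORITY_ACK, PRIORITY_INTERACTIVE, PRIORITY_BULK = 0, 1, 2

# Ports commonly blocked on residential networks (SMTP relay, raw printing).
BLOCKED_PORTS = {25, 9100}

def classify(dst_port, payload_len):
    """Toy QoS classifier: returns a priority class, or None to drop."""
    if dst_port in BLOCKED_PORTS:
        return None                  # blocked outright
    if payload_len == 0:
        return PRIORITY_ACK          # zero-length ACKs jump the queue
    if payload_len < 512:
        return PRIORITY_INTERACTIVE  # small packets from "mice flows"
    return PRIORITY_BULK             # full-sized data packets, e.g. video
```

Real classifiers run in silicon and look at far more than port and size, but the shape is the same: every packet gets sorted into a bucket, and the buckets get treated differently.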

People generally agree that none of these count as net neutrality violations, because they’re practical aspects of running a real network and, in many cases, they justify themselves by providing real value to end users. It’s difficult to explain concisely what divides these kinds of blocking and throttling from the scandalous net neutrality kind. Supposedly, net neutrality violations typically involve blocking or throttling for “business” reasons, but “reducing network utilization by blocking our competitors” could arguably have technical benefits as well. In practice, most people call it a net neutrality violation when it’s bad for customers and call it “business as usual” when it’s either beneficial for customers or represents the way things have always worked. In any case, the elimination of all blocking and throttling is neither practical nor desirable. When discussing net neutrality, it’s important to acknowledge that many kinds of blocking and throttling are legitimate and to (try to) focus on the kinds that aren’t.

Leaves in Golden Gate Park.

Websites that purposefully slow themselves down paint a wildly inaccurate picture of a future without net neutrality, especially when they do so without justification. ISPs gain nothing from indiscriminate throttling, other than saving a couple bucks on power and networking equipment. Plus, ISPs can (and do) get the same usage reduction benefits by imposing monthly bandwidth quotas, which have nothing at all to do with net neutrality. I think a more likely outcome is that ISPs will start pushing for the adoption of new heterogeneous internet and TV combo subscription plans. These plans will impose monthly bandwidth quotas on all internet traffic except for a small list of partner content providers, which will complement a larger emphasis on ISP-provided TV and video on demand services. After all, usage of traditional notebook and desktop computers is on the decline in favor of post-PC era devices like smartphones and tablets. A number of U.S. households would probably be perfectly happy to trade boring unmetered internet for a 10GB/month residential broadband internet plan combined with a TV subscription and unlimited access to the ISP’s first-party video on demand service along with a handful of other top content providers. Such a plan could eliminate the need for third-party video streaming subscriptions like Netflix, thereby providing more content for less money. Naturally, a monthly bandwidth quota would make it difficult for non-partner video streaming services to compete effectively, but fuck them, right?
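
Some quick arithmetic on why a quota like that squeezes video in particular (the 5 Mbit/s figure for HD streaming is my own illustrative assumption):

```python
# How far does a 10 GB monthly quota go for video streaming?
quota_gb = 10
stream_mbps = 5  # assumed bitrate for HD video, Mbit/s

# Mbit/s -> GB per hour: divide by 8 for bytes, scale by seconds per hour.
gb_per_hour = stream_mbps / 8 * 3600 / 1000   # 2.25 GB/hour
hours_of_video = quota_gb / gb_per_hour        # under 5 hours per month
```

A quota that a non-partner streaming service exhausts in an evening or two is barely noticed by web browsing and email, which is exactly what makes it such an effective lever.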

I should point out that no matter what happens to net neutrality, we’ll still have antitrust laws (the proposal mentions this explicitly) and an aggressive DoJ to chase down offenders. Most ISPs operate as a local monopoly or duopoly. So, using their monopoly position in the internet access market to hinder competition in internet content services sounds like an antitrust problem to me. But it’s possible that the FCC’s reclassification of internet access as an “information service” may change this.

The other example commonly used by net neutrality activists is the a-la-carte internet subscription. In this model, different categories of content (news, gaming, social networking, video streaming) each require an extra fee or service tier, sort of like how HBO and Showtime packages work for TV today. For this to work, ISPs need to be able to block access to content that subscribers haven’t paid for. In the past, this might have been implemented with a combination of protocol-based traffic classification (like RTMP for video streaming), destination-based traffic classification (well known IP ranges used by online games), and plain old traffic sniffing (reconstructing plaintext HTTP flows). But such a design would be completely infeasible from a technical standpoint in today’s internet.

First, nearly all modern internet applications use some variant of HTTP to carry the bulk of their internet traffic. Even applications that traditionally featured custom designed protocols (video conferencing, gaming, and media content delivery) now almost exclusively use HTTP or HTTP-based protocols (HLS, WebSockets, WebRTC, REST, gRPC, etc). This is largely because HTTP is the only protocol that has both ubiquitous compatibility with corporate firewalls and widespread infrastructural support in terms of load balancing, caching, and instrumentation. As a result, it’s far more difficult today to categorize internet traffic with any degree of certainty based on the protocol alone.

Additionally, most of the aforementioned HTTP traffic is encrypted (newer variants like SPDY and HTTP/2 virtually require encryption to work). For the a-la-carte plan to work, you need to first categorize all internet traffic. We can get some hints from SNI and DNS, but that’s not always enough and also easily subverted.
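
To illustrate how weak those hints are, here’s a hypothetical categorizer keyed on SNI hostnames. The category table is made up, and the whole approach collapses as soon as the hostname hint is hidden, absent, or fronted through a CDN:

```python
# Hypothetical category table, keyed on registered domain.
CATEGORIES = {
    "netflix.com": "video",
    "youtube.com": "video",
    "steampowered.com": "gaming",
    "twitter.com": "social",
}

def categorize_sni(server_name):
    """Best-effort guess from a TLS SNI hostname; None means unknown.

    Easily subverted: clients can omit SNI, tunnel through a VPN, or
    use encrypted SNI, which removes the hint entirely.
    """
    if not server_name:
        return None
    labels = server_name.lower().rstrip(".").split(".")
    # Try progressively shorter suffixes: a.b.c -> b.c
    for i in range(len(labels) - 1):
        suffix = ".".join(labels[i:])
        if suffix in CATEGORIES:
            return CATEGORIES[suffix]
    return None
```

Everything that doesn’t match falls into the “unknown” bucket, which is exactly why per-category billing can’t work without cooperation from the services being categorized.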

Internet applications with well-known IP ranges are also a thing of the past. Colocation has given way to cloud hosting, and it’s virtually impossible to tell what’s inside yet another encrypted HTTPS stream to some AWS or GCP load balancer.

Essentially, there can’t truly exist a “gaming” internet package without the cooperation of every online game developer on the planet.

A-la-carte models work well with TV subscriptions because there are far fewer parties involved. If ISPs ever turn their attention to gamers, it’s most likely that they’ll partner with a few large “game networks” that can handle both the business transactions4 and the technical aspects of identifying and classifying their game traffic on behalf of residential ISPs. So, you probably won’t be buying an “unlimited gaming” internet package anytime soon. Instead, you’ll be buying unlimited access to just Xbox Live and PSN. From that point on, indie game developers will simply have to fit in your monthly 10GB quota for “everything else”.

Reeds near the San Francisco Bay.

Net neutrality activists say that net neutrality will keep the internet free and open. But the very idea of a “free and open” internet is a sort of a myth. To many people, the ideal internet is a globally distributed system of connected devices, where every device can communicate freely and equally with every other device. In a more practical sense, virtually anybody should have the power to publish content on the internet, and virtually anybody should be able to consume it. No entity should have control over the entire network, and no connection to the internet should be more special than any other, because being connected to the internet should mean that you’re connected to all of it.

In reality, people have stopped using the internet to communicate directly with one another. Instead, most internet communication today is mediated by the large global internet corporations that run our social networks, our instant messaging apps, our blogs, our news, and our media streaming sites. Sure, you’re “free” to post whatever garbage you’d like to your Tumblr or Facebook page, but only as long as those companies allow you to do so.

But the other half of a “free and open” internet means that anybody can start their own Facebook competitor and give control back to the people, right? Well, if you wanted to create your own internet company today, your best bet would be to piggyback off of an existing internet hosting provider like AWS or GCP, because they’ve already taken care of the prerequisites for creating globally accessible (except China) internet services. At the physical level, the internet is owned by the corporations, governments, and organizations that operate the networking infrastructure and endpoints that constitute the internet. The internet only works because those operators are incentivized to bury thousands of miles of optical fiber beneath the oceans and hang great big antennas on towers in the sky. It only works because, at some point in the past, those operators made agreements to exchange traffic with each other5 so that any device on one network could send data that would eventually make it to any other network. There’s nothing inherent about the internet that ensures the network is fully connected (in fact, this fully-connectedness breaks all the time during BGP leaks). It’s the responsibility of each individual content publisher to make enough peering arrangements to ensure they’re reachable from anyone who cares to reach them.

Not all internet connections are equal. Many mobile networks and residential ISPs use technologies like CGNAT and IPv6 tunneling to cope with IPv4 address exhaustion. But as a result, devices on those networks must initiate all of their traffic and cannot act as servers, which must be able to accept traffic initiated by other devices. In practice, this isn’t an issue, because mobile devices and (most) residential devices initiate all of their traffic anyway. But it does mean that such devices are effectively second class citizens on the internet.

It’s also an increasingly common practice to classify internet connections by country. Having a privileged IP address, like an address assigned to a residential ISP in the United States, means having greater access and trust when it comes to country-restricted media (like Netflix) and trust-based filtering (like firewalls), compared to an address assigned to a third world country or an address belonging to a shared pool used by a public cloud provider. This is especially the case with email spam filtering, which usually involves extensive IP blacklists in addition to rejecting all traffic from residential ISPs and shared public cloud IP pools.

Finally, let’s not forget those countries whose governments choose to filter or turn off the internet entirely on some occasions. But they have bigger things to worry about than net neutrality anyway.

So, is the internet doomed? Not quite. It’s already well known that last mile ISPs suck and IPv4 exhaustion sucks and IP-based filtering sucks. But as consumers, we still need strong government regulation on residential internet service providers, just like we need regulation on any monopolistic market. People often say that the internet is an artifact, not an invention. So we all share a responsibility to make it better, but we should try to do so without idealistic platitudes and misleading slogans.

  1. I’m kidding, of course. ↩︎
  2. This seems to be quite common with any kind of activism where people try to get involved via the internet. ↩︎
  3. This is not the only way to sell networking capacity. Many ISPs charge based on total data transfer and allow you to send/receive as fast as technically possible. This is especially common with cellular ISPs and cloud hosting. ↩︎
  4. “Business transactions” ↩︎
  5. Read all about it on ↩︎

Feel bad

The hardest truth I had to learn growing up is that not everyone is interested in the truth, and maybe that’s okay. I’m not talking about material truths like the mounting evidence for climate change or the incredible importance of space exploration. There’s nothing to debate about material truths, and while some people dispute them, you can hardly argue that ignoring them is okay. I’m talking about truths about people. I’m talking about whether our laws are fair to the poor. I’m talking about whether certain races or genders or social classes are predisposed to be better or worse at their jobs. I’m talking about whether regulated capitalism is the best we can ask for, whether crime is a racial issue, whether homosexuality is a mental illness, whether the magic sky man really does watch everything you do, whether abortion is baby murder, whether Steve Jobs would like the new iPhone, and whether we’re cold-hearted monsters for selling guns to people who can’t make up their minds about any of these issues in a country where people die from preventable gun violence every day. I want to know why these questions cause so many problems, and I want to know why, in many cases, it seems like the best thing to do is to ignore them.

Here’s some truth for you: I don’t particularly mind if nobody reads this blog, and yet I know that some people do. I know there’s spillover traffic from the number one most popular page on this site, and although I haven’t checked Google Analytics in a few months, I know that the spillover is enough so that every few hours, some poor sucker winds up on one of these blog posts and miraculously reads the whole thing. I could just as happily write to a quiet audience as I could to nobody at all, because ultimately I just like the idea of publishing my thoughts to the public record. Here’s another truth: I’m not wearing pants (I almost never wear pants at home). I finished my dinner and finished washing the dishes and wiping down the counter and now I’m sitting in the dark with no pants on, wondering whether it’s time for a new laptop. More truths: the deadlines at work are getting to my head, even though I won’t admit it. I don’t exactly like my work, but I do like my workplace, and the thought of leaving is too frightening to even consider right now. I feel like my list of things to do has been growing endlessly ever since I graduated from college, and I try not to think of what things will be like in a year, because I’m afraid I won’t like the answer.

When I was around 15 years old, I started getting all kinds of ideas in my head, and the biggest one was that the internet could teach me all kinds of truths that nobody would teach me before. At that moment, I felt like my brain had unlocked an extra set of cores1. I thought that, given enough time, a mind in isolation could discover all the grand truths of the universe. I wondered if that mind could be my mind, and I couldn’t wait to get started.

Some of the discovery was just disgusting stuff. I went on the internet searching for the nastiest, most shocking images I could find, because I thought they would help me better understand how people’s brains worked. After all, images were just a bunch of colorful pixels. I thought it was a form of intellectual impurity to think that some images, and the ideas they depicted, should be off-limits to a curious mind. So, forcible desensitization was the only way.

But I actually spent most of my time reading about science. I learned as much physics as Wikipedia and Flash web applets could teach a teenager…which turned out to be mostly unhelpful when the time came to learn actual physics. I learned about aviation accidents and nuclear proliferation and ancient humans. I learned about the history of American space exploration and how modern religions were developed. I learned that people in the United States lived quite differently than most of the world, in more ways than what wealth alone could explain. But most importantly, I learned (what I thought were) several uncomfortable truths about people, particularly truths that many people denied2.

For many years, I went on carrying these ideas quietly in my head. They made me cynical and disdainful toward other people who didn’t think the way I did, because in my head, they weren’t just different. They were wrong. I can always identify other people who had the same experience growing up (or at least I imagine that I can). They all have the same dead-fish look in their eyes that says “you’re too afraid to face the truth, but I’m not”. And frankly, I very much still miss those days when all I needed was a computer and a deeply rooted misbelief to disprove.

I’m surprised that I ever grew out of that phase. I suppose everyone grows out of everything, sooner or later. But it’s not that my beliefs have changed or weakened. Rather, I realized that most of my cynicism was based on the false premise that nothing was more important than the truth.

It’s hard to look past a truth-centric view of the world when people spend so much time just trying to prove other people wrong. But as a matter of fact, the ultimate goal (the “objective function”) for a human isn’t how right you are3. People want simpler things. They want to feel good and they want the reassurance that they’ll continue to feel good in the future. They want to feel accepted and live productive purposeful lives. Finding the truth can help advance these objectives, but it is not itself a goal. At the end of the day, most of us are tired from working a full-time job and can’t be bothered to care so much. After all, if a couple of white lies make lots of people happier, who’s to say that that’s not okay?

Parking lot at Castle Rock State Park, California.

  1. An extra lobe of gray matter? ↩︎
  2. Like how the government is mind controlling people with cell towers and chemtrails. ↩︎
  3. Yes, even for people who really enjoy being right. ↩︎

The serving path of Hubnext

I’ve always liked things straight and direct to the point. But I feel a bit silly writing that here, since most of my blog posts from high school used a bunch of long words for no reason. My writing did eventually become more concise, but it happened around the time I also sort of stopped blogging. I stopped writing altogether for a few years, but then in college, I started doing an entirely different kind of writing. I wrote homework assignments and class announcements for the computer science classes as a teaching assistant. Technical writing is different from blogging, but writing for a student audience also has its own unique challenges. The gist of it is that students don’t like reading, which is naturally at odds with the long detailed project specs that some of our projects require. It’s the instructor’s responsibility to make sure the important details stand out from the boring reference material.

In college, I also started eating lunch and dinner on the go regularly, which is something I had never really done before. It was a combination of factors that made walk-eat meals so enticing. The campus was big, so getting from place to place usually involved a bit of walking. Street level restaurants sold fast food in convenient form factors. Erratic scheduling meant it was unusual to be with another person for more than a couple hours at a time, so sit-down meals didn’t always make sense. I got really good at the different styles of walk-eating, from gnawing on burritos without spilling to balancing take-out boxes between my fingers. Eating on my feet felt like the most distilled honest form of feeding. It didn’t make sense to add so many extra steps to accomplish essentially the same task.

For a long time, the serving path of this website was too complicated. The serving path is all the things directly involved in delivering web pages to visitors, which can be contrasted with auxiliary systems that might be important, but aren’t immediately required for the website to work. I made my first website more than 12 years ago. Lots of things have changed since then, both with me and with the ecosystem of web development in general. So when I set out to reimplement RogerHub, I wanted to eliminate or replace as many parts of the serving path as I could.

Mission Beach in San Francisco.

Let’s start with the basics. Most of the underlying infrastructure for RogerHub hasn’t changed since I moved the website from Linode to Google Cloud Platform in January 2016. I use Google’s Cloud DNS and HTTP(S) Load Balancing to initially capture inbound traffic. Google’s layer 7 load balancing provides global anycast, geographic load balancing, TLS termination, health checking, load scaling, and HTTP caching, among other things. It’s the best offering available at its low price point, so there’s little reason to consider alternatives1. However, HTTP(S) Load Balancing did have a 2 hour partial outage last October. I don’t get much traffic in October, but the incident made me start thinking of ways I could mitigate a similar outage in the future.

Behind the load balancer, RogerHub is served by a number of functionally identical backend servers (currently 2). These servers are fully capable of serving user traffic directly, but they currently only accept traffic from Google’s designated load balancer IP ranges. During peak seasons, I usually scale this number up to 3, but the extra server costs me about $50 a month, so I usually prefer to run just 2. These extra servers exist entirely for redundancy, not capacity. A single server could serve a hundred times peak load with no problem2.

Each server is a plain old Ubuntu host running an instance of PostgreSQL and an instance of the Hubnext application. That’s it. There’s no reverse proxy, no memcached, no mail server, and not even a provisioner (Hubnext knows how to install and configure its own dependencies). Hubnext itself can run in multiple modes, most of which are for maintenance and debugging. But my web servers start Hubnext in “application” mode. When Hubnext runs in application mode, it starts an ensemble of concurrent tasks, only one of which is the web server. The others are things like a unique ID allocator, peer to peer data syncing, system health checkers, maintenance daemons, and half a dozen kinds of in-memory caches. Hubnext can renew its own TLS certificate, create backups of its data, and even page me on my iPhone when things go wrong. Before Hubnext, these tasks were handled by haphazard cron jobs and independent services, which were set up and configured by a Puppet provisioner. Keeping all these tasks in a single application has made developing new features a lot simpler.

So why does Hubnext still need PostgreSQL? It’s true that Hubnext could have simply kept track of its own data, along with maintaining the various indexes that make common operations fast. But it’s an awful lot of work and unneeded complexity to implement something that a database already does for free. Of all the components of a traditional website’s architecture, I chose to keep the database, because I think PostgreSQL pulls its weight more than any of the other systems that Hubnext does supplant. That being said, Hubnext intentionally doesn’t use the transactions or replication that PostgreSQL provides (i.e. the parts of a database most sensitive to failure). Instead, Hubnext’s data model is designed to work without multi-statement transactions and Hubnext performs its own application level data replication, which is probably easier to configure and troubleshoot than database replication3.

When a request arrives at a Hubnext server, it gets wrapped in a tracker (for logging and metrics) and then enters the web request router. Instead of matching by request path, the Hubnext request router gives each registered web handler a chance to accept responsibility for a request before it falls through to the next handler. This allows for more complex and free-form routing logic. The web router starts with basic things like an OPTIONS handler and DoS rejection. The vast majority of requests are handled by the Hubnext blob cache and page cache. These handlers keep bounded in-memory caches of popular blobs (images, JavaScript, and CSS) and pre-rendered versions of popular pages. But even on a cache hit, the server still runs some code to process things like cache headers, ETags, and compression.

Blobs whose request path starts with /_version_ get long cache expiration times, which instructs Google Cloud CDN to cache them. Overall, about 40% of the requests to RogerHub are served from the CDN without ever reaching Hubnext. You can distinguish these cache hits by their Age header. ETags are generated from the SHA256 hash of the blob’s path and payload. Compression is applied to blobs based on a whitelist of compressible MIME types. The blob cache holds both the original and the gzipped versions of the payload, along with pre-computed ETags and other metadata, like the blob’s modification date.

All of the remaining requests4 eventually make their way to PostgreSQL for content. As I said in a previous blog post, Hubnext stores all of its content in PostgreSQL, including binary blobs like images and PDFs. This isn’t a problem, because Hubnext communicates with its database over a local unix socket and most blobs are cached in memory anyway. This does prevent Hubnext from using sendfile to deliver blobs like filesystem-based web servers do, but there aren’t many large files hosted on RogerHub anyway.

At this point, nearly all requests have already been served by one of the previously mentioned handlers. But the majority of Hubnext’s web server code is dedicated to serving the remaining fraction. This includes archive pages, search results, new comment submissions, the RSS feed, and legacy redirect handlers. If all else fails, Hubnext gives up and returns a 404. The response is sent back to the load balancer over HTTP/1.1 with TLS5 and then forwarded on to the user. Thus completes the request.

Is the serving path shorter than you imagined? I think so6. More code sometimes just means additional maintenance burden and a greater attack surface. But altogether, Hubnext actually consists of far less code than the code it replaces. Furthermore, it’s nearly all code that I wrote and it’s all written in a single language as part of a single system. Say what you want about Go, but I think it’s the best language for implementing traditional CRUD services, especially those with a web component7. Hubnext is a distillation of what it means to deliver RogerHub reliably and efficiently to its audience, without the frills and patches of a traditional website. Anyway, I hope this post was a good distraction. Until next time!

  1. Plus, I happen to work for Google (but I didn’t when I first signed up with GCP), so I hear a lot about these services at company meetings. ↩︎
  2. If required, Hubnext could theoretically run on 10-20 instances with no problem. But the peer to peer syncing protocol is meant to be a fully connected graph, so at some point, it might run into issues with too much network traffic. ↩︎
  3. Hubnext also doesn’t use PostgreSQL’s backup tools, because Hubnext can create application-specific backups that are more convenient and understandable than traditional database backups. ↩︎
  4. Some pages simply aren’t cached, like search result pages and 404s. ↩︎
  5. Google HTTP(S) Load Balancing uses HTTP/1.1 exclusively when forwarding requests to backend servers, although it prefers HTTP/2 on the front end. ↩︎
  6. But for several years, RogerHub just ran as PHP code on cheap shared web hosts, which makes even this setup seem overly complicated. But it’s actually quite a bit of work to run and maintain a modern PHP environment. You’ll probably need a dozen C extensions and a reverse proxy and perhaps a suite of cron jobs to back up all your files and databases. For a long time, I followed a bunch of announcement mailing lists to get news about security fixes in PHP and WordPress and MediaWiki and NGINX and MySQL and all the miscellaneous packages on ubuntu-security-announce. This new serving path means I really only need to keep an eye out on security announcements for Go (which tend to be rare and fairly mild) or OpenSSH or maybe PostgreSQL. ↩︎
  7. After all, that covers a lot of what Google does. ↩︎

Something else

I talk a lot about one day giving up computers and moving out, far, far away to a place where there aren’t any people, in order to become a lumberjack or a farmer. Well, lately it’s been not so much “talking” as instant messaging or just mumbling to myself. Plus, I don’t have the physique to do either of those things. My palms are soft and I’m prone to scraping my arms all the time. I like discipline as a virtue, but I also don’t really like working. And finally, my hobby is proposing stupid outlandish half-joking-half-serious expensive irresponsible plans. It’s fun, and I guess you can’t really say something’s a bad idea until you’ve thought it through yourself.

Joking aside, my motivation comes from truth. Computers are terrible. It becomes more and more apparent to me every year that passes. I want to get far, far away from them. They’re the source of most of my problems1. And lately, it has become clear that lots of other people feel the same way.

Terribleness isn’t the computer’s fault. It all comes from a fundamental disconnect between what people think their computers are and the terrible reality of it all. A massive fraction of digital data is at constant risk of being lost forever, because it’s stored without redundancy on consumer devices scraped from the bottom of the barrel. Critical security vulnerabilities lurk in every computer system. Most of the time, they aren’t discovered simply because nobody has bothered looking hard enough. But given the lightest push, so much software2 just falls apart. You can break all kinds of software just by telling it to do two things at once. Load an article and swipe back at the same time? Bam! Your reddit app UI is now unusable. Submit the same web form from two different tabs? Ta-da! Your session state is now corrupt.

At this moment, my Comcast account is fucked beyond repair after a series of service cancellations, reinstatements, and billing credits put it in an unusual state. I needed to order internet service at my next apartment. I couldn’t do it on the website, probably because of the previous issues. I also didn’t think it was worthwhile explaining to yet another customer service rep why my account looked the way it did. So, I just made a new account and ordered it there3.

Boats at Fisherman's Wharf in Monterey Bay.

You can get rid of all these problems if you try hard enough. Some people simply don’t own any data. Their email and photos and documents and (nonexistent) repositories of old code just come and go along with the devices they’re stored on. They don’t have data at risk of compromise, because they don’t have any digital secrets and important computer systems (like their bank) just have humans double-checking things every step of the way.

Alternatively, you could put expiration dates on your data. Email keeps for 2 years. Photos: five. Sort your data into annual folders, and when the expiration date passes, simply delete it. This strategy takes the focus off of maintaining perfect data integrity and opens up a new method to measure your success. At the end of the year, if you never had trouble finding an old photo and your credit score was doing alright, then declare success.

Maybe you could live somewhere where people don’t really need computers. Maybe they have a phone that only makes calls and a nice big TV and maybe some books and paper and stuff to do outside. Or maybe they have iPads too—little TVs that you can touch.

Or you could simply come to terms with the way things are. After all, your data only needs to last as long as your own body does.

A rusty old tractor.

I stayed on a farm over Memorial Day weekend. It was owned by a couple. They had a tractor and some trucks and dogs and a jacuzzi and two little huts to rent out on Airbnb. I wonder if they had computers at their farm, or if they just had iPads.

Do they enjoy having a lot of land? Or is it wearisome waking up to the same rusting trucks and dirt roads every day? They probably have a lot of privacy, on account of their lack of upstairs, downstairs, and adjacent neighbors in their non-apartment home. Although, that’s probably less true if internet strangers are renting out their cabins all the time.

They had rows of crops, like the ones you see along highway 5 in central California. I wasn’t sure exactly who they belonged to, since their nearest neighbor was a solid 10 minute walk down the road. I wonder if they’re all friends.

My Subaru Impreza parked outside a cabin.

I spend a lot of my weekends working on my personal infrastructure. I can’t quite explain what it is I’m working on or why it needs work at all. It’s a combination of data management, software sandboxing, and build systems. At some point, I have to wonder if I’m spending more time working on my tools than actually using those tools to do stuff.

I also have a bunch of to-do’s on my Todoist account. The length of those lists has grown considerably since I graduated college a year ago4. I can’t believe it’s already been more than a year since then. How soon will it be two years? Did I live this past year right? Did I accomplish enough things?

Fields in Salinas.

I guess the answer really depends on how you measure “enough”. A year ago, I told myself that I wanted to become the best computer programmer that I could be. I feel like I’ve made lots of progress on that front. I picked up lots of good habits at work and my personal infrastructure is better than it’s ever been. I did a lot of cool stuff with RogerHub, and it finally has infrastructure that I can be proud of. But somehow, I’m also more dissatisfied with my computing than ever before.

I’m still living in the same area, and I’ve just committed to living here for yet another year. My plan had always been to stay in the Bay Area for three or four years, and then move away. But where to, and what for? Right now, I can’t even imagine what it’d be like to leave California5.

I’ve gotten better at cooking. More specifically, I think I’ve gotten more familiar with what kinds of food I like. On most days, I cook dinner for myself at home, and then I watch YouTube and browse the internet until I have to go to sleep. If it’s May or December, I also have to answer a few math questions.

But I haven’t made any new friends in the last year, other than my direct coworkers. I’ve spent more free time at home than ever before. I spend a small fortune on rent, after all. I might as well try to get some value out of it.

Bread and strawberries on a table.

Three years ago, I remember thinking that I didn’t know any adults who were actually happy, from which I concluded that growing up just kind of sucked.

I rarely drive anymore. I used to think driving at night on empty roads was nice and relaxing. But the roads around here just feel crowded and dangerous.

I’ll continue to work on computers, because that’s all I’m good at. More often than not, the last thought in my head before I fall asleep is an infinite grid of computers. On the bright side, I fixed my chair today and I bought a new toilet seat. As long as there’s a trickle of new purchases arriving at my doorstep, maybe things will be alright.

This post was kind of a downer. I’ll return with more website infrastructure next week6.

And since I haven’t posted a photo of me in a while, here’s a recent one:

Roger sitting by the ocean.

The photo on my home page doesn’t even look like me anymore. But I like it as an abstract icon, so it’ll probably stay there for a while, until I change it to something else.

  1. Which, I suppose, is pretty fortunate compared to the alternative. ↩︎
  2. You might blame software instead of the computer itself, but to most people, they’re the same thing. ↩︎
  3. I’m grateful I have such easy access to additional email addresses and phone numbers. ↩︎
  4. In fact, I only recently discovered that Todoist imposes a 200 task limit per project. ↩︎
  5. In any case, I realized this past weekend that I have way too many physical belongings to haul around, if I ever want to move. ↩︎
  6. I backdated this to July 31st, since that’s when I wrote most of the content. ↩︎

Training angst

Have you ever used Incognito Mode because you wanted to search for something weird, but you didn’t want it showing up in targeted ads? Or have you ever withheld a Like from a YouTube video, because although you enjoyed watching it, you weren’t really interested in being recommended more of the same? I have. And since I can’t hear you, I’ll assume you probably have too. People have gotten accustomed to the idea of “training” their computers to behave how they want, much like you’d train a dog or your nephew. And whether you study computer science or psychology or ecology or dog stuff, the principles of reinforcement learning are all about the same.

The reason you don’t search weird stuff while logged in or thumbs-up everything indiscriminately is that you’re trying to avoid setting the wrong example. But occasional slip-ups are a fact of life. To compensate, many machine learning models offer mechanisms to deal with erroneous (“noisy”) labels in training data. The constraints of a soft-margin SVM include slack variables, whose penalty represents a trade-off between classification accuracy and resilience against mislabeled examples. Because a computer can’t tell which of its training examples are mislabeled and which are simply unusual, it does the next best thing: each example can be rated based on how “believable” it is in comparison to other examples. Then, finding the optimal parameters is simply a matter of minimizing unbelievability.
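For reference, the standard soft-margin SVM formulation, where each slack variable ξᵢ lets example i violate the margin and the constant C sets the trade-off described above:

```latex
\begin{aligned}
\min_{w,\,b,\,\xi}\quad & \tfrac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i \\
\text{subject to}\quad & y_i\,(w \cdot x_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0 .
\end{aligned}
```

A large C makes the model trust its labels; a small C makes it shrug off the occasional bad one.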

Avoiding bad examples is in your best interest if you want the algorithm to give you the best recommendations. So, your YouTube Liked Videos list is probably only a rough approximation of the videos that you actually like1. Now, a computer algorithm won’t mind if you lie (a little) to it. But the real tragedy is that as a side effect, YouTube has effectively trained you, the user, to give Likes not to the videos you actually like, but to the kinds of videos you want recommendations for. In fact, most kinds of behavioral training induce this kind of reverse effect. The trainer lies a little bit to the trainee, in order to push the training in the right direction. And in return, the trainer drifts a little farther from the truth.

Parents do this to their children. Friends do it to their friends. Even if you try to be honest, the words you say and the reactions you make end up deviating ever so slightly from the truth, because you can’t help but think that your actions will end up in somebody’s brain as subconscious behavioral training examples2. If your friend invites you to do something you don’t want to do, maybe you’ll say yes, or else they might not even ask next time. And if they say something you don’t like, maybe you’ll act angrier than you really are, so they won’t mention it ever again. Every decision starts with “how do I feel about this?”, but is quickly followed up with “how will others feel about my feeling about this?”. This isn’t plain old empathy. It’s true that human-to-human behavioral training helps people get along with each other. But when our words and actions are influenced by how we think they’ll affect someone else’s behavior, they end up being fundamentally just another form of lying. And unlike a computer recommendation algorithm, people might actually hate you for lying to them.

This has been weighing on my mind for a long time. I think it’s unbearably hard for adults to be emotionally honest with each other, even for close friends or family. But the problem isn’t with the words we say. Of course, people want others to think of them a certain way, whether it’s about your money or your job or your passions or romance or mental health. And people lie about those things all the time. That isn’t what keeps me up at night. What bothers me is that even when you’re trying your best to be absolutely honest with someone, you can’t. You say the right words, but they don’t sound right. You feel the right feelings, but your face isn’t cooperating. Your eyes get hazy from years of emotional cruft, and you’re no longer able to really see the person right in front of you. And it’s all because we spend every day training each other with almost-truths.

A flower.

  1. The Like button is actually short for the “recommend me more videos Like this one” button. ↩︎
  2. Have you watched Bojack Horseman? ↩︎

Data loss and you

My laptop’s hard drive crashed in 2012. I was on campus walking by Evans Hall, when I took my recently-purchased Thinkpad x230 out of my backpack to look up a map (didn’t have a smartphone), only to realize it wouldn’t boot. This wasn’t a disaster by any means. It set me back $200 to rush-order a new 256GB Crucial M4 SSD. But since I regularly backed up my data to an old desktop running at my parents’ house, I was able to restore almost everything once I received it1.

I never figured out why my almost-new laptop’s hard drive stopped working out of the blue. The drive still spun up, yet the system didn’t detect it. But whether it was the connector or the circuit board, that isn’t the point. Hardware fails all the time for no reason2, and you should be prepared for when it happens.

Data management has changed a lot in the last ten years, primarily driven by the growing popularity of SaaS (“cloud”) storage and greatly improved network capacity. But one thing that hasn’t changed is that most people are still unprepared for hardware failure when it comes to their personal data. Humans start manufacturing data from the moment they’re born. Kids should really be taught data husbandry, just like they’re taught about taxes and college admissions and health stuff. But anyway, here are a few things I’ve learned about managing data that I want to share:

Identify what’s important

Data management doesn’t work if you don’t know what you’re managing. In other words, what data would make you sad if you lost access to it? Every day, your computer handles massive amounts of garbage data: website assets, Netflix videos, application logs, PDFs of academic research, etc. There’s also the data that you produce, but don’t intend to keep long-term: dash cam and surveillance footage (it’s too big), your computer settings (it’s easy to re-create), or your phone’s location history (it’s too much of a hassle to extract).

For most people, important data is the data that’s irreplaceable. It’s your photos, your notes and documents, your email, your tax forms, and (if you’re a programmer) your enormous collection of personal source code.

Consider the threats

It’s impossible to predict every possible bad thing that could happen to your data. But fortunately, you don’t have to! You can safely ignore all the potential data disasters that are significantly less likely to occur than your own untimely death3. That leaves behind a few possibilities, roughly in order of decreasing likelihood:

  • Hardware failure
  • Malicious data loss (somebody deletes your shit)
  • Accidental data loss (you delete your shit)
  • Data breach (somebody leaks your shit)
  • Undetected data degradation

Hardware failures are the easiest to understand. Hard drives (external hard drives included), solid state drives, USB thumb drives, and memory cards all have an approximate “lifespan”, after which they tend to fail catastrophically4. The rule of thumb is 3 years for external hard drives, 5 years for internal hard drives, and perhaps 10 years for enterprise-grade hard drives.

Malicious data loss has become much more common these days, with the rise of a digital extortion scheme known as “ransomware”. Ransomware encrypts user files on an infected machine, usually using public-key cryptography in at least one of the steps. The encryption is designed so that the infected computer can encrypt files easily, but is unable to reverse the encryption without the attacker’s cooperation (which is usually made available in exchange for a fee). Fortunately, ransomware is easily detectable, because the infected computer prompts you for money once the data loss is complete.

On the other hand, accidental data loss can occur without anybody noticing. If you’ve ever accidentally overwritten or deleted a file, you’ve experienced accidental data loss. Because it can take months or years before accidental data loss is noticed, simple backups are sometimes ineffective against it.

Data breaches are a unique kind of data loss, because they don’t necessarily mean you’ve lost access to the data yourself. Some kinds of data (passwords, tax documents, government identification cards) lose their value when they become available to attackers. So, your data management strategy should also identify which of your data is confidential.

Undetected data degradation (or “bit rot”) occurs when your data becomes corrupted (either by software bugs or by forces of nature) without you noticing. Modern disk controllers and file systems can provide some defense against bit rot (for example, in the case of bad sectors on a hard disk). But the possibility remains, and any good backup strategy needs a way to detect errors in the data (and also to fix them).

Things you can’t back up

Backups and redundancy are generally the solutions to data loss. But you should be aware that there are some things you simply can’t back up. For example:

  • Data you interact with, but can’t export. For example, your comments on social media would be difficult to back up.
  • Data that’s useless (or less useful) outside of the context of a SaaS application. For example, you can export your Google Docs as PDFs or Microsoft Word files, but then they’re no longer Google Docs.

Redundancy vs backup

Redundancy is buying 2 external hard drives, then saving your data to both. If either hard drive experiences a mechanical failure, you’ll still have a 2nd copy. But this isn’t a backup.

If you mistakenly overwrite or delete an important file on one hard drive, you’ll probably do the same on the other hard drive. In a sense, backups require the extra dimension of time. There needs to be either a time delay in when your data propagates to the backup copy, or better yet, your backup needs to maintain multiple versions of your data over time.

RAID and erasure coding both offer redundancy, but do not count as a backup.

Backups vs archives

Backups are easier if you have less data. You can create archives of old data (simple ZIP archives will do) and back them up separately from your “live” data. Archives make your daily backups faster and also make it easier to perform data scrubbing.

When you’re archiving data, you should pick an archive format that will still be readable in 30 to 50 years. Proprietary and non-standard archive tools might fall out of popularity and become totally unusable in just 10 or 15 years.

Data scrubbing

One way to protect against bit rot is to periodically check your data against known-good checksums. For example, if you store cryptographic checksums with your files (and also digitally sign the checksums), you can verify the checksums at any time and detect bit rot. Make sure you have redundant copies of your data, so that you can restore corrupted files if you detect errors.

I generate SHA1 checksums for my archives and sign the checksums with my GPG key.

Failure domain

If your backup solution is 2 copies on the same hard drive, or 2 hard drives in the same computer, or 2 computers in the same house, then you’re consolidating your failure domain. If your computer experiences an electrical fire or your house burns down, then you’ve just lost all copies of your data.

Onsite vs offsite backups

Most people keep all their data within a 20 meter radius of their primary desktop computer. If all of your backups are onsite (e.g. in your home), then a physical disaster could eliminate all of the copies. The solution is to use offsite backups, either by using cloud storage (easy) or by stashing your backups at a friend’s house (pain in the SaaS).

Online vs offline backups

If a malicious attacker gains access to your system, they can delete your data. But they can also delete any cloud backups5 and external hard drive backups that are accessible from your computer. It’s sometimes useful to keep backups of your data that aren’t immediately deletable, either because they’re powered off (like an unplugged external hard drive) or because they’re read-only media (like data backups on Blu-ray Discs).


Encryption

You can reduce your risk of data leaks by applying encryption to your data. Good encryption schemes are automatic (you shouldn’t need to encrypt each file manually) and thoroughly audited by the infosec community. And while you’re at it, you should make use of your operating system’s full disk encryption capabilities (FileVault on macOS, BitLocker on Windows, and LUKS or whatever on Linux).

Encrypting your backups also means that you could lose access to them if you lose your encryption credentials. So, make sure you understand how to recover your encryption credentials, even if your computer is destroyed.

Online account security

If you’re considering cloud backups, you should also take steps to strengthen the security of your account:

  • Use a long password, and don’t re-use a password you’ve used on a different website.
  • Consider using a passphrase (a regular English sentence containing at least 4-5 uncommon words). Don’t share similar passphrases for multiple services (like “my facebook password”), because an attacker with access to the plaintext can easily guess the scheme.
  • Turn on two-factor authentication. The most common 2FA scheme (TOTP) requires you to type in a 6-8 digit code whenever you log in. You should prefer to use a mobile app (I recommend Authy) to generate the code, rather than to receive the code via SMS. Don’t forget to generate backup codes and store them in a physically secure top-secret location (e.g. underneath the kitchen sink).
  • If you’re asked to set security questions, don’t use real answers (they’re too easy to guess). Make up gibberish answers and write them down somewhere (preferably a password manager).
  • If your account password can be recovered via email, make sure your email account is also secure.

Capacity vs throughput

One strong disadvantage of cloud backups is that transfers are limited to the speed of your home internet, especially for large uploads. For example, transferring 1TB over a 20 Mbit/s connection takes almost five days. Backups are less useful when they take days or weeks to restore, so be aware of how your backup throughput affects your data management strategy.

This problem also applies to high-capacity microSD cards and hard drives. It can take several days to fully read or write a 10TB data archival hard drive. Sometimes, smaller but faster solid state drives are well worth the investment.

File system features

Most people think of backups as “copies of their files”. But the precise definition of a “file” has evolved rapidly just as computers have. File systems have become very complex to meet the increasing demands of modern computer applications. But the truth remains that most programs (and most users) don’t care about most of those features.

For most people, your “files” refers to (1) the directory-file tree and (2) the bytes contained in each file. Some people also care about file modification times. If you’re a computer programmer, you probably care about file permission bits (perhaps just the executable bit) and maybe symbolic links.

But consider this (non-exhaustive) list of filesystem features, and whether you think they need to be part of your data backups:

  • Capitalization of file and directory names
  • File owner (uid/gid) and permission bits, including SUID and sticky bits
  • File ACLs, especially in an enterprise environment
  • File access time, modification time, and creation time
  • Extended attributes (web quarantine, tags, “hidden”, and “locked”)
  • Resource forks, on macOS computers
  • Non-regular files (sockets, pipes, character/block devices)
  • Hard links (and similar constructs, like macOS “aliases” and NTFS “junctions”)
  • Executable capabilities (maybe just CAP_NET_BIND_SERVICE?)

If your answer is no, no, no, no, no, what?, no, no, and no, then great! The majority of cloud storage tools will work just fine for you. But the unfortunate truth is that most computer programmers are completely unaware of many of these file system features. So, they write software that completely ignores them.

Programs and settings

Programs and settings are often left out of backup schemes. Most people don’t have a problem reconfiguring their computer once in a while, because catastrophic failures are unlikely. If you’re interested in creating backups of your programs, consider finding a package manager for your preferred operating system. Computer settings can usually be backed up with a combination of group policy magic for Windows and config files or /usr/bin/defaults for macOS.

Application-specific backup

If you’re backing up data for an application that uses a database or a complex file-system hierarchy, then you might be better served by a backup system that’s designed specifically for that application. For example, RogerHub runs on a PostgreSQL database, which comes with its own backup tools. But RogerHub uses an application-specific backup scheme that’s tailored to RogerHub specifically.


Testing your backups

A backup isn’t a backup until you’ve tested the restoration process.


If you’ve just skipped to the end to read my recommendations, fantastic! You’re in great company. Here’s what I suggest for most people:

  • Use cloud services instead of files, to whatever extent you feel comfortable with. It’s most likely not worth your time to back up email or photos, since you could use Google Inbox or Google Photos instead.
  • Create backups of your files regularly, using the 3-2-1 rule: 3 copies of your data, on 2 different types of media, with at least 1 offsite backup. For example, keep your data on your computer. Then, back it up to an online cloud storage or cloud backup service. Finally, back up your data periodically to an external hard drive.
  • Don’t trust physical hardware. It doesn’t matter how much you paid for it. It doesn’t matter if it’s brand new or if you got the most advanced model. Hardware breaks all the time in the most unpredictable ways.
  • Don’t buy an external hard drive or a NAS as your primary backup destination. They’re probably no more reliable than your own computer.
  • Make sure to use full-disk encryption and encrypted backups.
  • Make sure nobody can maliciously (or accidentally) delete all of your backups, simply by compromising your primary computer.
  • Consider making archives of data that you use infrequently and no longer intend to modify.
  • Secure your online accounts (see the section titled “Online account security”).
  • Pat yourself on the back and take a break once in a while. Data management is hard stuff!

If you find any mistakes on this page, let me know. I want to keep it somewhat updated.

And, here’s yet another photo:


  1. My laptop contained the only copy of my finished yet unsubmitted class project. But technically I had a project partner. We didn’t actually work together on projects. We both finished each project independently, then just picked one version to submit. ↩︎
  2. About four and a half years later, that M4 stopped working and I ordered a MX300 to replace it. ↩︎
  3. That is, unless you’re interested in leaving behind a postmortem legacy. ↩︎
  4. There are modes of failure other than total catastrophic failure. ↩︎
  5. Technically, most reputable cloud storage companies will keep your data for some time even after you delete it. If you really wanted to, you could explain the situation to your cloud provider, and they’ll probably be able to recover your cloud backups. ↩︎