Estimated taxes

I forgot to pay my taxes. Yes, I know it’s September, and no, I’m not talking about my annual tax return. For the last few months, I forgot to pay estimated tax, which is a thing that some individuals and corporations in the United States are required to pay, four times a year. Most people pay their taxes primarily in the form of payroll tax withholding, but as someone who is both self-employed and regularly employed, I basically have to estimate my remaining year-end tax burden and pay one-fourth of that every 3 months. These payments are subsequently deducted from my tax bill at the end of the year. If I don’t pay them, there’s an additional “underpayment penalty” of about 3%. Luckily, I can still make payments now and avoid most of that penalty. I typically set up calendar reminders to remember to pay my estimated taxes, but I had forgotten to set them up this year (and didn’t have the foresight to just make them recurring reminders). I’ve got periodic reminders to do lots of things: updating my checkbook every 2 months, exporting my Chrome bookmarks every month, scheduling car maintenance once a year, and going to the dentist… I should probably create a reminder for that. At this stage in life, I can hardly remember to do anything without automatic reminders. But reminders aside, I had never forgotten to pay my estimated taxes before.

Unexpected tax payments suck, especially when they cost more than a whole year’s worth of rent (have I already mentioned how rent is also absurdly high here?). But at the end of the day, taxes and penalties are only a matter of money. I’m actually more uncomfortable that this oversight is probably a red flag for how my life has been progressing so far this year. Not only did I forget to pay my taxes, but it took me more than 5 months to notice. It feels like half a year has flown by on autopilot. What else might I have forgotten?

I never forget to brush my teeth or eat breakfast in the morning. Fortunately, paying taxes isn’t part of my daily routine, but it’s still unusual for me to have forgotten about doing it entirely. I’d have expected my thoughts to just stumble upon the topic by chance, given enough time. Even so, I forgot. That must mean I just haven’t had enough free time lately, right? But I can’t imagine that truthfully. Although my work day has gotten busier since last autumn, I’ve made sure to keep strict business hours, which leaves me plenty of leisure time in the evenings and on the weekends. I have time to cook dinner, exercise, watch anime, and practice lots of social distancing. It would be awfully convenient if I could blame my forgetfulness on a busy schedule, but I really don’t think that’s right.

I’ve been watching a lot more YouTube this year. Since the pandemic began, I’ve found comfort in the constant background noise of humans speaking (even if most of that speaking is actually incomprehensible Japanese). I listen while driving, while making dinner, while brushing my teeth, and even while taking a shower. For most people, these solo activities offer important opportunities to spend time alone with their thoughts. While the hands are busy performing easy routine tasks, the mind is allowed to wander. Did I inadvertently ruin this time for myself by filling it with YouTube videos? Instead of listening to my own thoughts, I listen to tech news, legal commentary, urban design, history, and seiyuu radio programs. As a result, there’s hardly any time in my day anymore when I’m not working, sleeping, or being entertained.

Perhaps free time and leisure time are different things. Free time ought to mean sitting still, not looking at anything, not listening to anything, and not needing to focus on anything in particular. It’s different from leisure time, which occupies your thoughts with entertainment. It’s also different from meditation, which requires focus and mindfulness. Free time offers a chance for your mind to stop working, to just be idle, and to think about whatever it wants. I think it’s important to protect your free time and allow your mind to take breaks. Just like an overutilized CPU starves low priority threads, an overutilized mind easily forgets nonessential thoughts—remembering your friend’s birthday, calling home, visiting the dentist, and paying taxes.

How long have I been at 100% utilization? Probably at least a few months. Nobody likes sitting idle, so we fill our time with pointless activities to stave off the boredom. Time rushes forward. Take a paid vacation to catch up, but waste the whole time being stressed about what to do. After all, you’re only allowed to make plans until the day you return to work. Maximize leisure. Never sit still. When you return, rejoice because you missed working, not because anything has materially improved. Maybe what you really needed wasn’t a vacation, but just a chance to properly reflect on your thoughts (and then also a vacation). It won’t make time move any slower, but it might offer a chance to change its course once in a while.

It’s the first time I haven’t used a single footnote in a while. ↩︎

Network traffic analysis

In the last 8 months, I’ve captured and analyzed 4.4 billion network packets from my home internet connection, representing 3833 GiB of data. Life basically occurs via the internet when you’re stuck at home. FaceTime calls, streaming anime, browsing Reddit, and even video conferencing for my job: all of it gets funneled into my custom network monitoring software for decoding, logging, and analysis. Needless to say, most internet communication uses encrypted TLS tunnels, which are opaque to passive network observers. However, there are a ton of interesting details exposed in the metadata. As network engineers and application programmers, we’ve repeated this adage to ourselves countless times, yet few people have actually inspected their own internet traffic to confirm exactly what kind of data they’re exposing this way. Developing my own network traffic analysis software has helped me solidify my understanding of network protocols. It’s also given me a really convenient network “watchtower”, from which I can easily do things like take packet dumps to debug my IPv6 configuration. I think anyone interested in computer networking should have access to such a tool, so in this post, I’ll describe my approach and the problems I encountered in case you’re interested in trying it out yourself.

The tap

Before I begin, I should probably remind you only to capture network traffic when you’re authorized to do so¹. Naturally, the world of network protocols is so exciting that it’s easy to forget, but some people are seriously creeped out by network surveillance. Personally, I probably do a lot more self-surveillance (including security cameras) than most people are comfortable with, but do keep that in mind in case you aren’t the only one using your home internet connection.

In any case, the first thing you need is a network tap. My router offers port mirroring, and I’m personally only interested in my WAN traffic, so I simply configured my router to mirror a copy of every inbound and outbound WAN packet to a special switch port dedicated to traffic analysis. If your router doesn’t support this, you can buy network tap devices that essentially do the same thing. On the receiving end, I plugged in my old Intel NUC — a small form factor PC equipped with a gigabit LAN port² and some decent solid state storage. Any decently fast computer would probably suffice, but if you can find one with two ethernet ports, that might save you some trouble later on. My NUC only had a single port, so I used it both as the management interface and for receiving the mirrored traffic.

Capture frames

At this point, sniffed network frames are arriving on your analysis device’s network interface card (NIC). However, you’ll need to do a bit more work before they can be delivered properly to your application for analysis. When my router mirrors a switch port, it sends a replica of the exact ethernet frame to the mirror target. This means that the MAC address of the original packet is preserved. Normally, NICs will simply discard packets with an irrelevant destination MAC address, but we can force the NIC to accept these packets by enabling “promiscuous mode”. Additionally, the operating system’s networking stack will try to decode the IP and TCP/UDP headers on these sniffed packets to route them to the correct application. This won’t work, of course, since the sniffed packets are typically intended for other computers on the network.

We want our network analysis software to configure the NIC in promiscuous mode and to receive all incoming packets, regardless of their IP and layer 4 headers. To accomplish this, we can use Linux’s packet sockets³. You’ll find plenty of filtering options in the man pages, but I’m using SOCK_RAW and ETH_P_ALL, which includes the ethernet header and does not filter by layer 3 protocol. Additionally, I add the socket to PACKET_MR_PROMISC on interface 2 (eth0 on my NUC), which enables promiscuous mode. These options typically require root privileges or at least special network capabilities, so you may need to elevate your privileges for this to work. Now that you’ve set up your socket, you can start calling recvfrom to grab frames.

Packet mmap

Fetching packets one at a time using recvfrom is perfectly fine, but it requires a context switch for each packet and may not be performant enough under heavy workloads. Linux provides a more efficient mechanism to use packet sockets called PACKET_MMAP, but it’s a bit tricky to set up, so feel free to skip this section and come back later. Essentially, packet mmap allows you to configure a ring buffer for sending and receiving packets on a packet socket. I’ll only focus on receiving packets, since we’re creating a network traffic analyzer. When packets arrive on the socket, the kernel will write them directly to userspace memory and set a status flag to indicate that it’s ready. This allows us to receive multiple packets without a context switch.

To set up packet mmap, I allocated a 128MiB ring buffer for receiving packets. This buffer was divided into frames of 2048 bytes each, to accommodate ethernet frames of approximately 1500 bytes. I then set up an epoll instance watching the packet socket to notify me when there were new packets available. A dedicated goroutine calls epoll_wait in a loop and reads frames from the ring buffer. It decodes the embedded timestamp, copies the bytes to a frame buffer, and resets the status flag. Once 128 frames are received, the batch of frames is sent to a Go channel to be decoded and analyzed by the pool of analysis goroutines.

I added a few more features to improve the robustness and performance of my sniffer goroutine:

Packet timestamps are compared to the wall clock. If the drift is too high, then I ignore the packet timestamp.
Frame buffers are recycled and reused.
The loop sleeps for a few milliseconds after each iteration. The duration of the sleep is based on the number of non-empty frames received. The sleep duration is adjusted with AIMD (up to 100ms) and targets an optimal “emptiness” of 1 in 64. This helps reduce wasted CPU cycles while ensuring the buffer never becomes full⁴.
Every hour, I use the PACKET_STATISTICS feature to collect statistics and emit a warning message if there were any dropped frames.

One final note about packet sniffing: my NIC automatically performs TCP offloads (segmentation offload on transmit and receive offload on receive), which can present multiple TCP packets as a single large frame based on their sequence numbers. Normally, this reduces CPU usage in the kernel with no drawback, but these combined frames easily exceed my 2048 byte frame buffers. It also interferes with the accurate counting of packets. So, I use ethtool to turn off offload features before my network analysis software starts.

Decode

At this point, you have a stream of ethernet frames along with their timestamps. To do anything useful, we need to decode them. There are lots of guides about decoding network protocols, so I won’t go into too much detail here. My own network analysis software supports only the protocols I care about: ethernet, IPv4, IPv6, TCP, UDP, and ICMPv6. From each frame, I build a Packet struct containing information like MAC addresses, ethertype, IP addresses, port numbers, TCP fields, and payload sizes at each layer. The trickiest part of this was decoding the ethernet header. Wikipedia will tell you about the different Ethernet header formats, but in practice, I observed only 802.2 LLC for the spanning tree protocol and Ethernet II for everything else. I also added support for skipping 802.1Q VLAN tags, even though I don’t currently use them on my network. I also learned that the IPv6 header format doesn’t tell you the start of the layer 7 data, but instead only gives you the type of the next header. To properly find the start of the payload, you’ll need to decode each successive IPv6 extension header to extract its next header and header length fields.
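
To make the IPv6 extension header walk concrete, here is a minimal sketch in Go. It is not the analyzer's actual decoder: the function name upperLayer is made up for this example, and it only handles the hop-by-hop, routing, fragment, destination options, and authentication headers. Anything else is treated as the upper-layer protocol.

```go
package main

import (
	"errors"
	"fmt"
)

const ipv6HeaderLen = 40 // fixed IPv6 header size in bytes

var errTruncated = errors.New("truncated IPv6 packet")

// upperLayer walks the extension header chain of an IPv6 packet (starting
// at the IPv6 header, not the ethernet header) and returns the protocol
// number of the first non-extension header and its byte offset.
func upperLayer(pkt []byte) (proto uint8, offset int, err error) {
	if len(pkt) < ipv6HeaderLen || pkt[0]>>4 != 6 {
		return 0, 0, errors.New("not an IPv6 packet")
	}
	proto = pkt[6] // Next Header field of the fixed header
	offset = ipv6HeaderLen
	for {
		switch proto {
		case 0, 43, 44, 51, 60: // extension headers handled below
		default:
			return proto, offset, nil // TCP, UDP, ICMPv6, etc.
		}
		if len(pkt) < offset+2 {
			return 0, 0, errTruncated
		}
		var hdrLen int
		switch proto {
		case 44: // fragment header is always 8 bytes
			hdrLen = 8
		case 51: // authentication header counts 4-byte units, minus 2
			hdrLen = (int(pkt[offset+1]) + 2) * 4
		default: // hop-by-hop, routing, destination options: 8-byte units, minus 1
			hdrLen = (int(pkt[offset+1]) + 1) * 8
		}
		if len(pkt) < offset+hdrLen {
			return 0, 0, errTruncated
		}
		proto = pkt[offset] // each extension header starts with its own Next Header
		offset += hdrLen
	}
}

func main() {
	// Fixed header + one 8-byte hop-by-hop header + 20 bytes of TCP.
	pkt := make([]byte, 68)
	pkt[0] = 0x60 // version 6
	pkt[6] = 0    // next header: hop-by-hop options
	pkt[40] = 6   // hop-by-hop's next header: TCP
	pkt[41] = 0   // hdr ext len 0 means 8 bytes total
	proto, offset, err := upperLayer(pkt)
	fmt.Printf("proto=%d offset=%d err=%v\n", proto, offset, err)
}
```

Note that for fragmented packets, only the first fragment actually carries the upper-layer header, so a real decoder would need to check the fragment offset before trusting the result.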

Local addresses

Since I’m analyzing internet traffic, I want to be able to categorize inbound versus outbound WAN packets. Both transmitted and received frames are combined into a single stream when mirroring a switch port, so I built heuristics to categorize a packet as inbound or outbound based on its source and destination IP addresses. For IPv4 packets, all WAN traffic must have either a source or destination IP address equal to the public IPv4 address assigned by my ISP. There are protocols like UPnP that allow a local device to discover the public IPv4 address, but the simplest solution was to just look it up using my existing DDNS record, so I went with that. For IPv6, I use the global unicast addresses configured on the network interfaces of the network traffic analyzer machine itself. The interface uses a /64 mask, but I expand this automatically to /60 since I request extra subnets via the IPv6 Prefix Length Hint from my ISP for my guest Wi-Fi network. Classifying inbound versus outbound IPv6 packets is as simple as checking whether the source IP or destination IP falls within my local IPv6 subnets⁵.

Duplicate frames

One disadvantage of using the same ethernet port for management and for traffic sniffing is that you end up with duplicate frames for traffic attributed to the analyzer machine itself. For outbound packets, my sniffer observes the frame once when it leaves the NIC and again when it gets mirrored as it’s sent on the WAN port. For inbound packets, my router forwards two copies of the packet that are nearly identical, except for their MAC addresses and the time-to-live field. To avoid double counting this traffic, I skip all IPv4 packets where the source or destination IP address is an address that’s assigned to the analysis machine itself. Since all of my WAN IPv4 traffic uses network address translation, I only need to count the copy of the packet that uses the public IPv4 address. For IPv6, I deduplicate traffic by keeping a short-term cache of eligible packets and discarding nearly identical copies. When evaluating possible duplicates, I simply ignore the ethernet header and zero out the hop limit field (IPv6’s equivalent of the TTL). A 1-second expiration time prevents this deduplication cache from consuming too much memory.

Neighbor discovery

My network traffic analyzer mostly ignores the source and destination MAC address on captured packets. Since it’s all WAN traffic, the MAC addresses basically just show my router’s WAN port communicating with my ISP’s cable modem termination system. As a result, you can’t tell which of my devices was responsible for the traffic by looking at only the ethernet header. However, you can use the Neighbor Discovery Protocol (NDP) to attribute a MAC address to an IPv6 address, which accomplishes almost the same thing.

For every local IPv6 address observed by my network traffic analysis software, I attempt to probe for its MAC address. I open a UDP socket to the address’s discard port (9) and send a dummy payload. Since the address is local, Linux sends out an NDP neighbor solicitation to obtain the MAC address. If the device responds, then the neighbor advertisement is captured by my packet sniffer and gets decoded. Over time, I built up a catalog of IPv6 and MAC address pairs, which allows me to annotate IPv6 packets with the MAC address of the corresponding device. This doesn’t work for IPv4, of course, but I’m able to achieve a similar effect using static DHCP leases and allocating NAT port ranges for those static leases. This allows me to infer a MAC address for IPv4 packets based on their source port number.

Flow statistics

At this point, you have a stream of decoded packets ready for analysis. I classify each IP packet into a flow based on its IP version, IP protocol, IP addresses, and port numbers (commonly called a “5-tuple”, even though the IP version counts as a 6th member). For each flow, I track the number of transmitted and received packets, along with the cumulative size at layers 3, 4, and 7, the timestamps of the first and last packet, and the 5-tuple fields themselves. Each flow is assigned a UUID and persisted to a PostgreSQL database once a minute. To save memory, I purge flows from the flow tracker if they’re idle. The idle timeout depends on a few factors, including whether the flow is closed (at least one RST or FIN packet), whether the flow was unsolicited, and whether the flow used UDP. Regular bi-directional TCP flows are purged after 15 minutes of inactivity, but if the flow tracker starts approaching its maximum flow limit, it can use special “emergency” thresholds to purge more aggressively.

I also track metrics about network throughput per minute. There’s not much worth saying about that, other than the fact that packets can arrive at the analyzer out of order due to my sniffer’s batch processing. Throughput metrics are also persisted to PostgreSQL once a minute, and they allow me to graph historical bandwidth usage and display realtime throughput on a web interface.

TCP reconstruction

The next step for traffic analysis is to reconstruct entire TCP flows. Since most traffic is encrypted, I’m only really interested in the preamble immediately following connection establishment, which I defined as the first 32 packets of any TCP flow. My TCP tracker keeps track of up to 32 transmitted and received TCP payloads per flow. If all the conditions are met (SYN observed, sequence number established, and non-empty payload in both directions), then I freeze that flow and begin merging the packets. I start by sorting the packets based on sequence number. In theory, TCP packets could specify overlapping ranges (even with conflicting data), but this almost never happens in practice. My TCP reconstruction assumes the best case and generates a copy offset and copy length for each packet, ignoring overlapping portions. Special consideration is required for sequence numbers that overflow the maximum value for a 32-bit integer. Instead of comparing numbers directly, I subtract them and compare the difference with the 32-bit midpoint (0x80000000) to determine their relative order. If there are any gaps in the preamble, then I truncate the stream at the beginning of the gap to ensure I’m only analyzing a true prefix of the payload.
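
The wraparound-safe comparison and the prefix merge can be sketched like this. The names seqLess, segment, and reassemble are invented for the example, and firstSeq is assumed to be the sequence number of the first payload byte (the SYN's sequence number plus one).

```go
package main

import (
	"fmt"
	"sort"
)

// seqLess reports whether sequence number a comes before b, treating the
// 32-bit space as circular. Unsigned subtraction wraps around, so the
// difference is compared against the midpoint 0x80000000.
func seqLess(a, b uint32) bool {
	return a != b && b-a < 0x80000000
}

type segment struct {
	seq     uint32
	payload []byte
}

// reassemble merges segments into the longest gap-free prefix of the
// stream that starts at firstSeq. Overlapping bytes are skipped, and the
// output is truncated at the first gap.
func reassemble(firstSeq uint32, segs []segment) []byte {
	sort.SliceStable(segs, func(i, j int) bool {
		return seqLess(segs[i].seq, segs[j].seq)
	})
	var out []byte
	for _, s := range segs {
		off := s.seq - firstSeq // relative offset, correct across wraparound
		have := uint32(len(out))
		if off > have {
			break // gap: stop here to keep a true prefix
		}
		skip := have - off // bytes already covered by earlier segments
		if skip >= uint32(len(s.payload)) {
			continue // nothing new in this segment
		}
		out = append(out, s.payload[skip:]...)
	}
	return out
}

func main() {
	firstSeq := uint32(0xFFFFFFFE) // stream wraps after two bytes
	segs := []segment{
		{seq: 0x00000001, payload: []byte("lo")},
		{seq: 0xFFFFFFFE, payload: []byte("hel")},
		{seq: 0x00000005, payload: []byte("!")}, // follows a 2-byte gap
	}
	fmt.Printf("%q\n", reassemble(firstSeq, segs))
}
```

The comparison is only meaningful when the sequence numbers being compared are within half the sequence space of each other, which holds easily for a 32-packet preamble.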

Once the TCP flow is reconstructed, I pass the transmitted and received byte payloads to a layer 7 analyzer, chosen based on the port number. Currently, I’ve only implemented analyzers for HTTP (port 80) and TLS (port 443). The HTTP analyzer just decodes the client request and extracts the HTTP host header. The TLS analyzer is described in the next section.

TLS handshake

HTTPS represents a majority of my network traffic, so it seemed worthwhile to be able to decode TLS handshakes. A TLS stream consists of multiple TLS records, which are prefixed with their type, version, and length. My TLS decoder starts by decoding as many TLS records as permitted by the captured bytes, but it stops once it encounters a Change Cipher Spec record since all the subsequent records are encrypted. With any luck, the stream will begin with one or more TLS handshake records, which can be combined to form the TLS handshake messages. I’ve currently only implemented support for decoding the TLS ClientHello, which contains most of the things I care about, such as the protocol version, cipher suites, and most importantly the TLS extensions. From the TLS extensions, I can extract the hostname (Server Name Indication extension) and the HTTP version (Application-Layer Protocol Negotiation extension).
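
The record-level part of this decoding can be sketched as follows. This is a simplified illustration with made-up function names (handshakeBytes, firstMessage). It only splits records and reassembles the first handshake message, and it leaves the ClientHello body (version, cipher suites, extensions) undecoded.

```go
package main

import "fmt"

const (
	recordChangeCipherSpec = 20
	recordHandshake        = 22
	handshakeClientHello   = 1
)

// handshakeBytes walks the TLS records in stream and concatenates the
// contents of handshake records. It stops at a Change Cipher Spec record
// or at the first incomplete record.
func handshakeBytes(stream []byte) []byte {
	var out []byte
	for len(stream) >= 5 { // record header: type(1) version(2) length(2)
		typ := stream[0]
		length := int(stream[3])<<8 | int(stream[4])
		if typ == recordChangeCipherSpec || len(stream) < 5+length {
			break
		}
		if typ == recordHandshake {
			out = append(out, stream[5:5+length]...)
		}
		stream = stream[5+length:]
	}
	return out
}

// firstMessage extracts the first handshake message from the combined
// handshake bytes. Message header: type(1) length(3).
func firstMessage(hs []byte) (msgType uint8, body []byte, ok bool) {
	if len(hs) < 4 {
		return 0, nil, false
	}
	n := int(hs[1])<<16 | int(hs[2])<<8 | int(hs[3])
	if len(hs) < 4+n {
		return 0, nil, false
	}
	return hs[0], hs[4 : 4+n], true
}

func main() {
	// A toy 4-byte "ClientHello" split across two handshake records,
	// followed by Change Cipher Spec and an (encrypted) record.
	stream := []byte{
		22, 3, 1, 0, 5, 1, 0, 0, 4, 0x03, // record 1: message header + 1 body byte
		22, 3, 1, 0, 3, 0x03, 0xAA, 0xBB, // record 2: rest of the body
		20, 3, 3, 0, 1, 1, // change cipher spec
		23, 3, 3, 0, 2, 0xDE, 0xAD, // application data, never reached
	}
	typ, body, ok := firstMessage(handshakeBytes(stream))
	fmt.Println(typ == handshakeClientHello, len(body), ok)
}
```

Handshake messages are allowed to span record boundaries, which is why the records are concatenated before looking at message headers.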

I’m currently not doing very much with this information, other than annotating my flow statistics with the HTTP hostname (extracted either from the Host header or from TLS SNI). In the future, I’d like to support QUIC as well, but the protocol seems substantially harder to decode than TLS is.

Closing thoughts

I think I have a much stronger grasp on network protocols after finishing this project. There are a lot of subtleties with the protocols we rely on every day that aren’t obvious until you need to decode them from raw bytes. During development, I decoded a lot of packets by hand to confirm my understanding. That’s how I learned about GREASE, which I initially assumed was my own mistake. Now, I regularly check my network traffic analyzer’s database to debug network issues. I’ve also been able to double check my ISP’s monthly bandwidth cap dashboard using my own collected data. It’s also a really convenient place to run tcpdump as root, just to check my understanding or to test out experiments. If you’re interested in computer networking, I encourage you to try this out too.

Not a lawyer. Not legal advice. ↩︎
Since I’m only monitoring WAN traffic, a gigabit port is sufficient. If I monitored LAN traffic too, then the amount of mirrored traffic could easily exceed the line rate of a single gigabit port. ↩︎
The raw socket API also provided this functionality, but it only supported IPv4 and not IPv6. However, you’ll still hear this term commonly being used to describe this general technique. ↩︎
At maximum speed, my internet connection takes about 3.6 seconds to fill a 128MiB buffer. ↩︎
This typically works, but I occasionally spot packets with IP addresses owned by mobile broadband ISPs (T-Mobile and Verizon) from my smartphones being routed on my Wi-Fi network for some reason. ↩︎

Hisashiburi

Hey reader, it’s been a long time. I hope you’re doing well. You haven’t heard from me in a while, but it’s not because I’ve been busy. If anything, I’ve had more free time this year than ever before. I recently celebrated three hundred days of sheltering in place. In a couple of months, it’ll have been a full year. How long will it be before things return to normal? I’m looking forward to lots of things, including fetching my lightning charger dock from the office, eating fast food instead of cooking all the time, and for thousands of people to stop dying in overwhelmed hospitals (but mostly the fast food). On the other hand, I’ve felt grateful for lots of things during this pandemic. I’m grateful that my comfy job lets me easily work from home. I’m grateful for the twenty 3M N95 respirators I bought three years ago after those California wildfires but never used. I’m grateful for the big TV I bought on a whim in January that I now watch for hours every day, cooped up indoors. A few more things have changed in the months since I last gave you an update. I’ve been slowly taking on more responsibility at work. It’s been lots of fun, but I’ve also noticed some unexpected consequences. I’ve been watching way more anime, and I’ve started watching seiyuu content on YouTube. I thought that my ability to understand raws without subtitles would naturally improve over time, but that hasn’t really worked out. I’ve still been cooking all my meals at home, and I’ve only given myself diarrhea once! I’ve finished two new personal programming projects, one of which I wrote about last July and one that I’ll write about next. Since I’ve stopped nonessential travel, I don’t have any new photos to share, but please enjoy these old vacation photos from last year while I tell you about some more of my thoughts in detail.

Dawn at Fuki no Mori in Nagano Prefecture, Japan.

Moralization of social withdrawal

One of the weirdest long-term effects of the pandemic might be the moralization of social withdrawal. As imperfect humans, we rely heavily on mental shortcuts to quickly remember things and make judgements. This year, we’ve seen health officials all around the world encouraging social distancing while at the same time viral videos of violent anti-maskers and superspreader events circulated online. Transmission rates were out of control and there was no vaccine in sight, so naturally we condemned those people for their ignorance and selfishness while praising people responsibly staying at home. But then public health orders kept changing while enforcement remained virtually absent. People, weary of restrictions, took it upon themselves to decide what was and wasn’t safe. At the time, society needed to strike a balance between safety, freedom, and economic reality, to be communicated clearly, applied nationally, and enforced universally. Instead, what we got were pointless rules with no real consequences and a spectrum of people with varying degrees of personal responsibility and risk tolerance. So, the condemnation inevitably became increasingly detached from reality and transformed into a general attack on social interaction and having fun. It’ll take a while for us to collectively remember that being close to other people isn’t inherently selfish. This is already true today in countries where transmission rate is minimal and mask acceptance is nearly universal.

On the other side of this, it’s never been a better time to be a shut-in. The pandemic provided a perfect reason to spend every weekend indoors, and for those of us already doing that, this global phenomenon was frankly a relief. Personally, I feel like it might come as a shock if things return to normal faster than expected. There are a lot of things I postponed because of the pandemic, including a long-distance move and a general plan to change my lifestyle. There are also lots of things that only became possible because of the pandemic, like endless amounts of leisure time, an instant commute, and freedom from wearing pants. I ought to start preparing to move on, but these worries are also part of what makes this long quarantine even tolerable.

Should should be considered harmful

The word “should” can either mean “is expected to” or “ought to, but currently does not”. As an example, “it should work” can be interpreted both ways. Ever since someone pointed this out to me, I’ve been more and more frustrated with the ambiguity of should, so I’ve sworn to stop using it wherever possible. Instead, I’ll replace it with “expect” or “ought to” or something similar. But as I’m writing this, I realize that “ought” has the same problem to some extent.

Technical writing is an uncommon skill

Bug reports, incident reports, design specifications, and other technical documents. They’re commonplace in my work, and I’ve been reading many more of them lately. I’ve come to realize that writing concise and comprehensible technical prose, without ambiguity or misunderstanding, is sort of an uncommon skill. While I was a teaching assistant in college, I frequently wrote or reviewed exam questions and project specifications. For every possible ambiguity or misunderstanding in these documents, we’d get a dozen annoying students asking for clarification, so we quickly learned to write things clearly in the first place. Every time I write technical prose at work, I still rely on principles I learned during those years. I’ve never actually formalized those principles before, so here are a few thoughts off the top of my mind:

Start from a place of shared understanding. At the beginning of a document, it’s sometimes useful to assume your reader has no idea what you’re talking about. Before jumping into the details, introduce your objective, lay down some background, and then work your way toward the main topic.
Be concise by deleting all the unnecessary text. If you can’t do that, then at least separate the important text from the background information into separate paragraphs or subsections.
Be redundant when it comes to unusual facts or important points. Saying something twice, either by providing an example or just paraphrasing, helps ensure that the reader doesn’t miss anything critical.
Complex topics need to be introduced like a narrative. If you’re describing a problem that has twelve contributing factors, you need to give some thought to the order in which they’re introduced. Otherwise, a jumbled list of facts just confuses everybody.
Avoid acronyms, or explain them the first time you use them.
Avoid mentioning facts if it’s unclear why they’re relevant.
Ground your writing in reality. If you describe software architecture in abstract terms like “talks to” or “powered by”, it can be hard for the reader to develop an accurate understanding of how the system works. We have technical terms like “pushes to” and “persists state” that are more descriptive without forcing you to provide too much detail.
Avoid making up new words. If you’re writing about an existing system, be consistent and use the same terminology used in other documents. Also, if you’re writing about a named product or piece of software, look up how its name is stylized in terms of capitalization and punctuation. It may be different than what you’re used to, and using the wrong stylization looks unprofessional.
Be precise with your exact words. Reread your statements, and if you can imagine a “well, technically…” that makes it untrue, then consider addressing it preemptively. Noncommittal terms like “is designed to” or “typically” can help make things technically true. Also beware of sentences that can be interpreted in more than one way. Technical readers can be hyperliteral, which is sometimes unavoidable while trying to understand complex things.
Provide references. If someone wants to verify the claims you wrote, adding links can help make this easier.

Most of all, it’s important to realize that the quality of technical writing is more than just getting all the facts right. Developing technical writing skills also helps you become a better technical reader, which allows you to provide useful feedback on documents as well as decipher poorly written questions.

Bowl of ramen at Ichiran in Kyoto.

Rich people have stinky toilets

Wealthy people can have a really twisted perspective on the value of time. Consider high earners, for example. If you normally get paid a hundred dollars an hour, then a half-hour trip to the grocery store might cost you $50 in time and $50 for the groceries themselves. When the cashier rings up your papaya at $2.49 instead of the advertised $1.99, it could actually cost you more in opportunity cost to correct the error if it means you spend even one extra minute in the checkout line¹. Lots of problems can be solved with money, and the time-money equivalence frees wealthy people from worrying about small expenditures. However, it also makes some kinds of activities entirely impossible, or rather uneconomical for them to do.

Thrift shopping takes more time. The store locations are usually far away, and searching the random assortment of mismatched sizes for something you like that also fits can take forever. Yet, lots of people still buy clothes this way. As purely a method of obtaining clothing, this approach makes no sense for a high earner. Some people justify thrifting by pointing out the variety of choices is far greater than at regular clothing stores or online, which is a totally valid reason. Combining tangible and intangible value can help rationalize decisions like thrifting that are uneconomical based on time and money alone. Adding intangible value to activities is actually quite common. Without these factors, we’d basically have no reason to do creative hobbies like cooking or woodworking. But what about chores, like cleaning the bathroom? Cleanliness has some intrinsic value, but as a person’s hourly earnings increase, they need to find stronger and stronger intangible value to justify mundane tasks. Either that, or rich people all have stinky toilets.

Well, that’s not quite right. Plenty of wealthy people make money passively even when they’re not actively working, so this basically only applies to high-earning wage slaves with no investments. Another counterargument is that rich people hire cleaning services, of course. However, this kind of thinking doesn’t only apply to monetary wealth.

Consider someone who’s a high earner in terms of productivity, like a big shot at a huge company or a technical director in charge of a popular product. Their usual day to day work can generate massive value for a company, simply by making important decisions or providing direction to the people reporting up to them. However, wielding this amount of influence usually has the unfortunate side effect of completely breaking their ability to do any technical work themselves. It’s not that technical leaders forget the skills required to be individual contributors. Rather, the experience of having once wielded massive influence just saps a person’s motivation to do basic work. A successful engineering team needs both individual contributors and leaders, but the latter role feels so much more impactful. In comparison, the grunt work of programming and getting a new thing to work can feel really unfulfilling.

I’ve felt the effects of this myself over the last decade or so, but to a lesser extent of course. As I pick up more technical skills and experience, my time-productivity exchange rate increases, which makes me less and less motivated to do regular programming work. For a long time, I used the same “intangible value” trick to compensate. Projects like Hubnext and Ef took months of work to complete, but I justified them as points of personal pride. I also poured months into “infrastructure” projects like my Bazel monorepo and my custom service discovery and RPC system, with the understanding that these projects would make it easier to do other future projects. At this point, I’ve already completed most of the long-term projects I wanted to do, and it’s really difficult to motivate myself to start any new ones. There are so many things I could work on, and so I end up working on none of them. I think it’s kind of similar to decision paralysis or writer’s block. No individual thing seems worthwhile by itself, so I end up just giving up and doing something else².

Fabricating intangible value is not sustainable, so how do we fix this? I want to do personal projects again, but unlike money, there’s nothing equivalent to “investing” when it comes to technical work. The only other solution I can imagine is to keep oneself poor. By devaluing time, it’s easier to justify doing time-consuming activities with no clear benefit. I think this is why people who receive windfalls like lottery wins sometimes end up miserable. The time-money exchange rate suddenly spikes and then drops back to normal, which is really disruptive to a person’s motivational energy. Now that I’m aware of this, I think I’ll be more wary of changes in the future. My own toilet is starting to stink.

Sunset from Odaiba Marine Park in Tokyo.

Anime reviews

Since the start of the pandemic, I’ve rewatched seven out of the nine entries on my anime “favorites” list. Of those, I removed two, but the five that remain now remain with redoubled confidence. I plan to finish my reevaluation by the end of the pandemic, but it sounds like we still have at least a few more months to go, so I’ll pace myself with Koe no Katachi and Neon Genesis Evangelion. It’s been a long-time goal of mine to periodically reevaluate my anime ratings for accuracy, especially for TV anime. I don’t like changing scores without rewatching first, so I’m glad I’ve had so much time this year to do exactly that. On rewatch, I’ve been focusing especially hard on listening to the original dialogue rather than the translated subtitles. I think shows like Oregairu have so much nuance in the phrasing of lines that gets lost in translation, and the Monogatari series features sentences so long that the translations sometimes don’t make sense when taken piecemeal. I’ve also been trying different viewing environments. Since a lot of anime is mastered in 720p resolution and designed with typical TVs in mind, I think a lot of fancy new TV tech like full-array local dimming and HDR video is unnecessary and only detracts from the format. I’ve tried a combination of my TV, desktop monitor, laptop, and iPhone, along with either speakers or headphones. For most content, I think the low noise floor provided by noise-cancelling headphones makes a big difference³, but picture quality honestly seems about the same regardless of viewing environment.

I’ve also been thinking hard about my anime rating system. At a glance, my scores might look overinflated with most shows receiving a nine or ten. But I’d argue that the conventional rating system, in which most shows score between seven and eight, makes even less sense. If you rate anime purely by matching the average community-wide rating as closely as possible, you’d end up with something like:

10 – Unachievable
9 – Reserved for Fullmetal Alchemist: Brotherhood
8 – Every other great show
7 – Average
6 or below – Varying degrees of garbage

A scoring system needs to be granular enough to represent a variety of scores, but coarse enough to be defensible. It’s hard for people to rate very consistently across hundreds of scores, so I think three or four “equivalence classes” is the limit for most people. Then, you might as well put those scores at the top of the scale (like, eight through ten) instead of arbitrarily in the middle. Personally, I rarely finish watching a show that’s trending toward a seven or below, and I generally only rate shows I’ve finished watching. Perhaps my scores only seem high because they don’t include shows I’ve dropped or just never started watching. My scores approximately represent:

10 – Consistently incredible. I’d be happy to watch this again or recommend it to others.
9 – Really great, but falls short for one or more reasons.
8 – I finished watching this, but probably wouldn’t do it again.
7 – I regret finishing this.

I can usually predict a show’s eventual score from the first episode alone, if not the first few minutes. Nowadays, I usually stop watching a show if it’s trending toward an eight or below, but there’s also plenty of shows on my “paused” list that likely would have received a nine if I finished watching them. The decision between a nine and a ten is usually fairly obvious to me by the end of a show. There’s a variety of reasons I withhold a perfect score: a lapse in quality mid-cour, an inexcusably boring arc, a lack of thematic cohesion, or sometimes, there simply isn’t anything in a show that makes me want to watch it again⁴. My favorites, in contrast, are entirely orthogonal to my scoring. I don’t consider my favorites list as an “eleven” rating, but instead as a short list of shows and movies that I’ve personally developed a connection with and that exemplify the aspects of anime I care about the most.

The morality of IPv6

I’ve noticed that nothing gets people angry quite like refusing to support IPv6. At first, this seems quite odd, since no other technical standard carries such a sense of self-righteousness. But upon further inspection, I’d argue that adoption of IPv6 is perhaps more of a moral issue than most others:

Adoption is still totally optional, from a practical perspective. There are benefits to IPv6 like simplified address management and globally routable addresses, but most systems continue to work perfectly fine without it.
Adoption requires work that would otherwise be unnecessary. Today, adoption means dual stack, which means a second set of addresses, firewall rules, routing tables, and DNS entries, and a host of new protocols and standards like NDP and SLAAC to understand.
Lack of adoption disproportionately affects the less fortunate, including mobile users and developing nations. In the United States, where a unique IPv4 address assigned per residence is the norm, IPv4 exhaustion seems like a distant problem.
Lack of adoption negatively affects others. Not every technical standard has these “network effects”, where it only gets better as more people adopt the standard. IPv6 hesitation means the world will likely need to be dual stack for a very long time, forcing IPv6-only networks to deploy stopgaps like NAT64.

My own home network could almost operate entirely as IPv6-only. Some of my custom-made software relies entirely on IPv6, including my IPv6-only DDNS system and my network traffic monitor that classifies flows by decoding NDP neighbor advertisements. The remaining IPv4 connections are mostly for Reddit, random Apple services (from my iPhone), and occasionally video streaming services that use IPv4-only content delivery networks.

Conclusion

Anyway, I’ve rambled on long enough now. Thanks for listening while I got all these topics out of the way. In case you hadn’t noticed, yes, today was just a bunch of blog topics that never fully developed on their own. I hope you’re doing well, and happy new year. After 2020, it can’t possibly get any worse, can it?

This kind of reasoning is the whole reason stupidly wasteful things like private planes for corporate executives exist. ↩︎
The one exception has been a few spurts of work on my personal Chromecast app that I rely on regularly to watch anime, where the immediate payoff still exceeds the time invested. ↩︎
But I still prefer speakers for sports anime like Haikyu and lighthearted comedy like Gintama. ↩︎
I’ve watched plenty of shows just for the moe, so I won’t pretend like my scores are that rigorous. ↩︎

Undeletable data storage

I buy stuff when I’m bored. Sometimes, I buy random computer parts to spark project ideas. This usually works really well, but when I bought a 2TB external hard drive the week before Thanksgiving, I wasn’t really sure what to do with it. At the time, I was in college, and two terabytes was way more storage than I had in my laptop, but I knew that hard drives weren’t the most reliable, so I wasn’t about to put anything important on it. There just wasn’t that much I could do with only one hard drive. I ended up writing a multithreaded file checksum program designed to max out the hard drive’s sequential throughput, and that was that. Years later, after having written a few custom file backup and archival tools, I decided it’d be nice to have an “air gapped” copy of my data on an external hard drive. Since an external hard drive could be disconnected most of the time, it would be protected from accidents, software bugs, hacking, lightning, etc. Furthermore, it didn’t matter that I only had a single hard drive. The data would be an exact replica of backups and archives that I had already stored elsewhere. So, I gave it a try. Once a month, I plugged in my external hard drive to sync a month’s worth of new data to it. But after a few years of doing this, I decided that I had had enough of bending down and fumbling with USB cables. It was time for a better solution.

Essentially, I wanted the safety guarantees of an air gapped backup without the hassle of physically connecting or disconnecting anything. I wanted it to use hard drives (preferably the 3.5” kind), so that I could increase my storage capacity inexpensively. The primary use case would be to store backup archives produced by Ef, my custom file backup software. I wanted it to be easily integrated directly into my existing custom backup software as a generic storage backend, unlike the rsync bash scripts I was using to sync my external hard drive. There were lots of sleek-looking home NAS units on Amazon, but they came bundled with a bunch of third-party software that I couldn’t easily overwrite or customize. Enterprise server hardware would have given me the most flexibility, but it tends to be noisy and too big to fit behind my TV where the rest of my equipment is. I also considered getting a USB hard drive enclosure with multiple bays and attaching it to a normal computer or Raspberry Pi, but that would have made it awkward to add capacity, and the computer itself would have been a single point of failure.

Luckily, I ended up finding a perfect fit in the ODROID-HC2 ($54), which is a single board computer with an integrated SATA interface and 3.5” drive bay. Each unit comes with a gigabit ethernet port, a USB port, microSD card slot, and not much else, not even a display adapter. The computer itself is nothing special, given how many ARM-based mini computers exist today, but the inclusion of a 3.5” drive bay was something I couldn’t find on any other product. This computer was made specifically to be a network storage server. It not only allowed a hard drive to be attached with no additional adapters, but also physically secured the drive in place with the included mounting screws and its sturdy aluminum enclosure. So, I bought two units and got started designing and coding a new network storage server. Going with my usual convention of single-letter codenames, I named my new storage service “D”. In case you’re interested in doing the same thing, I’d like to share some of the things I learned along the way.

Undeletable

On D, data can’t be deleted or overwritten. I used external hard drives for years because their data could only be modified while they were actively plugged in. In order to achieve the same guarantees with a network-connected storage device, I needed to isolate it from the rest of my digital self. This meant that I couldn’t even give myself SSH login access. So, D is designed to run fully autonomously and never require software updates, both for the D server and for the operating system. That way, I could confidently enforce certain guarantees in software, such as the inability to delete or overwrite data.

This complete isolation turned out to be more difficult than I expected. I decided that D should consist of a dead simple server component and a complex client library. By putting the majority of the code in the client, I could apply most kinds of bug fixes without needing to touch the server. There are also some practical problems with running a server without login access. I ended up adding basic system management capabilities (like rebooting and controlling hard drive spindown) into the D server itself. The two HC2 units took several weeks to ship to my apartment, so I had plenty of time to prototype the software on a Raspberry Pi. I worked out most of the bugs on the prototype and a few more bugs on the HC2 device itself. Each time I built a new version of the D server, I had to flash the server’s operating system image to the microSD card with the updated software¹. It’s now been more than two months since I deployed my two HC2 units, and I haven’t needed to update their software even once.

On-disk format

The D protocol allows reading, writing, listing, and stating of files (“chunks”) up to 8MiB in size. Each chunk needs to be written in its entirety, and chunks can’t be overwritten or deleted after they’re initially created². The D server allows chunks to be divided into alphanumeric directories, whose names are chosen by the client. A chunk’s directory and file name are restricted in both character set and length in order to avoid any potential problems with filesystem limits. My existing backup and archival programs produce files that are hundreds of megabytes or gigabytes in size, so the D client is responsible for splitting up these files into chunks before storing them on a D server. On the D server, the hard drive is mounted as /d_server, and all chunks are stored relative to that directory. There are also some additional systemd prerequisites that prevent the D server from starting if the hard drive fails to mount.

For each D file, the D client stores a single metadata chunk and an arbitrary number of data chunks. Metadata chunks are named based on the D file’s name in such a way that allows prefix matching without needing to read the metadata of every D file, while still following the naming rules. Specifically, any special characters are first replaced with underscores, and then the file’s crc32 checksum is appended. For example, a file named Ef/L/ef1567972423.pb.gpg might store its metadata chunk as index/Ef_L_ef1567972423.pb.gpg-8212e470. Data chunks are named based on the file’s crc32, the chunk index (starting with 0), and the chunk’s crc32. For example, the first data chunk of the previously mentioned file might be stored as 82/8212e470-00000000-610c3877.
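As a rough sketch (not the actual D client code), the naming rules above could be written in Go like this. The set of characters treated as “safe” and the hex formatting of the chunk index are assumptions; the sketch only reproduces the two example names given above.

```go
package main

import (
	"fmt"
	"strings"
)

// sanitize replaces every character outside an assumed safe set with '_'.
// The exact set D allows is not stated in the post; this one is a guess.
func sanitize(name string) string {
	return strings.Map(func(r rune) rune {
		switch {
		case r >= 'a' && r <= 'z', r >= 'A' && r <= 'Z', r >= '0' && r <= '9':
			return r
		case r == '.', r == '-', r == '_':
			return r
		}
		return '_'
	}, name)
}

// metadataChunkName builds "index/<sanitized name>-<file crc32>".
func metadataChunkName(fileName string, fileCRC uint32) string {
	return fmt.Sprintf("index/%s-%08x", sanitize(fileName), fileCRC)
}

// dataChunkName builds "<shard>/<file crc32>-<chunk index>-<chunk crc32>",
// where the shard is the first two hex digits of the file's crc32.
func dataChunkName(fileCRC, index, chunkCRC uint32) string {
	file := fmt.Sprintf("%08x", fileCRC)
	return fmt.Sprintf("%s/%s-%08x-%08x", file[:2], file, index, chunkCRC)
}

func main() {
	fmt.Println(metadataChunkName("Ef/L/ef1567972423.pb.gpg", 0x8212e470))
	fmt.Println(dataChunkName(0x8212e470, 0, 0x610c3877))
}
```

Because the sanitized file name comes first, a prefix such as Ef/L/ maps to the chunk-name prefix index/Ef_L_, which is what allows prefix matching from a directory listing alone.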

Wire protocol

The D client and server communicate using a custom protocol built on top of TCP. This protocol was designed to achieve high read/write throughput, possibly requiring multiple connections, and to provide encryption in transit. It’s similar to TLS, with the notable exception that it uses pre-shared keys instead of certificates. Since the D server can’t be updated, certificate management would just be an unnecessary hassle. Besides, all the files I planned to store on D would already be encrypted. In fact, I originally designed the D protocol with only integrity checking and no encryption. I later added encryption after realizing that in no other context would I transfer my (already encrypted) backups and archives over a network connection without additional encryption in transit. I didn’t want D to be the weakest link, so I added encryption even if it was technically unnecessary.

The D protocol follows a request-response model. Each message is type-length-value (TLV) encoded alongside a fixed-length encryption nonce. The message payload uses authenticated encryption, which ensures integrity in transit. Additionally, the encryption nonce is used as AEAD associated data and, once established, needs to be incremented and reflected on every request-response message. This prevents replay attacks, assuming sufficient randomness in choosing the initial nonce. The message itself contains a variable-length message header, which includes the method (e.g. Read, Write, NotifyError), a name (usually the chunk name), a timestamp, and the message payload’s crc32 checksum. The message header is encoded using a custom serialization format that adds type checking, length information, and checksums for each field. Finally, everything following the message header is the message payload, which is passed as raw bytes directly up to either the D client or server.
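To make the framing idea concrete, here is a minimal Go sketch of a TLV frame that carries a nonce and an AEAD-sealed body, with the nonce doubling as associated data. Everything specific is an assumption for illustration: the 1-byte type, the 4-byte length, the 12-byte nonce, and the use of AES-128-GCM from the standard library as the AEAD. The real D header format is not shown in this post.

```go
package main

import (
	"bytes"
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"encoding/binary"
	"errors"
	"fmt"
)

const nonceSize = 12 // standard GCM nonce length

// newAEAD wraps a pre-shared key in AES-GCM (a stand-in AEAD for this sketch).
func newAEAD(key []byte) (cipher.AEAD, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// incrementNonce treats the nonce as a big-endian counter and adds one.
func incrementNonce(n []byte) {
	for i := len(n) - 1; i >= 0; i-- {
		n[i]++
		if n[i] != 0 {
			return
		}
	}
}

// sealFrame encodes: type (1 byte) | length (4 bytes) | nonce | ciphertext.
// The nonce is also passed as associated data, as described above.
func sealFrame(aead cipher.AEAD, msgType byte, nonce, plaintext []byte) []byte {
	ct := aead.Seal(nil, nonce, plaintext, nonce)
	frame := make([]byte, 5, 5+len(nonce)+len(ct))
	frame[0] = msgType
	binary.BigEndian.PutUint32(frame[1:5], uint32(len(nonce)+len(ct)))
	frame = append(frame, nonce...)
	return append(frame, ct...)
}

// openFrame rejects frames whose nonce is not the one expected next,
// which is what stops a captured frame from being replayed later.
func openFrame(aead cipher.AEAD, frame, wantNonce []byte) (byte, []byte, error) {
	if len(frame) < 5+nonceSize {
		return 0, nil, errors.New("short frame")
	}
	if int(binary.BigEndian.Uint32(frame[1:5])) != len(frame)-5 {
		return 0, nil, errors.New("length mismatch")
	}
	nonce := frame[5 : 5+nonceSize]
	if !bytes.Equal(nonce, wantNonce) {
		return 0, nil, errors.New("unexpected nonce (possible replay)")
	}
	pt, err := aead.Open(nil, nonce, frame[5+nonceSize:], nonce)
	return frame[0], pt, err
}

func main() {
	key := make([]byte, 16) // all-zero demo key; a real pre-shared key must be random
	aead, err := newAEAD(key)
	if err != nil {
		panic(err)
	}
	nonce := make([]byte, nonceSize)
	if _, err := rand.Read(nonce); err != nil {
		panic(err)
	}
	frame := sealFrame(aead, 1, nonce, []byte("hello"))
	_, pt, err := openFrame(aead, frame, nonce)
	fmt.Println(string(pt), err)

	incrementNonce(nonce) // both sides advance after each message
	_, _, err = openFrame(aead, frame, nonce)
	fmt.Println(err) // the old frame no longer matches the expected nonce
}
```

Note that the type byte sits outside the authenticated data in this sketch; a stricter design would fold the whole header into the associated data.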

File metadata

Each D file has a single metadata chunk that describes its contents. The metadata chunk contains a binary-encoded protobuf wrapped in a custom serialization format (the same one that’s used for the message header). The protobuf includes the file’s original name, length, crc32 checksum, sha256 hash, and a list of chunk lengths and chunk crc32 checksums. I store the file’s sha256 hash in order to more easily integrate with my other backup and archival software, most of which already uses sha256 hashes to ensure file integrity. On the other hand, the file’s crc32 checksum is used as a component of chunk names, which makes its presence essential for finding the data chunks. Additionally, crc32 checksums can be combined, which is a very useful property that I’ll discuss in the next section.

Integrity checking at every step

Since D chunks can’t be deleted or overwritten, mistakes are permanent. Therefore, D computes and verifies crc32 checksums at every opportunity, so corruption is detected and blocked before it is committed to disk. For example, during a D chunk write RPC, the chunk’s crc32 checksum is included in the RPC’s message header. After the D server writes this chunk to disk, it reads the chunk back from disk³ to verify that the checksum matches. If a mismatch is found, the D server deletes the chunk and returns an error to the client. This mechanism is designed to detect hardware corruption on the D server’s processor or memory. In response, the D client will retry the write, which should succeed normally.
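The write path described above might look roughly like the following Go sketch. It is not the real D server code: the function names, file mode, and error values are made up for illustration. It creates the file with O_EXCL so an existing chunk can never be overwritten, reads the chunk back, and removes it if the Castagnoli crc32 doesn’t match. It also folds in the rule, mentioned a little later, that rewriting an identical chunk is treated as a no-op.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"hash/crc32"
	"os"
	"path/filepath"
)

var castagnoli = crc32.MakeTable(crc32.Castagnoli)

var errChecksum = errors.New("checksum mismatch after read-back")

// chunkCRC returns the crc32 (Castagnoli) checksum of a chunk.
func chunkCRC(data []byte) uint32 { return crc32.Checksum(data, castagnoli) }

// writeChunk creates path exclusively, writes data, then reads it back and
// compares the checksum against wantCRC. A mismatched chunk is removed.
func writeChunk(path string, data []byte, wantCRC uint32) error {
	f, err := os.OpenFile(path, os.O_WRONLY|os.O_CREATE|os.O_EXCL, 0o644)
	if err != nil {
		if errors.Is(err, os.ErrExist) {
			// Same name and same payload: treat the write as already done.
			if old, rerr := os.ReadFile(path); rerr == nil && bytes.Equal(old, data) {
				return nil
			}
		}
		return err
	}
	_, err = f.Write(data)
	if err == nil {
		err = f.Sync()
	}
	if cerr := f.Close(); err == nil {
		err = cerr
	}
	if err == nil {
		var stored []byte
		if stored, err = os.ReadFile(path); err == nil && chunkCRC(stored) != wantCRC {
			err = errChecksum
		}
	}
	if err != nil {
		os.Remove(path) // only chunks that never became valid are removed
	}
	return err
}

func main() {
	dir, err := os.MkdirTemp("", "chunks")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	data := []byte("example chunk")
	path := filepath.Join(dir, "chunk0")
	fmt.Println(writeChunk(path, data, chunkCRC(data))) // first write
	fmt.Println(writeChunk(path, data, chunkCRC(data))) // identical rewrite
	fmt.Println(writeChunk(path, []byte("other"), 0))   // refused: name is taken
}
```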

The D client also includes built-in integrity checking. The D client’s file writer requires that its callers provide the crc32 checksum of the entire file before writing can begin. This is essential not only because chunk names are based on the file’s crc32 checksum, but also because the D client writer keeps a cumulative crc32 checksum of the data chunks it writes. This cumulative checksum is computed using the special crc32 combine algorithm, which takes as input two crc32 checksums and the length of the second checksum’s input. If the cumulative crc32 checksum does not match the provided crc32 checksum at the time the file writer is closed, then the writer will refuse to write the metadata chunk and instead return an error. It’s critical that the metadata chunk is the last chunk to be written because until the file metadata is committed to the index, the file is considered incomplete.
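Go’s standard library doesn’t ship a crc32 combine function, so here is a sketch of one: a port of the matrix-based approach used by zlib’s crc32_combine, with the polynomial swapped for the Castagnoli one. I’m assuming this is equivalent to what the D client does; the function names are mine.

```go
package main

import (
	"fmt"
	"hash/crc32"
)

var castagnoli = crc32.MakeTable(crc32.Castagnoli)

// crc32c returns the Castagnoli crc32 of data.
func crc32c(data []byte) uint32 { return crc32.Checksum(data, castagnoli) }

// gf2MatrixTimes multiplies a 32x32 GF(2) matrix by a 32-bit vector.
func gf2MatrixTimes(mat *[32]uint32, vec uint32) uint32 {
	var sum uint32
	for i := 0; vec != 0; i, vec = i+1, vec>>1 {
		if vec&1 != 0 {
			sum ^= mat[i]
		}
	}
	return sum
}

// gf2MatrixSquare stores mat*mat in square.
func gf2MatrixSquare(square, mat *[32]uint32) {
	for n := range mat {
		square[n] = gf2MatrixTimes(mat, mat[n])
	}
}

// combine returns the crc32 of A+B given crc1 = crc(A), crc2 = crc(B),
// and len2 = len(B), without touching the underlying bytes.
func combine(crc1, crc2 uint32, len2 int64) uint32 {
	if len2 <= 0 {
		return crc1
	}
	var even, odd [32]uint32
	odd[0] = crc32.Castagnoli // operator for appending one zero bit
	row := uint32(1)
	for n := 1; n < 32; n++ {
		odd[n] = row
		row <<= 1
	}
	gf2MatrixSquare(&even, &odd) // two zero bits
	gf2MatrixSquare(&odd, &even) // four zero bits
	// Apply len2 zero bytes to crc1, squaring the operator for each bit of len2.
	for {
		gf2MatrixSquare(&even, &odd)
		if len2&1 != 0 {
			crc1 = gf2MatrixTimes(&even, crc1)
		}
		len2 >>= 1
		if len2 == 0 {
			break
		}
		gf2MatrixSquare(&odd, &even)
		if len2&1 != 0 {
			crc1 = gf2MatrixTimes(&odd, crc1)
		}
		len2 >>= 1
		if len2 == 0 {
			break
		}
	}
	return crc1 ^ crc2
}

func main() {
	chunks := [][]byte{[]byte("first chunk, "), []byte("second chunk, "), []byte("third chunk")}
	var cumulative uint32 // crc32 of the empty string is 0
	var whole []byte
	for _, c := range chunks {
		cumulative = combine(cumulative, crc32c(c), int64(len(c)))
		whole = append(whole, c...)
	}
	fmt.Println(cumulative == crc32c(whole))
}
```

The useful property is that the writer only needs each chunk’s checksum and length, both of which it already has, to keep a running checksum of the whole file.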

Because the D client’s file writer requires a crc32 checksum upfront, most callers end up reading an uploaded file twice: once to compute the checksum and once to actually upload the file. The file writer’s checksum requirements are designed to detect hardware corruption on the D client’s processor or memory. This kind of corruption is recoverable. For example, if data inside a single chunk is corrupted before it reaches the D client, then the file writer can simply reupload it with its corrected checksum. Since the chunk checksum is included in the chunk name, the corrupt chunk does not need to be overwritten. Instead, the corrupt chunk will just exist indefinitely as garbage on the D server. Additionally, the D server ignores writes where the chunk name and payload exactly match an existing chunk. This makes the file writer idempotent.

The D client’s file reader also maintains a cumulative crc32 checksum while reading chunks using the same crc32 combine algorithm used in the file writer. If there is a mismatch, then it will return an error upon close. The file reader can optionally check the sha256 checksum as well, but this behavior is disabled by default because of its CPU cost.

List and stat

In addition to reading and writing, the D server supports two auxiliary file RPCs: List and Stat. The List RPC returns a null-separated list of file names in a given directory. The D server only allows chunks to be stored one directory deep, so this RPC can be used as a way to iteratively list every chunk. The response of the List RPC is also constrained by the 8MiB message size limit, but this should be sufficient for hundreds of thousands of D files and hundreds of terabytes of data chunks, assuming a uniform distribution using the default 2-letter directory sharding scheme and a predominant 8MiB data chunk size. The D client library uses the List RPC to implement a Match method, which returns a list of file metadata protobufs, optionally filtered by a file name prefix. The Match method reads multiple metadata chunks in parallel to achieve a speed of thousands of D files per second on a low latency link.
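The “hundreds of terabytes” figure can be sanity-checked with some quick arithmetic. This is only a back-of-the-envelope sketch in Go. It assumes that a List response contains bare chunk file names shaped like the earlier example (26 characters plus a null separator) and that 2-letter sharding over hex digits gives 256 directories.

```go
package main

import "fmt"

const (
	maxMessage  int64 = 8 << 20 // 8MiB List response limit
	chunkSize   int64 = 8 << 20 // predominant data chunk size
	directories int64 = 256     // assumption: two hex digits of sharding
	// Length of one data chunk name, taken from the earlier example.
	nameLen = int64(len("8212e470-00000000-610c3877"))
)

// namesPerDirectory is how many null-terminated names fit in one response.
func namesPerDirectory() int64 { return maxMessage / (nameLen + 1) }

// maxDataTiB is the total data addressable before any directory listing
// would overflow a single List response, in whole TiB.
func maxDataTiB() int64 {
	return (namesPerDirectory() * directories * chunkSize) >> 40
}

func main() {
	fmt.Println("chunk names per directory:", namesPerDirectory())
	fmt.Println("addressable data (TiB):", maxDataTiB())
}
```

Under those assumptions, each directory can list about 310 thousand chunks, which works out to roughly 600TiB of 8MiB chunks across all directories.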

The Stat RPC returns the length and crc32 checksum of a chunk. Despite its name, the Stat method is rarely ever used. Note that stating a D file is done using the Match method, which reads the file’s metadata chunk. In contrast, the Stat RPC is used to stat a D chunk. Its main purpose is to allow the efficient implementation of integrity scrubbing without needing to actually transfer the chunk data over the network. The D server reads the chunk from disk, computes the checksum, and returns only the checksum to the D client for verification.

Command line interface

The D client is primarily meant to be used as a client library, but I also wrote a command line tool that exposes most of the library’s functionality for my own debugging and maintenance purposes. The CLI is named dt (D tool), and it can upload and download files, match files based on prefix, and even replicate files from one D server to another. This replication is done chunk-by-chunk, which skips some of the overhead cost associated with using the D client’s file reader and file writer. Additionally, the CLI can check for on-disk corruption by performing integrity scrubs, either using the aforementioned Stat RPC or by slowly reading each chunk with the Read RPC. It can also find abandoned chunks, which could result from aborted uploads or chunk corruption. Finally, the CLI also performs some system management functions using special RPCs. This includes rebooting or shutting down the D server, spinning down its disk, and checking its free disk space.

Filesystem

The D server uses a GUID Partition Table and an ext4 filesystem on the data disk with a few custom mkfs flags I found after a few rounds of experimentation. I picked the “largefile” ext4 preset, since the disk mostly consists of 8MiB data chunks. Next, I lowered the reserved block percentage from 5% to 1%, since the disk isn’t a boot disk and serves only a single purpose. Finally, I disabled lazy initialization of the inode table and filesystem journal, since this initialization takes a long time on large disks and I wanted the disk to spin down as quickly as possible after initial installation.

Installation and bootstrapping

The D server is packaged for distribution as a Debian package, alongside a systemd service unit and some installation scripts. The microSD card containing the D server’s operating system is prepared using a custom bootstrapper program that I had originally built for my Raspberry Pi security cameras. It was fairly straightforward to extend it to support the HC2 as well. The bootstrapper first installs the vendor-provided operating system image (Ubuntu Linux in the case of the HC2) and then makes a number of customizations. For a D server, this includes disabling the SSH daemon, disabling automatic updates, configuring fstab to include the data disk, installing the D server Debian package, formatting and mounting the data disk, setting the hostname, and disabling IPv6 privacy addresses. Some of these bootstrapping steps can be implemented as modifications to the operating system’s root filesystem, but other steps need to be performed upon first boot on the HC2 itself. For these steps, the bootstrapper creates a custom /etc/rc.local script containing all of the required commands.

Once the microSD card is prepared, I transfer it to the HC2, and the rest of the installation is autonomous. A few minutes later, I can connect to the HC2 using dt to confirm its health and available disk space. The D server also contains a client for my custom IPv6-only DDNS service, so I can connect to the D server using the hostname I chose during bootstrapping without any additional configuration⁴.

Encryption and decryption throughput

Before my HC2 units came in the mail, I tested the D server on an Intel NUC and a Raspberry Pi. I had noticed that the Raspberry Pi could only achieve around 8MB/s of throughput, but I blamed this on some combination of the Pi 3 Model B’s 100Mbps ethernet port and the fact that I was using an old USB flash drive as the chunk storage device. So when the HC2 units finally arrived, I was disappointed to discover that the HC2 too could only achieve around 20MB/s of D write throughput, even in highly parallelized tests. I eventually traced the problem to the in-transit encryption I used in the D wire protocol.

I had originally chosen AES-128-GCM for the D protocol’s encryption in transit. This worked perfectly fine on my Intel NUC, since Intel processors have had hardware accelerated AES encryption and decryption for a really long time. On the other hand, the HC2’s ARM cores were a few generations too old to support ARM’s equivalent of Intel’s AES-NI instructions (the ARMv8 Cryptography Extensions), so the HC2 was doing all of its encryption and decryption using software implementations instead. I created a synthetic benchmark from the D protocol’s message encoding and decoding subroutines to test my theory. While my Intel-based prototype achieved speeds of more than a gigabyte per second, the HC2 could only achieve around 5MB/s in single-core and 20MB/s in multi-core tests. Naturally, this was a major concern. Both the hard drive and my gigabit ethernet home network should have supported speeds in excess of 100MB/s, so I couldn’t accept encryption being such a severe bottleneck. I ended up switching from AES to ChaCha20-Poly1305, which was supposedly promoted specifically as a solution for smartphone ARM processors lacking AES acceleration. This change was far more effective than I expected. With the new encryption algorithm, I achieved write speeds of more than 60MB/s on the HC2, which is good enough for now.

Connection pools and retries

The D client is designed to be robust against temporary network failures. All of the D server’s RPCs are idempotent, so basically all of the D client’s methods include automatic retries with exponential backoff. This includes read-only operations like Match and one-time operations like committing a file’s metadata chunk. For performance, the D client maintains a pool of TCP connections to the D server. The D protocol only allows one active RPC per connection, so the pool creates and discards connections based on demand. The file reader and matcher use multiple threads to fetch multiple chunks in parallel. However, the file writer can only upload a single chunk at a time, so callers are expected to upload multiple files simultaneously in order to achieve maximum throughput. D client operations don’t currently support timeouts, but there is a broadly applied one minute socket timeout to avoid operations getting stuck forever on hung connections. In practice, connections are rarely interrupted since all of the transfers happen within my apartment’s LAN.
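Retrying with exponential backoff is simple enough to sketch in a few lines of Go. The helper below is a generic illustration rather than the D client’s actual code: the attempt count, base delay, and function names are all assumptions. Blindly retrying like this is only safe because every RPC is idempotent.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var errTransient = errors.New("transient network error")

// retry calls op until it succeeds or attempts run out, doubling the delay
// between consecutive attempts.
func retry(attempts int, base time.Duration, op func() error) error {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	delay := base
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil {
			return nil
		}
		if i < attempts-1 {
			time.Sleep(delay)
			delay *= 2
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, err)
}

func main() {
	calls := 0
	err := retry(5, 10*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errTransient // pretend the first two RPCs fail
		}
		return nil
	})
	fmt.Println(calls, err)
}
```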

Hard drive and accessories

I bought two Seagate BarraCuda 8TB 5400RPM hard drives from two different online stores to put in my two HC2 units. These were around $155 each, which seemed to be the best deal on 3.5” hard drives in terms of bytes per dollar. I don’t mind the lower rotational speed, since they’re destined for backup and archival storage anyway. I also picked up two microSD cards and two 12V barrel power supplies, which are plugged into two different surge protectors. Overall, I tried to minimize the likelihood of a correlated failure between my two HC2 units, but since all of my D files would strictly be a second or third copy of data I had already stored elsewhere, I didn’t bother being too specific with this.

Pluggable storage backends

To make it easier to integrate D into my existing backup and archival programs, I defined a generic interface in Go named “rfile” for reading, writing, and listing files on a storage backend. I ported over my existing remote storage implementation based on Google Cloud Storage buckets, and I wrote brand new implementations for the local filesystem as well as the D client. Rfile users typically pick a storage backend using a URL. The scheme, host, and path are used to pick a particular storage backend and initialize it with parameters, such as the GCS bucket name, D server address, and an optional path prefix. For example, d://d1/Ef/ and file:///tmp/Ef/ might be storage backends used by Ef. I also wrote a generic class to mirror files from one storage backend to one or more other backends. Ef uses this mechanism to upload backups from its local filesystem store (using the file backend) to both GCS (using the GCS backend) and two D servers (using the D backend). I can also use this mechanism to replicate Ef backups from one remote backend to another. Luckily, Google Cloud Storage supports the same Castagnoli variant of crc32 that D uses, so this checksum can be used to ensure integrity during transfers between different backends.
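For a sense of what URL-based backend selection can look like, here is a small Go sketch. The Backend interface, the Target struct, and the gs scheme for GCS are all hypothetical stand-ins. Only the d:// and file:// examples come from the text above, and the real rfile API isn’t shown in this post.

```go
package main

import (
	"fmt"
	"io"
	"net/url"
)

// Backend is a hypothetical stand-in for the rfile interface: reading,
// writing, and listing files on some storage backend.
type Backend interface {
	Open(name string) (io.ReadCloser, error)
	Create(name string) (io.WriteCloser, error)
	List(prefix string) ([]string, error)
}

// Target holds the pieces of a backend URL needed to construct a Backend.
type Target struct {
	Scheme string // which implementation to use
	Host   string // D server address or bucket name; empty for local files
	Prefix string // optional path prefix
}

// parseTarget splits a backend URL and rejects schemes it doesn't know.
func parseTarget(raw string) (Target, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return Target{}, err
	}
	switch u.Scheme {
	case "d", "file", "gs": // "gs" is an assumed name for the GCS backend
		return Target{Scheme: u.Scheme, Host: u.Host, Prefix: u.Path}, nil
	}
	return Target{}, fmt.Errorf("unsupported storage scheme %q", u.Scheme)
}

func main() {
	for _, raw := range []string{"d://d1/Ef/", "file:///tmp/Ef/"} {
		target, err := parseTarget(raw)
		fmt.Printf("%+v %v\n", target, err)
	}
}
```

A factory function would then switch on Target.Scheme to construct the matching Backend implementation, which keeps callers like Ef unaware of where their files actually live.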

Security isolation

Like I mentioned earlier, I can’t log in to my HC2 units since they don’t run SSH daemons. I’ve also configured my router’s firewall rules to explicitly drop all traffic destined to my D servers other than ICMPv6 and TCP on the D server’s port number⁵. So, the attack surface for the D servers is essentially just the kernel’s networking stack and the D server code itself. My router also ensures that only computers on my LAN are allowed to connect via the D server protocol, and the D protocol itself ensures maximum network distance by tightly limiting the maximum round trip time allowed while establishing a connection.

The D server is written in Go and is simple enough to mostly fit in a single file, so I’m not particularly concerned about hidden security vulnerabilities. Even if such a vulnerability existed, the chunk data is mostly useless without the application-layer encryption keys used to encrypt the backups and archives stored within them. Plus, any attacker would first need a foothold on my apartment LAN to even connect to the D servers. Admittedly, this last factor is probably not that hard given all of the random devices I have plugged in to my network.

Conclusion

Currently, I’ve integrated D into two of my existing backup and archival programs. They’ve stored a total of 900GiB on each of my two HC2 servers. It’ll be a long time before I come close to filling up the 8TB hard drives, but when the time comes, my options are to either buy bigger hard drives or to get a third HC2 unit. Let’s assume that I still want every file stored in D to be replicated on at least 2 D servers. If I get a third HC2 unit, then load balancing becomes awkward. Since D chunks can’t be deleted, my optimal strategy would be to store new files with one replica on the new empty D server and the other replica on either of the two existing D servers. Assuming a 50/50 split between the two existing D servers, this scheme would fill up the new empty D server twice as quickly as the two full ones. Maybe in the future, I’ll create additional software to automatically allocate files to different D servers depending on their fullness while also trying to colocate files belonging to a logical group (such as archives of the same Ef backup set).

Well, I eventually gave in and just added a flag to the bootstrapper that enabled SSH access, but I only used this during development. ↩︎
The O_EXCL flag to open can be used to guarantee this without races. ↩︎
Well, from the buffer cache probably. Not actually from disk. ↩︎
To be honest, the DDNS client is probably an unnecessary security risk, but it’s really convenient and I don’t have a good replacement right now. ↩︎
This probably broke the DDNS client, but my apartment’s delegated IPv6 prefix hasn’t changed in years, so I’ll just wait until the D servers stop resolving and then lift the firewall restriction. ↩︎

The case for pro-death

Political polarization is a hot topic right now. People say it comes from misinformation or social media, but I think it goes deeper than that. So today, I’ll be playing devil’s advocate and explaining the perspective from the other side. I’ll tell you why they always seem to be not only wrong, but also stupid. I’ll show you that sometimes a difference of opinion is just a difference in values, so hopefully the next time you get the urge to high five your whole face, things will make a little more sense. You see, I’ve written before about how not everyone is interested in the truth, but I’ve also discovered that not everyone is interested in life either. By that, I mean things like helping people, reducing suffering, and generally preventing unnecessary deaths as much as possible. Hero work, you know? With so many people whining about climate change, racial injustice, and a global pandemic, it becomes harder and harder with each passing day for the pro-death party to get their point across.

More than a hundred thousand people die every day. I’m not talking about the ongoing pandemic. That’s just how many people need to die every day if you want to reach a total of seven billion after a hundred years or so. Most of them aren’t even dying from communicable diseases. They’re dying from heart disease or cancer or car accidents. Luckily, most of these people are strangers (just one of the perks of not having many friends). Their deaths don’t affect you. People have been dying in droves for hundreds of thousands of years, and there’s nothing you can do in the short term to change that. That’s why the pro-death party doesn’t care if a few thousand people die each year for unusual reasons like hate crimes or genocide. The number of victims is just too small to compare. But do you know what’s even worse than hundreds of thousands of deaths? A mild inconvenience for millions of living people.

Staying home means putting your life on pause. Even if you avoid getting sick, you’re still getting older. If you asked the whole country to stay home for six months, wouldn’t that basically be the same as murdering 1% of the population? Saving lives is important, but people want to know that they’re saving more lives than the equivalent amount of time they’re wasting at home. Being at home is basically the same as being dead, after all. That’s why the pro-death party wants to reopen. It doesn’t matter to them whether staying home really can save lives, since everybody is already dying slowly day by day. If it were up to them, everyone would just hurry up and go outside and get infected and die. That way, the lucky survivors could get on with their lives. Luckily, the virus doesn’t have any long-term effects or complications. I’ve heard it’s basically just like the flu, right?

Foggy evening at San Gregorio State Beach, before the pandemic started.

All the official guidance about face coverings and social distancing is based on the misconception that human lives are more valuable than haircuts. Of course, saving one life is probably worth itchy sideburns for a few months, but it all comes down to the conversion factor. Is saving a hundred lives worth skipping a million haircuts? The pro-death party doesn’t think so. It’s not that the lives don’t matter, but just that haircuts matter too. All haircuts matter. Besides, what’s the point of life if the beach is closed? I mean, it’s no big deal entertaining yourself in the backyard if it’s only for the weekend. But a whole fricken summer? That’s too steep a price to pay merely to save a bunch of immunocompromised retirees who were probably on the way out already. Anyways, isn’t having fun outside while gambling away your health basically the very essence of youth?

That’s why so many people have lost faith in our institutions. Climate scientists and health professionals keep telling us they’re trying to save lives, but they’ll stop at nothing to do it. Who even are these victims anyway? For all we know, they could be poor or brown or democrats. Until these so-called “experts” get their priorities straight, why should anyone listen to them? Fortunately, there’s been a major shift in our political leadership lately. Well, it’s about time! We need a government that makes everyone feel adequately represented, even bigots and racists. We need strong leaders to protect our civil liberties, restart our national economy, discontinue mandatory autism injections, and stop the proliferation of 5G mind control. If you agree, I urge you to join our death cult—I mean political party, and make America great again!

Sorry, this whole post was satire. If you resonated with any of it, then you’re a moron. ↩︎

Yet another Mac

Last year, I sold my laptop. I owned it for about 15 months, and in that time, I took it outside on maybe three occasions. The laptop hardly left my desk. I didn’t even open the lid, because most of the time, it was just plugged into an external monitor, speakers, keyboard, and mouse. So, I decided to ditch the laptop on craigslist and get a Mac mini instead. My first Mac mini broke, so I ordered another. The process was really simple. There’s three different Apple Stores within 15 minutes driving distance of my apartment, and all of them stock my base model Core i5 Mac mini. There’s even a courier service where Postmates will hand-deliver Apple products right to your door within 2 hours if you live close enough. I liked the extra processor cores, the extra I/O ports, the smaller desk footprint, and the lack of laptop-specific problems (like battery health, a fragile keyboard, or thermal constraints). So, for the first time in a while, I didn’t own a laptop¹ and everything was fine. But over the holidays, I found the old abandoned MacBook Pro² that I left at my parents’ house, and since I didn’t have a laptop, I reclaimed it.

Until now, I had only ever used one Mac and one Linux machine. I used the Mac for web browsing and writing code and watching videos, while I used the Linux machine for building code and staging big uploads. My workflow didn’t require two Macs, so I hadn’t really thought about how an extra laptop could be incorporated. I want to use the Mac mini while I’m at home, but also get an equivalent experience on the MacBook for the few occasions I need to leave the house with a computer. Luckily, I’ve worked on a few projects over the years that made it really easy to expand from one Mac to two. Specifically, I want:

The same installed software and settings
Bi-directional syncing of my append-only folder of memes and screenshots
My git repositories of source code and config files

The third item is pretty easy, since I’ve always pushed my git repositories to the home folder on my Linux machine (where it gets backed up into Ef), so I’ll just talk about how I accomplish the first two with a few bits of custom software.

Provisioning a Mac

When talking about data backups, I think a lot of people overlook their installed software and settings. “Setting up” your new computer might sound like a lot of fun, but since I re-install macOS on a 6-month cadence, it’d be tedious to re-install the same programs and configure the same settings every single time. Plus, a lot of Mac system settings don’t automatically sync between your computers, which means an inconsistent experience for anyone with multiple Macs. Luckily, there’s Homebrew and other software package managers for macOS. But my personal Mac provisioner “mini” goes one step further³.

I wrote mini with three goals. First, it should have a very realistic “dry run” mode, which will show me proposed changes before actually doing them. Second, it should be able to recover from most kinds of errors, because until my Mac is set up, I’d probably have a hard time fixing any bugs in its code. Third, it should automate every part of Mac setup that can reasonably be automated. I’ve seen lots of host provisioner software with superficial or missing “dry run” modes, which means running the provisioner carries a lot of risk of breaking your system. I wanted mini to be executed on a regular basis, so I could use it to keep the software on both of my Macs in sync.

Among the things that mini does are:

Install Homebrew packages (and Homebrew itself)
Link my dotfiles
Clone my git repositories and set their remotes appropriately
Extend the display sleep timer to 3 hours while on A/C power
Choose between natural scrolling (MacBook) and reverse scrolling (Mac mini)
Increase my mouse tracking and scrolling speed
Set my Dock preferences and system preferences
Configure passwordless sudo for the Mac firewall CLI
Disable IPv6 temporary addresses
Prepare GnuPG to work with my YubiKey
Change the screenshot image format from png to jpeg

When I set up a new Mac, the first thing I do is download the latest build of mini from my NUC and run it. Once it’s done, I’ll have my preferred web browser and all of my source code on the new machine, so I can continue with all of the provisioning steps that can’t be automated (but can be documented, of course). This includes things like installing apps from the Mac App Store and requesting certificates from my personal PKI hierarchy.

Like its predecessor, mini is written in Python. But since it’s no longer open source, I’m able to make use of all of my personal Python libraries for things like plist editing, flag parsing, and formatted output. I package this code into a single file using Bazel and the Google Subpar project, so it can be easily downloaded via HTTP to a new Mac.

Most of mini’s functionality is bundled into Task classes, which implement one step of the provisioning process (like installing a Homebrew package). There are no dependencies allowed between tasks, so the Homebrew package installation task will also install Homebrew itself if needed. All of the Task classes provide Done() → bool and Do(dry_run: bool) methods. Additional arguments can be passed to the constructor. Additionally, every task has a unique repr value, which is used for logging and to allow me to skip particular tasks with a command-line flag.
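
As a rough illustration of this task model, here is a small sketch of the Done/Do pattern with a dry run and a skip list. Note that mini itself is written in Python; this is translated to Go purely for illustration, and the setDefault task, the Run function, and the settings map are hypothetical stand-ins rather than anything from mini’s actual code.

```go
package main

import "fmt"

// Task mirrors the two methods described above, translated to Go. The
// String method plays the role of the unique repr value.
type Task interface {
	Done() bool
	Do(dryRun bool) error
	fmt.Stringer
}

// setDefault is a hypothetical task that sets one key in a settings map,
// standing in for something like a macOS preference.
type setDefault struct {
	settings   map[string]string
	key, value string
}

func (t *setDefault) Done() bool { return t.settings[t.key] == t.value }

func (t *setDefault) Do(dryRun bool) error {
	if dryRun {
		fmt.Printf("would set %s=%s\n", t.key, t.value)
		return nil
	}
	t.settings[t.key] = t.value
	return nil
}

func (t *setDefault) String() string { return "setDefault(" + t.key + ")" }

// Run executes each task that is neither skipped nor already done and
// returns the names of the tasks it acted on.
func Run(tasks []Task, skip map[string]bool, dryRun bool) ([]string, error) {
	var acted []string
	for _, t := range tasks {
		if skip[t.String()] || t.Done() {
			continue
		}
		if err := t.Do(dryRun); err != nil {
			return acted, fmt.Errorf("%s: %w", t, err)
		}
		acted = append(acted, t.String())
	}
	return acted, nil
}

func main() {
	settings := map[string]string{}
	tasks := []Task{
		&setDefault{settings, "screenshot.format", "jpeg"},
		&setDefault{settings, "display.sleep", "180"},
	}
	Run(tasks, nil, true)  // dry run: prints proposals, changes nothing
	Run(tasks, nil, false) // real run: applies both settings
	fmt.Println(settings)
}
```

Checking Done before Do is what makes it safe to run the whole list repeatedly: tasks that are already satisfied are simply skipped.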

For consistency, mini provides utility functions for running commands, taking ownership of files, and asking for user input. Some of these functions also implement their own dry run modes, so Task classes can decide which commands to run and which to just simulate. There’s also a framework for trace logging, which collects useful information for error reporting and debugging. There are lots of places where steps can be retried or skipped, which improves mini’s robustness against unexpected errors.

Syncing memes

I’ve always had a lot of skepticism about file syncing software. For example, if your sync client wakes up to find a file missing, then it’ll assume that you deleted it, which causes the deletion to be synced to your other computers. You won’t even notice this happening unless you regularly look at your sync activity. I think file syncing should be deliberate, so a human gets a chance to approve the changes, and explicit, so syncing only occurs when files aren’t being actively changed. Git repositories already have these two properties, since changes aren’t synced until you commit and push them. But file syncing software tends not to require this level of attention.

My file syncing needs are a bit unique, because I only add files. I never delete files, and I rarely ever edit files after the first version. These limitations make it easier to guarantee that my files aren’t corrupted, no matter how many times they’re transferred to new computers over the years. These files include scanned receipts, tax documents, medical records, screenshots, and lots of memes. In other words, they’re mostly small media files. On the other hand, I store frequently changed files in my git repositories, and I store big media files in my reduced redundancy storage system (so they can be offloaded when not needed). Previously, I just synced these files from my Mac to my Linux machine, where they’re backed up into Ef. Now, with the addition of my new Mac, I need to ensure that both of my Mac computers have updated copies of these files.

File syncing starts with the ingestion of new data into the sync folder. My “bar” script identifies new files on my desktop or in my Downloads folder, then moves them into the appropriate subfolder. This step also includes renaming files as needed (for example, timestamp prefixes for scanned documents, in order to ensure that file names are unique). Next, I pull down any new files from my Linux machine using rsync and the ignore-existing flag. I also pull down the SHA1 sums file from my Linux machine and put that in a temporary directory.

All changes to my sync folder need to be approved. My “foo” tool shows the proposed changes, represented as a delta on the SHA1 sums file. Any parts of the delta that are already included in the Linux machine’s sums file are marked as “expected”, since these reflect changes pushed by a different Mac. If the delta is accepted, then foo updates the local sums file and pushes changes to the Linux machine using rsync.
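
Here is a minimal sketch of how such a delta could be computed. It is not foo’s actual code: the Sums and Change types and the Delta function are hypothetical, and the rule for marking a deletion as “expected” (the path is also gone from the remote sums file) is my own assumption.

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
	"sort"
)

// Sums maps a relative file path to its hex-encoded SHA-1 digest.
type Sums map[string]string

// Change is one line of the proposed delta.
type Change struct {
	Path     string
	Kind     string // "added", "modified", or "deleted"
	Expected bool   // true if the remote sums file already reflects it
}

// sha1Hex returns the hex-encoded SHA-1 digest of data.
func sha1Hex(data []byte) string {
	sum := sha1.Sum(data)
	return hex.EncodeToString(sum[:])
}

// Delta compares the recorded sums against the files currently on disk
// (given here as path -> digest) and marks changes that the remote sums
// file already contains as expected.
func Delta(recorded, current, remote Sums) []Change {
	var changes []Change
	for path, digest := range current {
		old, ok := recorded[path]
		switch {
		case !ok:
			changes = append(changes, Change{path, "added", remote[path] == digest})
		case old != digest:
			changes = append(changes, Change{path, "modified", remote[path] == digest})
		}
	}
	for path := range recorded {
		if _, ok := current[path]; !ok {
			_, stillRemote := remote[path]
			changes = append(changes, Change{path, "deleted", !stillRemote})
		}
	}
	sort.Slice(changes, func(i, j int) bool { return changes[i].Path < changes[j].Path })
	return changes
}

func main() {
	recorded := Sums{"receipt.pdf": sha1Hex([]byte("old scan"))}
	current := Sums{
		"receipt.pdf": sha1Hex([]byte("old scan")),
		"meme.jpg":    sha1Hex([]byte("new meme")),
		"shot.jpg":    sha1Hex([]byte("screenshot")),
	}
	remote := Sums{
		"receipt.pdf": sha1Hex([]byte("old scan")),
		"shot.jpg":    sha1Hex([]byte("screenshot")), // pushed by the other Mac
	}
	for _, c := range Delta(recorded, current, remote) {
		fmt.Printf("%-9s %-12s expected=%v\n", c.Kind, c.Path, c.Expected)
	}
}
```

In an append-only folder, anything other than an “added” line (or an “expected” one) is a signal worth pausing on before accepting the delta.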

There are a lot of extra features baked into foo. For example, it identifies duplicate files, in case I accidentally save something under two different names. In addition to SHA1 sums, foo also stores file size and modification times, which it uses as a performance optimization to avoid reading unchanged files. Also, foo is generally a very fast multi-threaded file checksum program, which lets me quickly check files for data corruption. Did I mention it also supports GnuPG signing and verification of sum files? Anyway, let’s move on.

The sums file is the only file in the sync folder that’s ever modified in place, so it’s easy to tell at a glance when unexpected modifications or deletions occur. Additionally, Ef also uses an append-only data model, which makes it easy to recover old versions of files.

I also added an offline mode to the file syncing process, which skips the parts that require communication with the Linux machine. I run the whole process every night (when there’s new data to add) before bed, so the script is aptly named bedtime. With the addition of my new Mac, I run bedtime on my laptop when I turn it on, to get it caught up with the latest changes. If there’s new software to be installed, I’ll sync down a fresh build of mini and run that. And maybe I’ll give my git repositories a pull if I’m working on code. Altogether, these pieces make up a pretty robust syncing system for my two Macs that meets my standard for data integrity. How does it compare with your own methods? Let me know in the comments below.

Well, I’m not counting my work Chromebook or my Windows gaming laptop or my old ThinkPad that still works. ↩︎
It’s a 2015 model, which means it still has the good keyboard. ↩︎
The previous version of mini was open-source and called macbok. ↩︎

Tragedy and loss

Among my favorite anime, there are a few that I never ever recommend to others. It’s not the weird ones. It’s not even just the ecchi. I’m talking about shows that are so precious to me that I’d probably get upset if my recommendation just fell flat. But can you really blame them? Every work of art deserves at least your undivided attention, and yet, I’m guilty too. I’ll gladly watch seasonal shows while also playing Mario Kart or looking at my phone. In a pinch, I’ll even watch them on my phone. There’s so many new shows airing every week, and a lot of them are mediocre and uninspired. How do you expect anyone to keep up while also staying focused through all of them? But this seasonal grind just makes it even more special when a show goes above and beyond, grabs your attention, and forces you to watch even as you’re fighting sleep deprivation. That’s why it’s so frustrating. It’s easy to forget that every single show is the product of thousands of hours of hard work by hundreds of talented artists. People develop such deep emotional connections with their favorite anime, and when someone else just casually watches without the profound respect and appreciation commensurate with this level of accomplishment, it feels a little rude.

This May, I was fortunate enough to visit Uji, Kyoto during Golden Week, which is Japan’s annual week of national holidays. Aside from their well-known green tea, Uji is the home of Kyoto Animation. It also happens to be the inspiration for the setting of Hibike! Euphonium, one of my absolute favorite anime series. Of course, things in real life never look as good as they do in anime, but I was fascinated by the prospect of visiting. The town is centered around the Uji River, which flows south to north toward Kyoto. The Keihan Uji railway line follows this river as it passes by many of the stores, train stations, parks, and playgrounds featured in the show, as well as the studios of KyoAni themselves.

Uji-bashi bridge in Uji, Japan.

The staff at KyoAni are masters in nuanced storytelling and creating memorable settings with incredibly detailed background art. From its depiction in the show, Uji already felt like a familiar place to me before I had ever been there. I knew all the bridges, with their unique wood and stone construction. I knew the ins and outs of Rokujizo train station, with its enigmatic pair of pedestrian crossing buttons. I even knew the specific bench along the river bank where our main character likes to sit on the way home after school. These things are all a lot farther away from each other in reality than the show suggests, which is how I ended up accidentally walking 11km to complete my tour. But I enjoyed every moment of it. Unlike in other tourist destinations, I felt a personal connection to everything in Uji from the utility poles to the tiles on the ground. Eupho isn’t about heroes or prophecies. It’s a story about regular people living in a regular place in which nothing particularly unusual happens. Yet, KyoAni’s adaptation made these places special, not by exaggerating their natural qualities, but by framing them around unforgettable emotional scenes. In this light, even the smallest mundane things become precious. You can tell that the people of Uji really love their city, and now millions of fans across the world do too.

In fact, I encountered several other fans doing the exact same pilgrimage as me. They were all fairly easy to spot. In one case, I saw the same guy multiple times throughout the day, at the KyoAni store and near multiple bridges and shrines from the show. We talked a bit in broken Japanese and English. I also saw various people doing peculiar things, like photographing an empty bench. Or getting off the train, walking to a nearby bakery, inspecting the sign on the door, discovering it was closed for the holidays, and then heading back to the station. I’m sure they spotted me as well, as I left a train station to photograph a pedestrian crossing and then promptly headed back onto the train.

As the sun sank lower in the sky, I made my way up to the Daikichiyama observation deck to watch the sunset. On the way, I passed by two Shinto shrines, both of which appeared in the show and both of which were decorated with promotional materials about the Chikai no Finale sequel movie that was currently screening in local theaters (which I did watch that evening). The secluded mountain location and the iconic square gazebo made the deck feel like an actual sacred site for fans. I met another westerner at the top of the hike who was an exchange student doing physics research in Osaka. We chatted about the impact KyoAni had on our lives as we admired a familiar view of the city below.

The view from Daikichiyama observation deck.

For me, Eupho came a few months before I was scheduled to graduate from college, at a time when I was anxious about the future and about turning my hobby into a career. I was uncertain whether I’d continue to enjoy computer programming once it became work, and I questioned whether I was already stuck just mindlessly doing the one and only thing I was good at. This show could not have come at a better time for me. I watched as the main character, Kumiko, half-heartedly joins her school concert band, playing the same instrument she played throughout middle school, and as she gradually comes to understand the motivation of her peers in order to develop her own sense of purpose. With such a classic premise, I was entirely unprepared for the emotional impact this series would have. The scene in which Kumiko runs across Uji-bashi Bridge feeling a mix of frustration and renewed understanding brought me to tears. KyoAni brings together the perfect combination of incredible character animation, breathtaking background art, flawless compositing, compelling voice acting, and phenomenal writing to tell incredibly relatable and heartbreaking stories. There are no shortcuts and no cheap tricks here. KyoAni’s dedication to their craft is well known, and they deserve every bit of praise they get for it.

It’s hard to express just how beloved KyoAni is in the anime community. They aren’t yet a household name like Studio Ghibli, but they’re hardly just another anime studio. Their moe shows in the late 2000s influenced art styles throughout the entire industry, and their new releases are consistently ranked among the most anticipated series every season among fans. Bookstores in Kyoto are especially proud of their local animation shop. I saw KyoAni art books, character goods, and DVDs prominently displayed front and center in every bookstore I went to. This blog post has only been about my experience with one KyoAni show, but there are so many more I count among my favorites: Violet Evergarden, K-On!, Tamako Market, Dragon Maid, Kyoukai no Kanata, Amaburi, Haruhi, and Clannad. Plus, the studio’s first standalone feature-length film, Koe no Katachi, is my favorite movie. I’m so thankful KyoAni and their works exist in our world, and I’m glad to have been born in a time and place where people can so readily experience them.

Kumiko's bench in Uji, Japan (approximately 34.889612, 135.808309).

Given their massive influence, it’s easy to overlook how small a company Kyoto Animation really is, both in terms of people and revenue. I was surprised to see their iconic yellow studios so discreetly embedded in quiet suburban neighborhoods. Japan is one of the safest countries in the world, and Uji is the epicenter of so much love and goodwill from fans, so I just don’t understand. I don’t understand why I’m hearing what I’m hearing. This can’t possibly be the same Uji I loved visiting so much. It just doesn’t make any sense. It doesn’t. KyoAni is well known for caring so much about their employees, opting to pay salaries and to hire in-house in-betweeners. I can’t imagine how they must feel right now. I can’t fathom the extent of what’s been lost in one senseless tragedy. I can’t understand how we’ll get by if this disappears from the world. Please, take care of yourselves.

A data backup system of one’s own

One of the biggest solo projects I’ve ever completed is Ef, my custom data backup system. I’ve written about data loss before, but that post glossed over the software I actually use to create backups. Up until a year ago, I used duplicity to back up my data. I still think duplicity is an excellent piece of software. Custom backup software had been on my to-do list for years, but I kept postponing the project because of how good my existing solution was already. I didn’t feel confident that I could produce something just as sophisticated and reliable on my own. But as time passed, I decided that until I dropped third-party solutions and wrote my own backup system, I’d never fully understand the inner workings of this critical piece of my personal infrastructure. So last June, I started writing a design doc, and a month later, I had a brand new data backup system that would replace duplicity for me entirely.

This remains one of the only times I took time off of work to complete a side project. Most of my projects are done on nights and weekends. At the beginning, I made consistent progress by implementing just one feature per night. Every night, I’d make sure the build was healthy before going to bed. But it soon became clear that unless I set aside time dedicated to this project, I’d never achieve the uninterrupted focus to make Ef as robust and polished as I needed it to be¹. Throughout the project, I relied heavily on my past experience with filesystems and I/O, as well as information security, concurrent programming, and crash-consistent databases. I also learned a handful of things from the process that I want to share here, in case you’re interested in doing the same thing!

Scoping requirements

I think it’s a bad idea to approach a big software project without first writing down what the requirements are (doubly so for any kind of data management software). I started out with a list of what it means to me for files to be fully restored from a backup. In my case, I settled on a fairly short list:

Directory structure and file contents are preserved, of course
Original modification times are replicated
Unix modes are intact
Case-sensitive

In particular, I didn’t want Ef to support symbolic links or file system attributes that aren’t listed here like file owner, special modes, extended attributes, or creation time. Any of these features could easily be added later, but starting with a short list that’s tailored to my personal needs helped fight feature creep.

I also wrote a list of characteristics that I wanted Ef to have. For example, I wanted backups to be easily uploaded to Google Cloud Storage, which is where I keep the rest of my personal data repositories. But I also wanted fast offline restores. I wanted the ability to do dry run backups and the ability to verify the integrity of existing backups. I wanted to support efficient partial restores of individual files or directories, especially when the backup data needed to be downloaded from the cloud. Finally, I wanted fast incremental backups with point-in-time restore and I wanted all data to be securely encrypted and signed at rest.

Picking technologies

Do you want your backups to be accessible in 30 years? That’s a tall order for most backup systems. How many pieces of 30-year-old technology can you still use today? When building Ef, I wanted to use a platform that could withstand decades of abuse from the biting winds of obsolescence. I ended up picking the Go programming language and Google’s Protocol Buffers (disclaimer: I also work here) as the basis of Ef. These are the only two dependencies that could realistically threaten the project’s maintainability into the future, and both are mature well-supported pieces of software. Compare that to the hot new programming language of the month or experimental new software frameworks, both of which are prone to rapid change or neglect.

I also picked Google Cloud Storage as my exclusive off-site backup destination. Given its importance to Google’s Cloud business, it’s unlikely that GCS will ever change in a way that affects Ef (plus, I sit just a few rows down from the folks that keep GCS online). Even so, it’s also easy enough to add additional remote backends to Ef if I ever need to.

For encryption, I went with GnuPG, which is already at the center of my other data management systems. I opted to use my home grown GPG wrapper instead of existing Go PGP libraries, because I wanted integration with GPG’s user keychain and native pinentry interfaces. GPG also holds an irreplaceable position in the open source software world and is unlikely to change in an incompatible way.

Naming conventions

Ef backs up everything under a directory. All the backups for a given directory form a series. Each series contains one or more backup chains, which are made up of one full backup set followed by an arbitrary number of incremental backup sets. Each backup set contains one or more records, which are the files and directories that were backed up. For convenience, Ef identifies each backup set with an auto-generated sequence number composed of a literal “ef” followed by the current unix timestamp. All the files in the backup set are named based on this sequence number.

These naming conventions are fairly inflexible, but this helps reduce ambiguity and simplify the system overall. For example, series names must start with an uppercase letter, and the name “Ef” is blacklisted. This means that sequence numbers and Ef’s backup directory itself cannot be mistaken for backup data sources. Putting all the naming logic in one place helps streamline interpretation of command-line arguments and validation of names found in data, which helps eliminate an entire class of bugs.
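
The rules above are simple enough to sketch in a few lines. This is only an illustration, not Ef’s real code: the function names are hypothetical, and I’m assuming that “uppercase letter” means ASCII A through Z and that a sequence number is exactly “ef” followed by decimal digits.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// SequenceNumber builds a backup set identifier from a literal "ef" and a
// Unix timestamp in seconds.
func SequenceNumber(unixSeconds int64) string {
	return "ef" + strconv.FormatInt(unixSeconds, 10)
}

// IsSequenceNumber reports whether name is "ef" followed by one or more
// decimal digits.
func IsSequenceNumber(name string) bool {
	if !strings.HasPrefix(name, "ef") || len(name) == 2 {
		return false
	}
	for _, r := range name[2:] {
		if r < '0' || r > '9' {
			return false
		}
	}
	return true
}

// ValidSeriesName applies the two rules from the text: the name must start
// with an uppercase letter (assumed ASCII here), and "Ef" is reserved.
func ValidSeriesName(name string) bool {
	if name == "" || name == "Ef" {
		return false
	}
	return name[0] >= 'A' && name[0] <= 'Z'
}

func main() {
	seq := SequenceNumber(time.Now().Unix())
	fmt.Println(seq, IsSequenceNumber(seq))
	fmt.Println(ValidSeriesName("Documents"), ValidSeriesName("Ef"), ValidSeriesName(seq))
}
```

Since a valid series name always starts with an uppercase letter and a sequence number always starts with a lowercase “ef”, no string can pass both checks.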

On-disk format

Ef creates just two types of backup files: metadata and archives. Each metadata file is a binary encoded protobuf that describes a backup set, its records, and its associated archives. If a record is a non-empty file, then its metadata entry will point to one or more contiguous ranges in the archives that constitute its contents. Individual archive ranges are sometimes compressed, depending on the record’s file type. Each archive is limited to approximately 200MB, so the contents of large records are potentially split among multiple archives. Finally, both the metadata and archives are encrypted and signed with GPG before being written to disk.

Incremental backup sets only contain the records that have changed since their parent backup set. If a file or directory was deleted, then the incremental backup set will contain a special deletion marker record to suppress the parent’s record upon restore. If only a record’s metadata (modification time or Unix mode) changes, then its incremental backup record will contain no archive pointers and a non-zero file size. This signals to Ef to look up the file contents in the parent backup set.
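
To show how these rules combine on restore, here is a minimal sketch that folds a backup chain into its final set of records. The Record struct and Collapse function are hypothetical simplifications of Ef’s protobuf metadata, and the sketch assumes that every deleted path gets its own deletion marker.

```go
package main

import (
	"fmt"
	"sort"
)

// Record is a simplified stand-in for one entry in a backup set's metadata.
type Record struct {
	Path     string
	Mode     uint32
	Size     int64
	Archives []string // archive pointers; empty if only metadata changed
	Deleted  bool     // deletion marker
}

// Collapse folds a chain (full backup first, then incrementals in order)
// into the set of records that a restore would produce.
func Collapse(chain [][]Record) map[string]Record {
	state := map[string]Record{}
	for _, set := range chain {
		for _, rec := range set {
			switch {
			case rec.Deleted:
				// Suppress the parent's record.
				delete(state, rec.Path)
			case len(rec.Archives) == 0 && rec.Size > 0:
				// Metadata-only change: keep the parent's contents.
				if parent, ok := state[rec.Path]; ok {
					rec.Archives = parent.Archives
				}
				state[rec.Path] = rec
			default:
				state[rec.Path] = rec
			}
		}
	}
	return state
}

func main() {
	full := []Record{
		{Path: "notes.txt", Mode: 0644, Size: 10, Archives: []string{"a1"}},
		{Path: "old.txt", Mode: 0644, Size: 5, Archives: []string{"a1"}},
	}
	incremental := []Record{
		{Path: "notes.txt", Mode: 0600, Size: 10}, // chmod only
		{Path: "old.txt", Deleted: true},
	}
	state := Collapse([][]Record{full, incremental})
	paths := make([]string, 0, len(state))
	for p := range state {
		paths = append(paths, p)
	}
	sort.Strings(paths)
	for _, p := range paths {
		fmt.Printf("%s mode=%o archives=%v\n", p, state[p].Mode, state[p].Archives)
	}
}
```

The zero-pointers-but-non-zero-size case is what distinguishes a metadata-only change from a file that has genuinely become empty.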

Backup replication is as easy as copying these backup files to somewhere else. Initially, backup files are created in a staging area in the user’s home directory. Ef has built-in support for uploading and downloading backup files to/from a Google Cloud Storage bucket. For faster restores, Ef can also extract individual records from a backup set stored on GCS without first writing the backup files to disk.

Metadata cache

In addition to the backup files, Ef maintains a cache of unencrypted backup set metadata in the user’s home directory in order to speed up read-only operations. This cache is a binary encoded protobuf containing all of the metadata from all known backup sets. Luckily, appending binary protobufs is a valid way to concatenate repeated fields, so new backup sets can simply append their metadata to the metadata cache upon completion. Naturally, the metadata cache can also be rebuilt at any time using the original backup metadata files. This is helpful when bootstrapping brand new hosts and in case the metadata cache becomes corrupted.
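The append trick relies on the protobuf wire format: concatenating two serialized messages is equivalent to merging them, and repeated fields are concatenated in the process. The sketch below hand-encodes a hypothetical message with a single repeated bytes field (field number 1) using only the standard library, so the schema and function names are assumptions and not Ef's real cache format.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// tagField1 is the wire tag for field number 1 with the length-delimited
// wire type: (1 << 3) | 2.
const tagField1 = 0x0A

// appendEntry appends one length-delimited value for field 1 to buf.
func appendEntry(buf, payload []byte) []byte {
	var tmp [binary.MaxVarintLen64]byte
	n := binary.PutUvarint(tmp[:], uint64(len(payload)))
	buf = append(buf, tagField1)
	buf = append(buf, tmp[:n]...)
	return append(buf, payload...)
}

// parseEntries reads back every field 1 value, in order.
func parseEntries(data []byte) ([]string, error) {
	var out []string
	for len(data) > 0 {
		if data[0] != tagField1 {
			return nil, errors.New("unexpected field tag")
		}
		size, n := binary.Uvarint(data[1:])
		if n <= 0 || size > uint64(len(data)-1-n) {
			return nil, errors.New("truncated entry")
		}
		start := 1 + n
		end := start + int(size)
		out = append(out, string(data[start:end]))
		data = data[end:]
	}
	return out, nil
}

func main() {
	cache := appendEntry(nil, []byte("set-1")) // existing cache contents
	later := appendEntry(nil, []byte("set-2")) // a newly completed backup set
	cache = append(cache, later...)            // plain byte concatenation
	fmt.Println(parseEntries(cache))
}
```

On disk, the same effect can be had by opening the cache file in append mode and writing the new message's bytes at the end.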

Archiving

The Ef archiver coordinates the creation of a backup set. Conceptually, the archiver maintains a pipeline of stream processors that transform raw files from the backup source into encrypted, compressed, and checksummed archive files. Ef feeds files and directories into the archiver, one at a time. As the archiver processes these inputs, it also accumulates metadata about the records and the generated archives. This metadata is written to the metadata file once the backup is complete. The archiver also features a dry run mode, which discards the generated archives instead of actually writing them to disk. To support incremental backups, the archiver optionally takes a list of preexisting records, which are generated by collapsing metadata from a chain of backup sets.

The components of the archiver pipeline include a GPG writer, a gzip writer, two accounting stream filters (for the archive and for the record range), and the file handle to the output file itself. Although the archiver itself processes files sequentially, Ef achieves some level of pipeline parallelism by running encryption in a separate process (i.e. GPG) from compression and bookkeeping, which are performed within Ef. The archiver’s destination is technically pluggable. But the only existing implementation just saves archives to the user’s home directory.
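A stripped-down version of such a pipeline can be built from nested io.Writer values. In this sketch the GPG stage is left out entirely (it runs in a separate process), the destination is an in-memory buffer, and `countingWriter`, `archiveRecord`, and `roundTrip` are hypothetical names.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"strings"
)

// countingWriter is an accounting filter: it forwards writes and tallies
// how many bytes passed through.
type countingWriter struct {
	w io.Writer
	n int64
}

func (c *countingWriter) Write(p []byte) (int, error) {
	n, err := c.w.Write(p)
	c.n += int64(n)
	return n, err
}

// archiveRecord streams src through raw accounting -> gzip -> stored
// accounting -> dst, and reports both byte counts.
func archiveRecord(dst io.Writer, src io.Reader) (raw, stored int64, err error) {
	storedCount := &countingWriter{w: dst}
	zw := gzip.NewWriter(storedCount)
	rawCount := &countingWriter{w: zw}
	if _, err = io.Copy(rawCount, src); err != nil {
		return 0, 0, err
	}
	if err = zw.Close(); err != nil { // flushes the remaining compressed bytes
		return 0, 0, err
	}
	return rawCount.n, storedCount.n, nil
}

// roundTrip compresses s into a buffer and decompresses it again.
func roundTrip(s string) (string, int64, error) {
	var archive bytes.Buffer
	_, stored, err := archiveRecord(&archive, strings.NewReader(s))
	if err != nil {
		return "", 0, err
	}
	zr, err := gzip.NewReader(&archive)
	if err != nil {
		return "", 0, err
	}
	defer zr.Close()
	data, err := io.ReadAll(zr)
	return string(data), stored, err
}

func main() {
	out, stored, err := roundTrip(strings.Repeat("backup ", 1000))
	fmt.Println(len(out), stored, err)
}
```

Each stage only knows about the writer beneath it, which is what makes swapping the destination or inserting another filter straightforward.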

Slow mode

One of my original requirements for Ef was fast incremental backups. It’d be impossible to deliver this requirement if an incremental backup needed to read every file in the source directory to detect changes. So, I borrowed an idea from rsync and assumed that if a file’s size and modification time didn’t change, then its contents probably didn’t either. For my own data, this is true in virtually all cases. But I also added a “slow” mode to the archiver that disables this performance enhancement.
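The heuristic itself is tiny. Here is a minimal sketch, assuming a hypothetical `fileStat` struct holding the two values that get compared and a `slow` flag that forces a full read:

```go
package main

import "fmt"

// fileStat holds the cheap-to-read attributes used for change detection.
type fileStat struct {
	Size    int64
	ModTime int64 // modification time in Unix seconds
}

// mustReadContents reports whether the file's contents need to be read.
// prev is nil when the file has no record in the parent backup chain.
func mustReadContents(prev *fileStat, cur fileStat, slow bool) bool {
	if prev == nil || slow {
		return true
	}
	return prev.Size != cur.Size || prev.ModTime != cur.ModTime
}

func main() {
	prev := &fileStat{Size: 1024, ModTime: 1530000000}
	cur := fileStat{Size: 1024, ModTime: 1530000000}
	fmt.Println(mustReadContents(prev, cur, false)) // fast path
	fmt.Println(mustReadContents(prev, cur, true))  // slow mode
}
```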

Unarchiving

As you guessed, the Ef unarchiver coordinates the extraction of files and directories from backups. The unarchiver contains mostly the same pipeline of stream processors as the archiver, but in reverse order. Because the records extracted by the unarchiver can come from multiple backup sets, the unarchiver is fed a list of records instead. All desired records must be provided upfront, so the unarchiver can sort them and ensure directories are processed before the files and subdirectories they contain. The unarchiver also features a dry run mode, which makes the unarchiver verify records instead of extracting them.
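One simple way to get the "directories before their contents" ordering is a plain lexicographic sort on the path, since a parent's path is always a strict prefix of its children's paths and a prefix sorts first. This is a sketch with a hypothetical `record` type; the real sort order in Ef may differ.

```go
package main

import (
	"fmt"
	"sort"
)

// record is a hypothetical, minimal view of something to extract.
type record struct {
	Path  string
	IsDir bool
}

// sortForExtraction orders records so that every directory precedes the
// files and subdirectories inside it.
func sortForExtraction(records []record) {
	sort.Slice(records, func(i, j int) bool {
		return records[i].Path < records[j].Path
	})
}

func main() {
	records := []record{
		{"photos/2018/a.jpg", false},
		{"photos", true},
		{"photos/2018", true},
		{"notes.txt", false},
	}
	sortForExtraction(records)
	for _, r := range records {
		fmt.Println(r.Path)
	}
}
```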

The unarchiver takes a pluggable source for archive data. Ef contains an implementation for reading from the user’s home directory and an implementation for reading directly from Google Cloud Storage. Since the unarchiver is only able to verify the integrity of records, Ef contains additional components to verify archives and metadata. Since records and archives are verified separately, a verification request for a full backup set involves two full reads of the archives: once to verify that each record’s data is intact and once to verify that the archives themselves are intact. Verification of an incremental backup set is only slightly faster: record verification will still require reading archives older than the backup set itself, but archive verification will only need to verify the archives generated as part of the incremental backup set. Usually, I run verification once in record only mode and again in chain archive only mode, which verifies the archives of all backup sets in a backup chain.

Google Cloud Storage

This isn’t my first Go program that uploads data to Google Cloud Storage, so I had some libraries written already. My asymmetric home internet connection uploads at a terribly slow 6Mbps, so as usual, I need to pick a concurrency level that efficiently utilizes bandwidth but doesn’t make any one file upload too slowly, to avoid wasted work in case I need to interrupt the upload. In this case, I settled on 4 concurrent uploads, which means that each 200MB archive upload takes about 5 minutes.
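Capping concurrency like this is commonly done with a buffered channel used as a semaphore. The sketch below is one way to do it; `uploadAll` and the `upload` callback are hypothetical, and the callback stands in for whatever actually talks to GCS.

```go
package main

import (
	"fmt"
	"sync"
)

// uploadAll calls upload for every name, running at most limit uploads at
// the same time. The returned slice holds one error (or nil) per name.
func uploadAll(names []string, limit int, upload func(name string) error) []error {
	errs := make([]error, len(names))
	sem := make(chan struct{}, limit) // one slot per in-flight upload
	var wg sync.WaitGroup
	for i, name := range names {
		wg.Add(1)
		sem <- struct{}{} // blocks while limit uploads are already running
		go func(i int, name string) {
			defer wg.Done()
			defer func() { <-sem }()
			errs[i] = upload(name)
		}(i, name)
	}
	wg.Wait()
	return errs
}

func main() {
	names := []string{"a.archive", "b.archive", "c.archive", "d.archive", "e.archive"}
	errs := uploadAll(names, 4, func(name string) error {
		fmt.Println("uploading", name)
		return nil
	})
	fmt.Println(errs)
}
```

If the process is interrupted, at most `limit` partially uploaded files are lost, which is the trade-off described above.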

I usually let large uploads run overnight from my headless Intel NUC, which has a wired connection to my router. That way, uploads don’t interfere with my network usage during the day, and I don’t have to leave my laptop open overnight to upload stuff. If you have large uploads to do, consider dedicating a separate machine to it.

GCS supports sending an optional CRC32C for additional integrity on upload. It was easy enough to have Ef compute this and send it along with the request. Theoretically, this would detect memory corruption on whichever GCS server was receiving your upload, but I’ve never seen an upload fail due to a checksum error. I suppose it’s probably more likely for your own computer to fumble the bits while reading your backup archives from disk.
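Go's standard library can compute CRC32C directly with the Castagnoli polynomial table. The sketch below also shows the base64 encoding of the big-endian checksum bytes, which is how the GCS JSON API represents the value; the helper names are hypothetical and the upload request itself is not shown.

```go
package main

import (
	"encoding/base64"
	"encoding/binary"
	"fmt"
	"hash/crc32"
)

// castagnoli is the polynomial table for CRC32C.
var castagnoli = crc32.MakeTable(crc32.Castagnoli)

// crc32c returns the CRC32C checksum of data.
func crc32c(data []byte) uint32 {
	return crc32.Checksum(data, castagnoli)
}

// encodeCRC32C renders a checksum as base64 of its big-endian bytes.
func encodeCRC32C(sum uint32) string {
	var b [4]byte
	binary.BigEndian.PutUint32(b[:], sum)
	return base64.StdEncoding.EncodeToString(b[:])
}

func main() {
	sum := crc32c([]byte("123456789"))
	fmt.Printf("%08x %s\n", sum, encodeCRC32C(sum))
}
```

For large archives, `crc32.New(castagnoli)` returns a hash that can be fed incrementally, for example as one side of an `io.MultiWriter`.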

GPG as a stream processor

Fortunately, I had already finished writing my GPG wrapper library before I started working on Ef. Otherwise, I’m certain I wouldn’t have finished this project on time. Essentially, my GPG library provides a Go io.Reader and io.Writer interface for the 4 primary GPG functions: encryption, decryption, signing, and verification. In most cases, callers will invoke two of these functions together (e.g. encryption and signing). The library works by launching a GPG child process with some number of pre-attached input and output pipes, depending on which functions are required. The “enable-special-filenames” flag comes in handy here. With this flag, you can substitute file descriptor numbers in place of file paths in GPG’s command line arguments. Typically, files 3 to 5 are populated with pipes, along with standard output. But standard input is always connected to the parent process’s standard input, in order to support pinentry-curses.
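In Go, the extra descriptors can be attached with the ExtraFiles field of exec.Cmd, where entry i becomes file descriptor 3+i in the child. The sketch below only constructs a command for verifying a detached signature and does not run it. The function names and the choice of which descriptor carries what are assumptions, and the exact arguments used by the library described here are not known; the "-&n" form is GPG's documented syntax for naming a descriptor when special filenames are enabled.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// specialFilename formats a file descriptor number the way GPG expects when
// --enable-special-filenames is set.
func specialFilename(fd int) string {
	return fmt.Sprintf("-&%d", fd)
}

// gpgVerifyCommand builds (but does not start) a GPG invocation that reads
// status messages, a detached signature, and the signed data from pipes.
func gpgVerifyCommand(status, sig, data *os.File) *exec.Cmd {
	cmd := exec.Command("gpg",
		"--enable-special-filenames",
		"--status-fd", "3",
		"--verify", specialFilename(4), specialFilename(5),
	)
	// ExtraFiles[0] is fd 3 in the child, ExtraFiles[1] is fd 4, and so on.
	cmd.ExtraFiles = []*os.File{status, sig, data}
	// Keep the terminal attached so a curses pinentry can prompt the user.
	cmd.Stdin = os.Stdin
	return cmd
}

func main() {
	cmd := gpgVerifyCommand(nil, nil, nil) // real callers would pass pipe ends
	fmt.Println(cmd.Args)
}
```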

Both the GPG reader and writer are designed to participate as stream processors in an I/O pipeline. So, the GPG reader requires another io.Reader as input, and the GPG writer must write to another io.Writer. If the GPG invocation requires auxiliary streams, then the GPG library launches goroutines to service them. For example, one goroutine might consume messages from the “status-fd” output in order to listen for successful signature verification. Another goroutine might simply stream data continuously from the source io.Reader to the input file handle of the GPG process.
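A status listener of that kind can be sketched as a goroutine scanning lines from a reader. GPG prefixes machine-readable status lines with "[GNUPG:]" and uses keywords such as GOODSIG and BADSIG. The function names below are hypothetical, the sample lines are made up, and a real implementation would likely check more keywords than this.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// signatureOK drains a status stream and reports whether a good signature
// was seen and no bad or erroneous signature was reported.
func signatureOK(status io.Reader) (bool, error) {
	good, bad := false, false
	sc := bufio.NewScanner(status)
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) < 2 || fields[0] != "[GNUPG:]" {
			continue
		}
		switch fields[1] {
		case "GOODSIG":
			good = true
		case "BADSIG", "ERRSIG":
			bad = true
		}
	}
	return good && !bad, sc.Err()
}

// checkStatusText is a convenience wrapper for in-memory status output.
func checkStatusText(text string) bool {
	ok, err := signatureOK(strings.NewReader(text))
	return ok && err == nil
}

func main() {
	pr, pw := io.Pipe() // stands in for the pipe attached to status-fd
	result := make(chan bool, 1)
	go func() {
		ok, _ := signatureOK(pr)
		result <- ok
	}()
	fmt.Fprintln(pw, "[GNUPG:] NEWSIG")
	fmt.Fprintln(pw, "[GNUPG:] GOODSIG 0123456789ABCDEF Example User")
	pw.Close()
	fmt.Println("signature verified:", <-result)
}
```

The listener keeps reading until the stream closes, so the child process never blocks on a full status pipe.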

Modification time precision

Modern filesystems, including ext4 and APFS, support nanosecond precision for file modification times. However, not all tools replicate this precision perfectly². Rather than trying to fix these issues, I decided that Ef would ignore the subsecond component when comparing timestamps for equality. In practice, precision higher than one second is unlikely to matter to me, as long as the subsecond truncation is applied consistently.
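In Go, ignoring the subsecond component can be as simple as comparing Unix seconds. A minimal sketch with hypothetical helper names:

```go
package main

import (
	"fmt"
	"time"
)

// sameModTime compares two timestamps at one-second granularity, ignoring
// any nanosecond component.
func sameModTime(a, b time.Time) bool {
	return a.Unix() == b.Unix()
}

// at builds a timestamp from Unix seconds and nanoseconds.
func at(sec, nsec int64) time.Time {
	return time.Unix(sec, nsec)
}

func main() {
	precise := at(1530000000, 123456789) // as reported by the filesystem
	coarse := at(1530000000, 0)          // as preserved by a lossy tool
	fmt.Println(precise.Equal(coarse), sameModTime(precise, coarse))
}
```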

Command line interface

In case it wasn’t obvious, Ef runs entirely from the command line with no server or user interface component. I’m using Go’s built-in command line flag parser combined with the Google subcommands library. The subcommands that Ef supports include:

cache: Rebuild the metadata cache
diff: Print delta from latest backup set
history: Print all known versions of a file
list: Print all backup sets, organized into chains
pull: Download backup files from GCS
purge: Delete local copy of old backup chains³
push: Upload backup files to GCS
restore: Extract a backup to the filesystem
save: Archive current directory to a new backup set
series: Print all known series
show: Print details of a specific backup set
verify: Verify backup sets

Bootstrapping

Once the software was written, I faced a dilemma. Every 6 months, I reinstall macOS on my laptop⁴. When I still used duplicity, I could restore my data by simply installing duplicity and using it to download and extract the latest backup. Now that I had a custom backup system with a custom data format, how would I obtain a copy of the software to restore from backup?

My solution has two parts. Since I build Ef for both Linux and macOS, I can use my Intel NUC to restore data to my laptop and vice versa. It’s unlikely that I’ll ever reinstall both at the same time, so this works for the most part. But in case of an emergency, I stashed an encrypted copy of Ef for Linux on my Google Drive. Occasionally, I’ll rebuild a fresh copy of Ef and reupload it, so my emergency copy keeps up with the latest features and bug fixes. The source code doesn’t change very rapidly, so this actually isn’t much toil.

Comparison

Now that I’ve been using Ef for about 9 months, it’s worth reflecting on how it compares to duplicity, my previous backup system. At a glance, Ef is much faster. The built-in naming conventions make Ef’s command line interface much easier to use without complicated wrapper scripts. Since duplicity uses the boto library to access GCS, duplicity users are forced to use GCS’s XML interoperability API instead of the more natural JSON API. This is no longer the case with Ef, which makes authentication easier.

I like that I can store backup files in a local staging directory before uploading them to GCS. Duplicity allows you to back up to a local directory, but you’ll need a separate solution to manage transfer to and from remote destinations. This is especially annoying if you’re trying to do a partial restore while managing your own remote destinations. I can’t be sure which backup files duplicity needs, so I usually download more than necessary.

There are a few minor disadvantages with Ef. Unlike duplicity, Ef only stores files in their entirety instead of attempting to calculate deltas between subsequent versions of the same file with librsync. This makes Ef theoretically less efficient with storage space, but this makes almost no difference for me in practice. Additionally, there are plenty of restrictions in Ef that make it unsuitable to be used by more than one person. For example, file ownership is intentionally not recorded in the metadata, and timestamp-based sequence numbers are just assumed to be globally unique.

Overall, I’ve found that Ef fits really well into my suite of data management tools, all of which are now custom pieces of software. I hope this post gave you some ideas about your own data backup infrastructure.

In total, Ef represents about 5000 lines of code, not including some personal software libraries I had already written for other projects. I wrote about 60% of that during the week of Independence Day last year. ↩︎
In particular, rsync and stat on macOS don’t seem to consider nanoseconds at all. ↩︎
There is intentionally no support for purging backup chains from GCS, because the cloud is infinite of course. ↩︎
This helps me enforce good backup hygiene. ↩︎

Catching up

Hey reader. Happy new year? It has been eight months since I last posted anything on this blog. Is that the longest hiatus I’ve ever taken? Maybe. Probably not. I enjoyed writing those longer posts about computers and generally being a massive edgelord, but it also takes me literally weeks to finish writing a single one of those. That’s time that I just haven’t been able to set aside recently. But I’m back now. I don’t really have anything in particular that I want to write about, so I guess I’ll just go over a few things that happened since the last time we spoke.

Side hustle

Traffic grew last year. I suppose that means the human population is still increasing. It usually does. I’m not sure what more to say about that, except that I’m glad my website infrastructure is built to easily take care of it. In the weeks leading up to Christmas last year, my website was getting upward of a hundred requests per second, for a total of more than 53 million requests that month (although, that includes all the static assets as well). A big part of that growth has come from Canada, which now makes up about one-third of my traffic in the month of January. I guess they take their final exams later?

In any case, I also added new features to the calculator, like the “Advanced Mode”, which answers like 80% of the questions I’ve been asked in comments over the last few years. I redesigned some parts of the user interface and adopted a new responsive ad format from Adsense¹. On mobile, responsive ads can spill into the left and right margins to take up the full width of the screen. The creatives for responsive ads tend to be higher quality and higher paying (as I’ve unscientifically measured). But sometimes when the ad units fall back to fixed-size creatives, they get left aligned, and then the whole “full width” thing just looks silly.

Most of my work has actually been behind the scenes. I’ve switched to binary encoded protobufs for the backup system, which greatly improved performance and storage space efficiency. I’ve fixed dozens of little bugs. I’ve implemented and migrated to a new object relational mapping system based off the ideas in this post. I scheduled a new background task that does peer-to-peer data integrity checks between replicas (this one had actually been implemented for a while, but the original design consumed too much cross-region network bandwidth to justify the cost). More recently, I’ve been working on automating more of the provisioning and seeding of new replicas to eliminate some of the toil involved with maintaining them.

I’m satisfied with the progress I’ve made over these last few months, but it’s been a constant struggle to prioritize any of this work. After I get home from my regular job, I’m usually tired of doing computer stuff. Even on the weekends, I tend to focus more on shorter projects, like my personal data management infrastructure and my chrome extensions. Last July, I took a whole week off in order to work on Ef, a custom data backup program I made to replace duplicity (I should write more about this in a separate post), because I knew I’d never have enough time to finish it if I didn’t make it a priority. I’m thankful that my schedule’s flexible enough to take time off whenever I need to, but it’s hard to imagine website stuff demanding the same level of urgency.

Changing teams

After two years, I’ve transferred to a different team at Google. I work on Google Cloud Platform now, which is amusing, because I’ve been using Compute Engine to host this website since before I even graduated from college (I’ve written about this before). So, transferring to the Cloud org made a lot of sense. As an insider, I’ve gained even more appreciation for all the engineering that goes into producing a reliable product like Compute Engine. Plus, being both a customer and an engineer offers advantages on both sides. For example, I’m automatically my own 24/7 tech support, and my experience as a customer means I’m more familiar with the different product offerings than most of my peers. I also sit just a few rows down from the folks that run a lot of the systems I rely on for my personal data management infrastructure, which gives me way more confidence in the system’s reliability (perhaps unreasonably so).

As part of the transfer, I traveled to Los Angeles and New York last fall, my first business trip (shudders). In the past, I declined every opportunity for work-related travel since I never felt like the expensive plane tickets and hotels and the inconvenience of working exclusively on a laptop were justified just to meet people that I could already chat with online anyway. But I don’t regret those two trips, especially my two week trip to New York. In that time, I saw four of my high school friends, and I spent more time being a tourist on nights and weekends than I did when I actually visited as a tourist in 2009 (please don’t click). I also met, for the first time, a bunch of the New York based engineers that I had been working with online for several months (and whom I had referred to only by their usernames). Living in Manhattan for ten days also cleared up a lot of misconceptions I had about what the city was like. Now I know that it’s unlikely I’ll ever want to live there permanently.

After coming back from New York, I don’t think I ever got over the jet lag. I’ve been waking up and going to work more than an hour earlier than before. On the whole, this has been great. There’s less traffic early in the morning. There’s also more parking, and this new 8-to-4 schedule means I can leave work while the sun is still shining (yes, even in the winter). Since my new team is in a different office building, I’ve also started driving to work instead of taking the FREE CORPORATE SHUTTLE. It’s nice to have more control over my own schedule, and the whole trip only takes about 10 minutes from parking spot to parking spot. On the downside, the mileage on my grocery hauler has been increasing noticeably faster.

Fake news

I worked at my student newspaper for a few years (although, not as a writer), so I’m somewhat familiar with the art of saying things that are technically true, but totally misleading. There’s all kinds of tricks to it. Like, when you can’t be bothered to look up real statistics, so you say some (i.e. at least one) or many (i.e. at least more than one). Or when you correctly explain that Wi-Fi could cause cancer, even if there’s no real reason to believe it actually does. Or when people say two unrelated facts that, when presented together, suggest conclusions that are actually totally wrong. And how about that thing where people try to define compound words by what they sound like they should mean? That’s more annoying than misleading, I suppose.

That’s why I’m really glad the CrashCourse folks are publishing a weekly “Navigating Digital Information” series on YouTube. I’m convinced that communicating truth requires not only facts, but also honest intentions. In the last few months, I’ve seen so many of these slimy techniques used in reporting about technology news. Most of the time, it isn’t even intentional. I think a lot of tech journalists are (understandably) not engineers, but just enthusiastic consumers. Since news tends to oversimplify things, a lot of the nuance and context of the software development process tends to get lost. And that would be totally fine, if all journalists were only interested in conveying a fair and honest approximation of the truth. It’s too easy to just dismiss all reporting as clickbait. In reality, lots of honest journalists still want their reporting to be newsworthy, and so nobody really feels bad about crafting facts into a story, even when there really isn’t one. I think a lot of controversy over software bugs and design proposals tends to be unjustified. But until our culture decides that it’s actually unethical to report this way, I’m forced to close this as won’t fix.

Patreon

It’s actually been more than a year since I started using Patreon (as a patron, not a creator). But it was only a few months ago that I really started ramping up how much I paid into it. The creators that I follow are a mix of musicians, YouTube channels² (like CrashCourse), and computer programmers that I think are doing great work. The free market doesn’t always cooperate with people trying to make a career out of their creativity. But that doesn’t mean their work isn’t worth supporting. Of course, Patreon isn’t just donations either. There’s lots of creators that put out really great exclusive content for patrons, and I think contributing directly until they figure out a real path to profitability (or maybe just indefinitely — that’s okay too) goes a long way to supporting more good stuff in the world.

My strategy is to start with the lowest tier for anyone that I’m even a little interested in. If they publish something I really like, I’ll make sure to bump that up. I think it’s one of the few feedback mechanisms that isn’t influenced by things like advertiser friendliness. So if you’ve got disposable income and you’re interested at all in voting with your dollar, I’d encourage you to give it a try.

Reading

Just kidding. I stopped reading books almost entirely over the last few months, aside from mandatory book club reading. Instead, I’ve been watching lots and lots of chinese cartoons. I think I spend about 5% of my waking hours watching anime, and another few hours reviewing memes or listening to podcasts afterward. In a typical season, I’ll usually follow between 10 and 15 weekly airing shows. But many of my favorites from the last few months have actually been older stuff: Mob Psycho 100, SHIROBAKO, and The Tatami Galaxy (and its superb movie spin-off).

Also in the last few months, I went to a con for the first time. It was Crunchyroll Expo, which for the last 2 years, has taken place down the street from where I live. I’ve always been frustrated with Crunchyroll as a technology company, but I think they did a great job attracting a lot of top talent to their convention. I’m still a little miffed that nobody asked the darlifra folks about the aliens though.

I’ve also been trying to read more manga. I signed up for Shonen Jump’s monthly subscription service (although I’ve only ever used it to reference old chapters of Naruto). Aside from big shonen titles like Shingeki no Kyojin and Boku no Hero Academia, I don’t actually read much manga.

In any case, it’s midnight at the end of the month, so I’m going to stop writing here. See you next month.

Disclaimer: I work here, but not on ads. ↩︎
I’ve spent more time recently watching super long-form content on YouTube. I think 1 to 2 hours is a nice sweet spot, where I can reasonably finish the video in a single sitting, but not so short that I need to constantly queue up more content. ↩︎

Pretty good computer security

I don’t cut my own hair. I can’t change my own engine oil. At the grocery store, I always pick Tide and Skippy, even though the generic brand is probably cheaper and just as good. “Well, that’s dumb,” you say. “Don’t you know you could save a lot of money?”, you say. But that isn’t the point. Sure, my haircut is simple. I could buy some hair clippers, and I’d love to save a trip to the dealership every year. But there are people who are professionals at cutting hair and fixing cars, and I wouldn’t trust them to program their own computers. So why should they trust me to evaluate different brands of peanut butter?

Fortunately, they’ve made the choice really simple, even for amateurs like me. In fact, everyone relies on modern convenience to some extent. Even if you cut your own hair and even if you’re not as much of a Skippy purist, you’d have to admit. People today do all kinds of crazy complicated tasks, and none of us has time to become experts at all of them.

Take driving, for example. Cars are crazy complicated. Most people aren’t professional drivers. In fact, most people aren’t even halfway decent drivers. But thanks to car engineers, you can do a pretty good job at driving by remembering just a few simple rules¹. There are lots of other things whose simplicity goes unappreciated: filing taxes and power outlets and irradiating food with freaking death rays (okay, yes, I’m just naming things I see in my immediate vicinity). We’re lucky that people long ago had the foresight to realize how important/dangerous these things could be, and so they’ve all been aggressively simplified to protect people from themselves.

Are we done then? Has everything been child proofed already? It’s easy to believe so. But if you’re unfortunate enough to still own a computer in 2018, then you’ll know that we still have a ways to go. Computers are deceptively simple. Lots of people use them: adults and little kids and old folks. But despite their wide adoption, it’s still far too easy to use a computer the wrong way and end up in a lot of trouble. Here are a few things I’ve learned about computer security that I want to share. I won’t claim that following these steps will make you invulnerable, because that’s an unattainable goal. But I think these form a pretty good baseline, with which you can at least say you’ve done your due diligence.

Stop using a computer

I’m serious. How about an iPad? Not an option? Keep reading.

Use multi-factor authentication

This is the number one easiest and most effective thing you can do to improve your security posture right away. Many websites offer a feature where you’re required to type in a special code, in addition to your username and password, before you can log in. This code can either be sent to you via text message, or it can be generated using an app on your smartphone². Multi-factor authentication is a powerful defense against account hijacking, but make sure to keep emergency codes as a backup in case you lose your MFA device.

Get a password manager

People are bad at memorizing passwords, and websites are bad at keeping your passwords safe. You should adopt some method that allows you to use a different, unpredictable password for each of your accounts, whether that’s a password manager or just a sheet of notebook paper with cryptic password hints (but see my post about data loss).

Try post-PC devices

Mobile operating systems (iOS and Android) and Chromebooks are the quintessential examples of trusted platforms with aggressively proactive protection (hardware trust anchors, signed boot, sandboxing, verified binaries, granular permissions). Most people already know this. But perhaps it’s not so obvious how poorly some of these protections translate to traditional programs.

Every program on your computer interacts with the file system, and while most well-behaved programs keep to themselves, every single one technically has access to all of your files. This puts everything on your filesystem, like your documents and even your browser cookies³, at risk. Programs have evolved with this level of flexibility, so filesystem access isn’t something that can easily be taken away. There have been attempts to sandbox traditional computer programs, like those found on the Mac App Store and Windows Store. But these features are strictly opt-in, and most computer programs will likely never adopt them.

There’s plenty of stuff to steal in the filesystem, but that’s not all. Modern operating systems offer debugging interfaces, which allow you to read and write the memory of other programs. They also offer mechanisms for you to read the contents of the clipboard, or take screenshots of other programs, or even control the user’s input devices. These all sound like terrifying powers in the hands of a malicious program. But don’t forget that poorly-written and insecure programs can just as easily undermine your security, by allowing remote or web-based attackers to take control of them.

Because mobile and web-based platforms are newer, they’ve been able to lock down these interfaces. These new platforms don’t need to support a long legacy of older programs that expect such broad and unfettered access. Each mobile app typically gets an isolated view of the filesystem, where it can only access the files relevant to it. Debugging is disabled, and screenshots are a privilege reserved for system apps.

Don’t type passwords in public

If you film somebody typing their password, it’s surprisingly easy to play it in slow motion and see each key as it’s being pressed. This is especially true if you have a high frame rate camera, or if the victim is typing their password into a smartphone keyboard (or is just really slow at typing). Even if the attacker can’t see your fingers, there’s lots of new research in using acoustic or electromagnetic side channels to record typing from a distance. When you have to type your password, try covering your fingers by closing your laptop halfway, or go to the bathroom and do it there.

Don’t type passwords at all

Lots of people use a simple 6-digit PIN or a pattern to unlock their smartphone. At a glance, this seems safe, because the number of possible PINs and patterns greatly exceeds how many times the phone lets you attempt to unlock it. But 10-digit keypads and pattern grids are also easily discernible from a long distance. Plus, phones that use at-rest encryption typically derive encryption keys from the unlock passphrase. There isn’t enough entropy in your 6-digit PIN or pattern to properly encrypt your data.

If you care about locking your phone securely, then you should use its fingerprint sensor and set up a long alphanumeric passphrase. You’ll rarely need to actually type your passphrase, as long as your fingerprint sensor works correctly. Thieves can’t steal your fingerprint (unless they’re really motivated), and you can rest assured that your strong password will protect your phone’s contents from brute force attacks.
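To put rough numbers on the entropy gap, here is a small Go calculation. It assumes every character is chosen uniformly at random, which real PINs and passphrases rarely are, so treat the results as upper bounds. The 12-character length is just an example.

```go
package main

import (
	"fmt"
	"math"
)

// entropyBits returns the entropy, in bits, of a secret made of length
// symbols drawn uniformly at random from an alphabet of the given size.
func entropyBits(alphabet, length int) float64 {
	return float64(length) * math.Log2(float64(alphabet))
}

func main() {
	fmt.Printf("6-digit PIN:              %.1f bits\n", entropyBits(10, 6))
	fmt.Printf("12-char alphanumeric:     %.1f bits\n", entropyBits(62, 12))
	fmt.Printf("possible 6-digit PINs:    %.0f\n", math.Pow(10, 6))
}
```

A million possibilities is enough to outlast a lockout counter on the phone itself, but it is a trivially small search space for anyone who can attack the encrypted data offline.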

Full disk encryption

If you have a computer password, but you don’t use full disk encryption, then your password is practically useless against a motivated thief. In most computers, the storage medium can be easily removed and transplanted into a different computer⁴. This technique allows an adversary to read all your files without needing your computer password.

To prevent this, you should enable full disk encryption on your computer (FileVault on macOS, or BitLocker on Windows). Lots of modern computers and smartphones enable full disk encryption by default. Typical FDE implementations use your login password and encrypt/decrypt files automatically as they’re read from and written to disk. Once the computer is shut down, the key material is forgotten and your data becomes inaccessible. Since the encryption keys are derived from your login password, you should make sure that it’s strong enough to stand up to brute force attacks.

Don’t use antivirus software

What is malware? Antivirus software uses a combination of heuristics (patterns of behavior) and signatures (file checksums, code patterns, C&C servers, altogether known as indicators of compromise) to detect bad programs. It’s really effective in stopping known threats from spreading on vulnerable computer networks (picture an enterprise office network). But nowadays, all that antivirus software buys you is some peace of mind, delivered via a friendly green check mark.

A doctor in the United States wouldn’t recommend you buy a mosquito net. Similarly, antivirus software is a superfluous and unsuitable countermeasure for modern-day threats. It doesn’t protect against OAuth worms. It doesn’t protect against network-exploitable RCEs. It doesn’t protect against phishing (and when it tries to, it usually just makes everything worse). Since you’re reading this blog post, I assume you’re at least making an effort to improve your security posture. That fact alone puts you squarely in the group of people who don’t need antivirus software. So get rid of it.

Get a gaming computer

Games need to be fun. They don’t need to be secure. Given the amount of C++ code found in typical high-performance 3D games, I think it’s fairly likely that most of them have at least one or two undiscovered network-exploitable memory corruption vulnerabilities. Plus, lots of games come bundled with cheat detection software. These programs typically use your operating system’s debugging interfaces to peek into the other programs running on your computer. I like computer games, but I’d be uncomfortable running them alongside the programs that I use to check my email, fiddle with money, and run my servers. So if you play games, you should consider getting a separate computer that’s specifically dedicated to running games and other untrusted programs.

Get an iPhone

You need a phone that receives aggressive security updates. You need a phone that doesn’t include obscure unvetted apps, preloaded by the manufacturer. You need a phone with million dollar bounties for security bugs. You need a phone made by people who take pride in their work, people who love presenting about their data encryption schemes, people with the courage to stand up against unreasonable demands from their government.

Now, that phone doesn’t need to be an iPhone, per se. But that’s the phone I’d feel most comfortable recommending to others. And if that’s not an option, maybe a Google Pixel?

Research your software

Who makes your software? Do you know? Have you met them? (If you live in Silicon Valley, that’s not such a rhetorical question.) Because security incidents happen all the time, it’s easy to think that they’re random and unpredictable, like getting struck by lightning. But if you pay attention, you’ll notice that a lot of incidents can be attributed to negligence, laziness, or apathy. Software companies that prioritize security are quite rare, because security is a never ending quest, it costs lots of money, and in most cases, it produces no tangible results.

If you’re a computer programmer, you should get in the habit of reading the source code of programs that you use. And if that’s unavailable, then you can use introspection tools⁵ to reverse engineer them. After all, how can you trust a piece of software to be secure if you’ve never even tried to attack it?

You should also research the software publishers, and decide which ones you do and don’t trust. Do they have a security incident response team? Do they have a security bug bounty? A history of negligent security practices? Are they known for methodical engineering? Or do they prefer getting things done the quick and easy way? If you only run programs written by publishers that you trust, then you can greatly limit your exposure to poor engineering.

Use secure communication protocols

The gold standard in secure communication is end to end encryption with managed keys. This is where your messages are only readable by you and the recipient, but you rely on a trusted third party to establish the identity of the recipient in the first place. You don’t always need to use end to end encryption, since other methods might be more widely adopted or convenient.

Don’t use browser extensions

Browser extensions are so incredibly useful, but I don’t feel comfortable recommending any browser extension to anybody (except those that don’t require any special permissions). There have been so many recent incidents where popular browser extensions were purchased from the original developer, in order to force ads and spyware onto their large preexisting user base. Browser extensions represent one of the few security weaknesses of the web, as a platform. So, you should only trust browser extensions published by well-established tech companies or browser extensions that you’ve written (or audited) yourself.

Turn off JavaScript

I mentioned before that websites run in an isolated sandbox, away from the rest of your computer. That fact supposedly makes web applications more secure than traditional programs. But as a result, web browsers don’t hesitate to boldly execute whatever code they’re fed. To reduce your exposure, you should consider only allowing JavaScript for a whitelist of websites where you actually need it. For example, there are plenty of news websites that are perfectly legible without JavaScript. They’re also the sites that use sketchy ad networks full of drive-by malware, probing for vulnerabilities in your browser.

Browsers allow you to disable JavaScript by default and enable it on a site-by-site basis. When you notice that a site isn’t working without JavaScript, you can easily enable it and refresh the page.
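The default-deny rule described above can be sketched in a few lines of Python. This is only a model of the decision a browser makes from its site settings, not any browser's actual implementation; `ALLOWED_SITES` and `js_allowed` are hypothetical names, and the `.test` hostnames are placeholders.

```python
from typing import Collection
from urllib.parse import urlsplit

# Hypothetical allowlist: the few sites where JavaScript is actually needed.
ALLOWED_SITES = frozenset({"example-bank.test", "mail.example.test"})


def js_allowed(url: str, allowed_sites: Collection[str] = ALLOWED_SITES) -> bool:
    """Return True only if the URL's host is an allowed site or one of its subdomains."""
    host = (urlsplit(url).hostname or "").lower().rstrip(".")
    return any(host == site or host.endswith("." + site) for site in allowed_sites)


if __name__ == "__main__":
    print(js_allowed("https://mail.example.test/inbox"))       # True: listed site
    print(js_allowed("https://news.example.test/story"))       # False: not listed, so JavaScript stays off
    print(js_allowed("https://example-bank.test.evil.test/"))  # False: lookalike host
```

Everything that isn't on the list gets no JavaScript, which is the whole point: a new or unknown site has to earn its entry.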

Don’t get phished

The internet is full of liars. You can protect yourself by paying attention to privileged user interface elements (like the address bar), using bookmarks instead of typing URLs, and maintaining a healthy amount of skepticism.
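As a rough illustration of why the address bar deserves a careful read, the Python sketch below pulls the real hostname out of a few deceptive links and compares it against a bookmarked site. The URLs and the `real_host` and `matches_bookmark` helpers are made up for this example.

```python
from urllib.parse import urlsplit


def real_host(url: str) -> str:
    """Return the hostname the browser would actually connect to."""
    return (urlsplit(url).hostname or "").lower()


def matches_bookmark(url: str, bookmarked_url: str) -> bool:
    """True only when the link points at exactly the bookmarked host."""
    return real_host(url) != "" and real_host(url) == real_host(bookmarked_url)


if __name__ == "__main__":
    bookmark = "https://accounts.example.test/login"

    # The familiar name is only a subdomain label of somebody else's domain.
    print(real_host("https://accounts.example.test.evil.test/login"))
    # prints: accounts.example.test.evil.test

    # Everything before the "@" is a username, not the site.
    print(real_host("https://accounts.example.test@evil.test/login"))
    # prints: evil.test

    print(matches_bookmark("https://accounts.example.test/reset", bookmark))  # True
```

Both fake links start with the name you expect to see, which is why skimming the first few characters of a URL isn't enough.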

Get ready to wipe

If you suspect that malware has compromised your computer, then you should wipe and reinstall your operating system. Virus removal is a fool’s errand. As long as you have recent data backups (and you’re certain that your backups haven’t also been infected), then you should be able to wipe your computer without hesitation.

One pedal goes vroom. The other brakes. Clockwise goes to your right, and counter-clockwise, to your other right. If the dashboard lights up orange, pull over and check your tires. These simple rules are all you need to know in order to drive a car. The other parts don’t really matter. We expect cars to Just Work™, and for the most part, they do. ↩︎
Or a U2F security key, if you really love spending money. ↩︎
Encryption using kernel-managed keys makes this more difficult. ↩︎
Soldered components can make this harder. ↩︎
Look at its open files, its network traffic, and its system calls. ↩︎

The rules are unfair

It’s 11pm. I should go to sleep. But sometimes, when it’s 11pm and I know I should go to sleep, I don’t. Instead, I stay up and watch dash cam videos on the internet. Nowadays, you can pop open your web browser and watch what are perhaps the worst moments of someone’s life, on repeat, in glorious high definition. It’s tragic, but also viscerally entertaining and conveniently packaged into small bite sized clips for easy consumption. Of course, not all the videos involve car accidents. Some of them just show bad drivers doing stupid things that happened to be caught on camera. In any case, I’ll probably read some comments, have a chuckle, and then eventually feel guilty enough to go to bed.

Dash cams are impartial observers. They give us rare glimpses into the real world, without the taint of unreliable human narrators¹. Unlike most internet videos, there aren’t any special effects or editing. Dash cams don’t jump around from one interesting angle to the next, and they don’t come with cinematic, mood-setting background music (unless the cammer’s radio happens to be on). Instead, dash cams just look straight forward, paying as much attention to the boring as to the interesting, recording a small slice of history with the utmost dedication, until some impact inevitably knocks them off their windshield mount.

You can analyze dash cam footage, frame by frame, to see exactly why things played out the way they did. Most of us drive every day, and most of us know the risks, but these videos don’t bother us. We convince ourselves that “that could never happen to me, because I don’t follow trucks that closely” or “I always look both ways before entering a fresh green light”. Even when an accident or a near miss isn’t the cammer’s fault, we can still usually find something to blame them for. Somebody else may have caused the accident, but perhaps it was the cammer’s own poor choices that created the opportunity to begin with. Who knows? If the cammer had driven more defensively, then maybe nothing would have happened, and there would be one fewer video delaying my bedtime this evening. It’s only natural for us to demand such an explanation. We want evidence that the cammer deserved what happened to them. Because why would bad things just happen to not-bad people for no reason? That makes no sense. We’re not-so-bad people ourselves, after all. And if we can’t find fault with the cammer, then why aren’t all those same terrible things happening to us?

A few weeks ago, some Canadian kid downloaded a bunch of unlisted documents² from a government website, and so the police went to his house and took away all his computers. “Bah, that’s absurd!”, you say. “Just another handful of bureaucrats that don’t understand technology”, you say. And as a fan of computers myself, I’d have to agree. It’s obviously the web developer’s fault that the documents were accessible without needing to log in, and it’s the web developer’s fault that even a teenager could trivially guess their URLs. But I also realize that, as computer programmers, it’s our natural instinct to focus only on the technical side of an issue. After all, how many countless months have we spent working on access control and security for computer systems? It’s easy for us to see this as a technical failure, to indemnify the kid, and to place the blame entirely on whoever created the website. But alas, the police don’t arrest bad programmers for making mistakes³.

Plenty of people scrape all kinds of data from websites, sometimes in ways that don’t align with the web developer’s intentions. Should they all expect the police to raid their houses too? Well, there are actually several legitimate reasons why this kind of activity could be considered immoral or even illegal (copyright infringement, violating terms of service, or consuming a disproportionate amount of computer resources). But a lot of laws and user agreements also use phrases like “unauthorized use of computers”, and with such vague language, even something as innocent as downloading publicly available files could be classified as a violation. I mean, sure. It’d be difficult to prove that Canada kid had any kind of criminal intent. But it’s pretty clear that nobody authorized him, explicitly or implicitly (by way of hyperlinks), to download all of those files. Isn’t that “unauthorized use”?

Laws are complicated. I’m no lawyer. I don’t even understand most of the regulations that apply to computers and the internet in the United States. But recently, I’ve come to realize that there’s a fundamental disconnect between laws and morality. When people don’t understand their laws, they tend to just assume that the law basically says “do the right thing”. And if you have even a shred of self respect, then you’ll probably reply “I’m doing that already!”. Naturally, we can’t be expected to know all the particulars of what laws do and don’t allow us to do. So, some amount of intuition is required. Fortunately, nobody bothers to prosecute minor infractions, and people tend to talk a lot about unintuitive or immoral laws (marriage, abortion, weed, etc), so they quickly become common knowledge. But I’d argue that laws aren’t actually an approximation of morality. It’d be more accurate to say that laws are an approximation of maximum utility (much like everything else, if you believe utilitarianism).

There are some laws that exist, not because they’re moral truths, but because people are just generally happier when everybody obeys them. Is it immoral to suggest that people are predisposed to certain careers on the basis of their protected classes? Hard to say, but that’s beside the point. We don’t permit those ideas in public discourse, because people are, as a whole, happier and more productive when we don’t talk about those things.

Laws also aren’t an approximation of fairness, but are only fair to the extent that perceived fairness contributes to overall utility. What makes a fair law? That it benefits everyone equally? Obviously, a fair healthcare system should benefit really sick people more than it benefits people who just want antibiotics for their cold. Maybe a fair law should apply equally to everyone, regardless of their protected classes. But under that definition, regressive income tax and compulsory sterilization would be classified as fair laws. Should wealth and fitness be protected classes? What about age? There are plenty of laws that apply only to children⁴, but then again, maybe that’s why they’re always complaining that the rules are unfair.

I’ve learned to reduce my trust in rules, and I’ve started to distinguish what’s right from what’s allowable. You could argue that, in some sense, utilitarianism is the ultimate form of morality. But there’s a bunch of pitfalls in that direction, so I’ll stop writing now and go to sleep.

The coast near Pacifica, California.

Or so we’d like to think. I feel like most of the time, dash cam videos unfairly favor the cammer. Unlike humans, dash cams can’t turn their head, have a narrower field of vision, and usually only show what’s in front of the vehicle. Actual drivers are expected to be more aware of their surroundings than what a dash cam allows. ↩︎
Even though the documents were unlisted, they had predictable numeric URLs, and so it was trivial to guess the document addresses and download them all. ↩︎
They blame architects when bridges collapse. Should computer programs that are relevant to public safety be held to the same standard? ↩︎
I think that it’s impossible to believe in fairness unless you also believe in the idea of the individual. Are children individuals? What about conjoined twins? ↩︎

Computer programming

I think I like being a computer programmer. Really. It’s not just something I say because it sounds good in job interviews. I’ve written computer programs almost every day for the last five years. I spend so much time writing programs that I can’t imagine myself not being a computer programmer. Like, could you enjoy food without knowing how to cook? Or go to a concert, having never seen a piece of sheet music? Yet plenty of people use computers while not also knowing how to reprogram them. That’s completely foreign to me and makes me feel uncomfortable. Fortunately, watching uncomfortable things also happens to be my favorite pastime.

In fact, I spend a lot of time just watching regular people use their computers. If we’re on the bus and I’m staring over your shoulder at your phone, I promise it’s not because I have any interest in reading your text messages. I actually just want to know what app you’re using to access your files at home. After all, you probably have just as many leasing contract PDFs and scanned receipts as I do. Yet, you somehow manage to remotely access yours without any servers or programmable networking equipment in your apartment.

My own digital footprint has gotten bigger and more complicated over the years. Right now, the most important bits are split between my computers at home and a small handful of cloud services¹. At home, I actually use two separate computers: my MacBook Pro and a small black and silver Intel NUC that sits mounted on my white plastic Christmas tree. The NUC runs Ubuntu and is responsible for building all the personal software I write, including Hubnext which runs this website. I also use it as a staging area for backups and large uploads to the cloud that run overnight (thanks U.S. residential ISPs). I use the MacBook Pro for everything else: web browsing, media consumption, and programming. It’s become fairly easy to write code on one machine and run it on the other, thanks to a collection of programs I slowly built up. Maybe I’ll write about those one day.

At some point after starting my full time job, I noticed myself spending less and less time programming at home. It soon became apparent that I was never going to be able to spend as much time on my personal infrastructure as I wanted to. In college, I ran disaster recovery drills every few months or so to test my backup restore and bootstrapping procedures. I could work on my personal tools and processes for as long as I wanted, only stopping when I ran out of ideas. Unfortunately, I no longer have unlimited amounts of free time. I find myself actually estimating the required effort for different tasks (can you believe it?), in order to weigh them against other priorities. My to-do list for technical infrastructure continues to grow without bound, so realistically, I know that I’ll never be able to complete most of those tasks. That makes me sad. But perhaps what’s even worse is that I’ve lost track of what I liked about computer programming in the first place.

The shore near Pacifica Municipal Pier.

Lately, I’ve been spending an inordinate amount of time working on Chrome extensions. There are a handful that I use at home and others that I use at work. But unlike most extensions, I don’t publish these anywhere. They’re basically user scripts that I use to customize websites to my liking. I have one that adds keyboard shortcuts to the Spotify web client. Another one calculates the like to dislike ratio and displays it as a percentage on YouTube videos. Lots of them just hide annoying UI elements or improve the CSS of poorly designed websites. Web browsers introduced all sorts of amazing technologies into mainstream use: sandboxes for untrusted code, modern ubiquitous secure transport, cross-platform UI kits, etc. But perhaps the most overlooked is just how easy they’ve made it for anybody to reverse engineer websites through the DOM². Traditional reverse engineering is a rarefied talent and increasingly irrelevant for anybody who isn’t a security researcher. But browser extensions are approachable, far more useful, and also completely free, unlike a lot of commercial RE tools.
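The like-to-dislike percentage mentioned above comes down to one line of arithmetic. Here it is as a small Python sketch, purely for illustration: the real extension runs as JavaScript in the browser and reads the counts from the page, and `like_percentage` is a hypothetical name. It assumes "ratio as a percentage" means likes as a share of all votes.

```python
from typing import Optional


def like_percentage(likes: int, dislikes: int) -> Optional[float]:
    """Likes as a percentage of all votes, or None when nobody has voted."""
    total = likes + dislikes
    if total == 0:
        return None  # avoid dividing by zero on brand-new videos
    return round(100 * likes / total, 1)


if __name__ == "__main__":
    print(like_percentage(90, 10))  # 90.0
    print(like_percentage(1, 2))    # 33.3
    print(like_percentage(0, 0))    # None
```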

When I work on browser extensions, I don’t need a to-do list or any particular goal. I usually write extensions just to scratch an itch. I’ll notice some UI quirk, open developer tools, hop over to Atom, fix it, and get right back to browsing with my new modification in place. Transitioning smoothly between web browsing and extension programming is one of the most pleasant experiences in computer programming: the development cycle is quick, the tools are first class, and the payoff is highly visible and immediate. It made me remember that computer programming could itself be enjoyed as a pastime, rather than a means to an end.

The bluffs at Mori Point.

I’m kind of tired of sitting down every couple months to write about some supposedly fresh idea or realization I’ve had. At some point, I’ll inevitably run out of things to say. Until that happens, I guess I’ll just keep rambling.

The other major recent development in my personal programming work is that I’ve started merging all my code into a monorepo. The repo is just named “R”, because one advantage of working for yourself is that you don’t have to worry about naming anything the right way³. It started out with just my website stuff, but since then I’ve added my security camera software, my home router configuration, my data backup scripts, a bunch of docs, lots of experimental stuff, and a collection of tools to manage the repo itself. Sharing a single repo for all my projects comes with the usual benefits: I can reuse code between projects. It’s easier to get started on something new. When I work on the repo, I feel like I’m working on it as a whole, rather than any individual project. Plus, code in the repo feels far more permanent and easy to manage than a bunch of isolated projects. It’s admittedly become a bit harder to release code to open source, but given all the troublesome corporate licensing nonsense involved, I’m probably not planning to do that anyway.

I feel like I’m just now finishing a year-long transition process from spending most of my time at school to spending most of my time at work. It took a long time before I really understood what having a job would mean for my personal programming. My hobby had been stolen away by my job, and to combat that feeling, I dedicated lots of extra time to working on personal projects outside of work. But instead of finding a balance between hobbies and work, the extra work just left me burnt out. That situation has improved somewhat, partially because I’ve been focusing on projects that let me reduce how much time I spend maintaining personal infrastructure, but also because I’ve accomplished all the important tasks and I’ve learned to treat the remainder as a pastime instead of as chores. But there’s still room for improvement. So, if you’re wondering what I’m working on cooped up in my apartment every weekend, this is it.

At some point, I should probably write an actual guide about getting started with computer programming. People keep emailing me about how to get started, even though I’m not a teacher (at least not anymore) and I don’t even regularly interact with anyone who’s just starting to learn programming. And I should probably do it before I start to forget what it’s like to enjoy programming altogether.

Things are supposedly set up with enough redundancy to lose any one piece, but that’s probably not entirely true. I’ve written about this before. In any case, one of the greatest perks of working at the big G is the premium customer support that’s implicitly offered to all employees, especially those in engineering. If things really went south, I guess I could rely on that. ↩︎
I really appreciate knowing just enough JavaScript to customize web applications with extensions. It’s one of the reasons I gave up vim for Atom. ↩︎
Lately, I’ve been fond of naming things after single letters or phonetic spellings of letters. I have other things named A, B, c2, Ef, gee, Jay, K, L, Am, and X. ↩︎