Computer programming

I think I like being a computer programmer. Really. It’s not just something I say because it sounds good in job interviews. I’ve written computer programs almost every day for the last five years. I spend so much time writing programs that I can’t imagine myself not being a computer programmer. Like, could you enjoy food without knowing how to cook? Or go to a concert, having never seen a piece of sheet music? Yet plenty of people use computers while not also knowing how to reprogram them. That’s completely foreign to me and makes me feel uncomfortable. Fortunately, watching uncomfortable things also happens to be my favorite pastime.

In fact, I spend a lot of time just watching regular people use their computers. If we’re on the bus and I’m staring over your shoulder at your phone, I promise it’s not because I have any interest in reading your text messages. I actually just want to know what app you’re using to access your files at home. After all, you probably have just as many leasing contract PDFs and scanned receipts as I do. Yet, you somehow manage to remotely access yours without any servers or programmable networking equipment in your apartment.

My own digital footprint has gotten bigger and more complicated over the years. Right now, the most important bits are split between my computers at home and a small handful of cloud services[1]. At home, I actually use two separate computers: my MacBook Pro and a small black and silver Intel NUC that sits mounted on my white plastic Christmas tree. The NUC runs Ubuntu and is responsible for building all the personal software I write, including Hubnext which runs this website. I also use it as a staging area for backups and large uploads to the cloud that run overnight (thanks U.S. residential ISPs). I use the MacBook Pro for everything else: web browsing, media consumption, and programming. It’s become fairly easy to write code on one machine and run it on the other, thanks to a collection of programs I slowly built up. Maybe I’ll write about those one day.

At some point after starting my full time job, I noticed myself spending less and less time programming at home. It soon became apparent that I was never going to be able to spend as much time on my personal infrastructure as I wanted to. In college, I ran disaster recovery drills every few months or so to test my backup restore and bootstrapping procedures. I could work on my personal tools and processes for as long as I wanted, only stopping when I ran out of ideas. Unfortunately, I no longer have unlimited amounts of free time. I find myself actually estimating the required effort for different tasks (can you believe it?), in order to weigh them against other priorities. My to-do list for technical infrastructure continues to grow without bound, so realistically, I know that I’ll never be able to complete most of those tasks. That makes me sad. But perhaps what’s even worse is that I’ve lost track of what I liked about computer programming in the first place.

The shore near Pacifica Municipal Pier.

Lately, I’ve been spending an inordinate amount of time working on Chrome extensions. There are a handful that I use at home and others that I use at work. But unlike most extensions, I don’t publish these anywhere. They’re basically user scripts that I use to customize websites to my liking. I have one that adds keyboard shortcuts to the Spotify web client. Another one calculates the like-to-dislike ratio and displays it as a percentage on YouTube videos. Lots of them just hide annoying UI elements or improve the CSS of poorly designed websites. Web browsers introduced all sorts of amazing technologies into mainstream use: sandboxes for untrusted code, modern ubiquitous secure transport, cross-platform UI kits, etc. But perhaps the most overlooked is just how easy they’ve made it for anybody to reverse engineer websites through the DOM[2]. Traditional reverse engineering is a rarefied talent and increasingly irrelevant for anybody who isn’t a security researcher. But browser extensions are approachable, far more useful, and also completely free, unlike a lot of commercial RE tools.

When I work on browser extensions, I don’t need a to-do list or any particular goal. I usually write extensions just to scratch an itch. I’ll notice some UI quirk, open developer tools, hop over to Atom, fix it, and get right back to browsing with my new modification in place. Transitioning smoothly between web browsing and extension programming is one of the most pleasant experiences in computer programming: the development cycle is quick, the tools are first class, and the payoff is highly visible and immediate. It made me remember that computer programming could itself be enjoyed as a pastime, rather than a means to an end.

The bluffs at Mori Point.

I’m kind of tired of sitting down every couple months to write about some supposedly fresh idea or realization I’ve had. At some point, I’ll inevitably run out of things to say. Until that happens, I guess I’ll just keep rambling.

The other major recent development in my personal programming work is that I’ve started merging all my code into a monorepo. The repo is just named “R”, because one advantage of working for yourself is that you don’t have to worry about naming anything the right way[3]. It started out with just my website stuff, but since then I’ve added my security camera software, my home router configuration, my data backup scripts, a bunch of docs, lots of experimental stuff, and a collection of tools to manage the repo itself. Sharing a single repo for all my projects comes with the usual benefits: I can reuse code between projects. It’s easier to get started on something new. When I work on the repo, I feel like I’m working on it as a whole, rather than any individual project. Plus, code in the repo feels far more permanent and easy to manage than a bunch of isolated projects. It’s admittedly become a bit harder to release code to open source, but given all the troublesome corporate licensing nonsense involved, I’m probably not planning to do that anyway.

I feel like I’m just now finishing a year-long transition process from spending most of my time at school to spending most of my time at work. It took a long time before I really understood what having a job would mean for my personal programming. My hobby had been stolen away by my job, and to combat that feeling, I dedicated lots of extra time to working on personal projects outside of work. But instead of finding a balance between hobbies and work, the extra work just left me burnt out. That situation has improved somewhat, partially because I’ve been focusing on projects that let me reduce how much time I spend maintaining personal infrastructure, but also because I’ve accomplished all the important tasks and I’ve learned to treat the remainder as a pastime instead of as chores. But there’s still room for improvement. So, if you’re wondering what I’m working on cooped up in my apartment every weekend, this is it.

At some point, I should probably write an actual guide about getting started with computer programming. People keep emailing me about how to get started, even though I’m not a teacher (at least not anymore) and I don’t even regularly interact with anyone who’s just starting to learn programming. And I should probably do it before I start to forget what it’s like to enjoy programming altogether.

  1. Things are supposedly set up with enough redundancy to lose any one piece, but that’s probably not entirely true. I’ve written about this before. In any case, one of the greatest perks of working at the big G is the premium customer support that’s implicitly offered to all employees, especially those in engineering. If things really went south, I guess I could rely on that.
  2. I really appreciate knowing just enough JavaScript to customize web applications with extensions. It’s one of the reasons I gave up vim for Atom.
  3. Lately, I’ve been fond of naming things after single letters or phonetic spellings of letters. I have other things named A, B, c2, Ef, gee, Jay, K, L, Am, and X.

Net neutrality

I don’t know a single person who actually supports the FCC’s recent proposal to deregulate broadband internet access by reclassifying it as an information service. However, I also never did meet a single person who unironically supported last year’s President-elect, but that doesn’t seem to have made any difference. There were certainly a lot of memes in last year’s U.S. election cycle, but I remember first seeing memes about net neutrality when I was in high school. That was before all the Comcast-Verizon-YouTube-Netflix hubbub and before net neutrality was at the forefront of anyone’s mind. So naturally, net neutrality got tossed out and ignored along with other shocking but “purely theoretical” future problems like climate change and creepy invasive advertising[1]. But I’ve seen some of those exact same net neutrality infographics resurface in the last couple weeks, and in retrospect, I realize that many of them were clearly made by people who weren’t network engineers and weren’t at all familiar with how the business of being an ISP actually works. And so, the materials used in net neutrality activism were, and still are, sometimes inaccurate, misleading, or highly editorialized to scare their audience into paying attention[2].

Now, just to be clear: Do I think the recent FCC proposals are in the best interest of the American public? Definitely not. And do I think that the general population is now merely an audience to political theater orchestrated by, or on behalf of, huge American residential ISPs? I think it’s probable. After all, why bother proposing something so overwhelmingly unpopular unless it’s already guaranteed to pass? Nevertheless, I feel like spreading misinformation about the realities of net neutrality is only hurting genuine efforts to preserve it.

Before I begin, I should mention that you can read the full 210-page proposal yourself if you want the full picture. I hate that news websites almost never link to primary sources for the events they’re covering, and yet nobody really holds them accountable to do so. Anyway, I’m no lawyer, but I feel like in general, the legal mumbo-jumbo behind controversial proposals is usually far more subtle and less controversial-sounding than the news would have you believe. That’s very much the case here. And for what it’s worth, I think this particular report is very approachable for normal folks and does do a good job of explaining context and justifying its ideas.

Brush plant in San Francisco.

If you haven’t read the proposal, the simple version is that in mid-December 2017, the FCC will vote on whether to undo the net neutrality rules they set in 2015 and replace them with the requirement that ISPs must publish websites that explain how their networks work. The proposal goes on to explain how the existing rules stifle innovation by discouraging investment in heavily-regulated network infrastructure. In practice, the proposed changes would eliminate the “Title II” regulations that prevent ISPs from throttling or blocking legal content and delegate the policing of ISPs to a mix of the FTC, antitrust laws, and market forces.

It’s hard to predict the practical implications of these changes, but based on past examples, many people point to throttling of internet-based services that compete with ISPs (like VoIP) and high-bandwidth applications (like video streaming and BitTorrent), along with increased investment in ISP-owned alternatives to internet services for things like video and music streaming. As a result, a lot of net neutrality activism revolves around purposefully slowing down websites or presenting hypothetical internet subscription plans that charge extra fees for access to different kinds of websites (news, gaming, social networking, video streaming) in order to illustrate the consequences of a world without net neutrality. But in reality, neither of these scenarios is very realistic, and yet both of them already exist in some form today.

You probably don’t believe me when I say that throttling is unrealistic, and I understand. After all, we’ve already seen ISPs do exactly that to Netflix and YouTube. But first, you should understand a few things about residential ISPs.

Like most networks, residential ISPs are oversubscribed, meaning that the individual endpoints of the network are capable of generating more traffic than the network itself can handle. They cope with oversubscription by selling subscription plans with maximum throughput rates, measured in megabits per second[3]. Upstream traffic (sent from your home to the internet) is usually throttled down to these maximum rates by configuration on your home modem. But downstream traffic (sent from the internet to your home) is usually throttled at the ISP itself by delaying or dropping traffic that exceeds this maximum configured limit. So you see, the very act of throttling or blocking traffic isn’t a concern for net neutrality. In fact, most net neutrality regulations have exemptions that allow this kind of throttling when it’s for purely technical reasons, because some amount of throttling is an essential part of running a healthy network.
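To make "dropping traffic that exceeds the configured limit" a bit more concrete, here is a minimal sketch of a token bucket, which is one common way to enforce a maximum rate. I'm not claiming any particular ISP does it this way. The class name, the 10 Mbit/s plan, and the burst size are all assumptions made up for illustration.

```python
class TokenBucket:
    """Rate limiter: a packet is forwarded only if enough byte tokens remain."""

    def __init__(self, rate_bytes_per_sec, burst_bytes):
        self.rate = rate_bytes_per_sec  # sustained rate of the (hypothetical) plan
        self.burst = burst_bytes        # how far a subscriber may briefly exceed it
        self.tokens = burst_bytes       # start full so an initial burst is allowed
        self.last_time = 0.0

    def allow(self, packet_bytes, now):
        """Return True to forward the packet, False to drop it."""
        elapsed = now - self.last_time
        self.last_time = now
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        if packet_bytes <= self.tokens:
            self.tokens -= packet_bytes
            return True
        return False


def offer_traffic(bucket, packet_bytes, packets_per_sec, seconds):
    """Offer a steady stream of packets and return how many bytes got through."""
    forwarded = 0
    for i in range(packets_per_sec * seconds):
        now = i / packets_per_sec
        if bucket.allow(packet_bytes, now):
            forwarded += packet_bytes
    return forwarded


# Assumed plan: 10 Mbit/s (1,250,000 bytes/s) with a 15,000-byte burst.
heavy = offer_traffic(TokenBucket(1_250_000, 15_000), 1500, 2000, 10)
light = offer_traffic(TokenBucket(1_250_000, 15_000), 1500, 100, 10)
```

In this sketch, the heavy sender offers about 24 Mbit/s and only gets roughly the plan's rate through, while the light sender stays under the limit and loses nothing. A real shaper would more likely queue and delay packets before dropping them, which this sketch leaves out.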

Furthermore, all ISPs already discriminate (e.g. apply different throttling or blocking rules) against certain types of traffic by way of classification. At the most basic level, certain types of packets (like zero-length ACKs) and connections (like “mice flows”) are given priority over others (like full sized data packets for video streaming) as part of a technique known as quality of service (QoS). Many ISPs also block or throttle traffic on certain well-known ports, such as port 25 for email and port 9100 for printing, because they’re commonly abused by malware and there’s usually no legitimate reason to route such traffic from residential networks onto the open internet. Furthermore, certain kinds of traffic can be delivered more quickly and reliably simply because of networking arrangements made between your ISP and content providers (like Netflix Open Connect). In other cases, your ISP may be stuck in a disadvantageous peering agreement, in which it has to pay another ISP extra money to send or receive traffic on certain network links, in addition to just the costs of maintaining the network itself.
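As a rough illustration of what classification means here, the following sketch sorts packets into "drop", "high", and "bulk" using only the examples from the paragraph above. The field names, the class labels, and the 100,000-byte cutoff for a mice flow are assumptions for illustration, not how any real router is configured.

```python
from dataclasses import dataclass

BLOCKED_DST_PORTS = {25, 9100}  # email and printing ports mentioned in the text
MICE_FLOW_BYTES = 100_000       # hypothetical cutoff for a short "mice" flow


@dataclass(frozen=True)
class Packet:
    dst_port: int
    payload_bytes: int
    is_ack: bool
    flow_bytes_so_far: int  # bytes already seen on this connection


def classify(pkt: Packet) -> str:
    """Return a made-up traffic class for one packet."""
    if pkt.dst_port in BLOCKED_DST_PORTS:
        return "drop"  # commonly abused ports get blocked outright
    if pkt.is_ack and pkt.payload_bytes == 0:
        return "high"  # zero-length ACKs keep other connections moving
    if pkt.flow_bytes_so_far < MICE_FLOW_BYTES:
        return "high"  # short flows are cheap to prioritize
    return "bulk"      # e.g. full-sized packets in a long video stream
```

Note that nothing in this sketch looks at who sent the traffic or what business they are in. The rules depend only on port numbers, packet sizes, and flow lengths.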

People generally agree that none of these count as net neutrality violations, because they’re practical aspects of running a real network and, in many cases, they justify themselves by providing real value to end users. It’s difficult to explain concisely what divides these kinds of blocking and throttling from the scandalous net neutrality kind. Supposedly, net neutrality violations typically involve blocking or throttling for “business” reasons, but “reducing network utilization by blocking our competitors” could arguably have technical benefits as well. In practice, most people call it a net neutrality violation when it’s bad for customers and call it “business as usual” when it’s either beneficial for customers or represents the way things have always worked. In any case, the elimination of all blocking and throttling is neither practical nor desirable. When discussing net neutrality, it’s important to acknowledge that many kinds of blocking and throttling are legitimate and to (try to) focus on the kinds that aren’t.

Leaves in Golden Gate Park.

Websites that purposefully slow themselves down paint a wildly inaccurate picture of a future without net neutrality, especially when they do so without justification. ISPs gain nothing from indiscriminate throttling, other than saving a couple bucks on power and networking equipment. Plus, ISPs can (and do) get the same usage reduction benefits by imposing monthly bandwidth quotas, which have nothing at all to do with net neutrality. I think a more likely outcome is that ISPs will start pushing for the adoption of new heterogeneous internet and TV combo subscription plans. These plans will impose monthly bandwidth quotas on all internet traffic except for a small list of partner content providers, which will complement a larger emphasis on ISP-provided TV and video on demand services. After all, usage of traditional notebook and desktop computers is on the decline in favor of post-PC era devices like smartphones and tablets. A number of U.S. households would probably be perfectly happy to trade boring unmetered internet for a 10GB/month residential broadband internet plan combined with a TV subscription and unlimited access to the ISP’s first-party video on demand service along with a handful of other top content providers. Such a plan could eliminate the need for third-party video streaming subscriptions like Netflix, thereby providing more content for less money. Naturally, a monthly bandwidth quota would make it difficult for non-partner video streaming services to compete effectively, but fuck them, right?
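For what it's worth, the accounting behind such a plan is trivial, which is part of why I find it plausible. Here is a minimal sketch, assuming the ISP can already attribute traffic to a provider. The partner names are hypothetical, and the 10GB quota is the figure from the paragraph above.

```python
# Hypothetical partner list; traffic to these providers is not metered.
PARTNERS = {"isp-video.example", "partner-music.example"}
QUOTA_BYTES = 10 * 10**9  # the 10GB/month plan from the text


def metered_bytes(usage):
    """Sum monthly usage, skipping partner providers.

    `usage` is a list of (provider, bytes) pairs.
    """
    return sum(n for provider, n in usage if provider not in PARTNERS)


def over_quota(usage):
    return metered_bytes(usage) > QUOTA_BYTES


month = [
    ("isp-video.example", 200 * 10**9),    # unlimited, because it's a partner
    ("indie-stream.example", 12 * 10**9),  # counts against the quota
]
```

With these made-up numbers, 200GB of partner video costs the subscriber nothing, while 12GB from a non-partner service is enough to blow through the quota.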

I should point out that no matter what happens to net neutrality, we’ll still have antitrust laws (the proposal mentions this explicitly) and an aggressive DoJ to chase down offenders. Most ISPs operate as a local monopoly or duopoly. So, using their monopoly position in the internet access market to hinder competition in internet content services sounds like an antitrust problem to me. But it’s possible that the FCC’s reclassification of internet access as an “information service” may change this.

The other example commonly used by net neutrality activists is the a-la-carte internet subscription. In this model, different categories of content (news, gaming, social networking, video streaming) each require an extra fee or service tier, sort of like how HBO and Showtime packages work for TV today. For this to work, ISPs need to be able to block access to content that subscribers haven’t paid for. In the past, this might have been implemented with a combination of protocol-based traffic classification (like RTMP for video streaming), destination-based traffic classification (well known IP ranges used by online games), and plain old traffic sniffing (reconstructing plaintext HTTP flows). But such a design would be completely infeasible from a technical standpoint in today’s internet.

First, nearly all modern internet applications use some variant of HTTP to carry the bulk of their internet traffic. Even applications that traditionally featured custom designed protocols (video conferencing, gaming, and media content delivery) now almost exclusively use HTTP or HTTP-based protocols (HLS, WebSockets, WebRTC, REST, gRPC, etc). This is largely because HTTP is the only protocol that has both ubiquitous compatibility with corporate firewalls and widespread infrastructural support in terms of load balancing, caching, and instrumentation. As a result, it’s far more difficult today to categorize internet traffic with any degree of certainty based on the protocol alone.

Additionally, most of the aforementioned HTTP traffic is encrypted (newer variants like SPDY and HTTP/2 virtually require encryption to work). For the a-la-carte plan to work, you need to first categorize all internet traffic. We can get some hints from SNI and DNS, but that’s not always enough and also easily subverted.
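Here is a minimal sketch of what categorizing by SNI hints might look like, and why it falls short. The hostnames and categories are invented for illustration, and a real classifier would need a vastly larger (and constantly outdated) table.

```python
from typing import Optional

# Hypothetical hostname suffixes that an ISP might map to plan categories.
CATEGORY_BY_SUFFIX = {
    "video.example": "video streaming",
    "games.example": "gaming",
    "social.example": "social networking",
}


def categorize(sni: Optional[str]) -> str:
    """Guess a category from the hostname a client sent in its TLS handshake."""
    if not sni:
        return "unknown"  # no SNI was sent, so there is nothing to go on
    host = sni.lower().rstrip(".")
    for suffix, category in CATEGORY_BY_SUFFIX.items():
        if host == suffix or host.endswith("." + suffix):
            return category
    return "unknown"  # e.g. a generic cloud load balancer hostname
```

Anything that connects by IP address, omits the hostname, or sits behind a generic cloud hostname lands in "unknown", and a client that wants to dodge the classifier can usually arrange for exactly that.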

Internet applications with well-known IP ranges are also a thing of the past. Colocation has given way to cloud hosting, and it’s virtually impossible to tell what’s inside yet another encrypted HTTPS stream to some AWS or GCP load balancer.

Essentially, there can’t truly exist a “gaming” internet package without the cooperation of every online game developer on the planet.

A-la-carte models work well with TV subscriptions because there are far fewer parties involved. If ISPs ever turn their attention to gamers, it’s most likely that they’ll partner with a few large “game networks” that can handle both the business transactions[4] and the technical aspects of identifying and classifying their game traffic on behalf of residential ISPs. So, you probably won’t be buying an “unlimited gaming” internet package anytime soon. Instead, you’ll be buying unlimited access to just Xbox Live and PSN. From that point on, indie game developers will simply have to fit in your monthly 10GB quota for “everything else”.

Reeds near the San Francisco Bay.

Net neutrality activists say that net neutrality will keep the internet free and open. But the very idea of a “free and open” internet is sort of a myth. To many people, the ideal internet is a globally distributed system of connected devices, where every device can communicate freely and equally with every other device. In a more practical sense, virtually anybody should have the power to publish content on the internet, and virtually anybody should be able to consume it. No entity should have control over the entire network, and no connection to the internet should be more special than any other, because being connected to the internet should mean that you’re connected to all of it.

In reality, people have stopped using the internet to communicate directly with one another. Instead, most internet communication today is mediated by the large global internet corporations that run our social networks, our instant messaging apps, our blogs, our news, and our media streaming sites. Sure, you’re “free” to post whatever garbage you’d like to your Tumblr or Facebook page, but only as long as those companies allow you to do so.

But the other half of a “free and open” internet means that anybody can start their own Facebook competitor and give control back to the people, right? Well, if you wanted to create your own internet company today, your best bet would be to piggyback off of an existing internet hosting provider like AWS or GCP, because they’ve already taken care of the prerequisites for creating globally accessible (except China) internet services. At the physical level, the internet is owned by the corporations, governments, and organizations that operate the networking infrastructure and endpoints that constitute the internet. The internet only works because those operators are incentivized to bury thousands of miles of optical fiber beneath the oceans and hang great big antennas on towers in the sky. It only works because, at some point in the past, those operators made agreements to exchange traffic with each other[5] so that any device on one network could send data that would eventually make it to any other network. There’s nothing inherent about the internet that ensures the network is fully connected (in fact, this fully-connectedness breaks all the time during BGP leaks). It’s the responsibility of each individual content publisher to make enough peering arrangements to ensure they’re reachable from anyone who cares to reach them.

Not all internet connections are equal. Many mobile networks and residential ISPs use technologies like CGNAT and IPv6 tunneling to cope with IPv4 address exhaustion. But as a result, devices on those networks must initiate all of their traffic and cannot act as servers, which must be able to accept traffic initiated by other devices. In practice, this isn’t an issue, because mobile devices and (most) residential devices initiate all of their traffic anyway. But it does mean that such devices are effectively second class citizens on the internet.

It’s also an increasingly common practice to classify internet connections by country. Having a privileged IP address, like an address assigned to a residential ISP in the United States, means having greater access and trust when it comes to country-restricted media (like Netflix) and trust-based filtering (like firewalls), compared to an address assigned to a third world country or an address belonging to a shared pool used by a public cloud provider. This is especially the case with email spam filtering, which usually involves extensive IP blacklists in addition to rejecting all traffic from residential ISPs and shared public cloud IP pools.

Finally, let’s not forget those countries whose governments choose to filter or turn off the internet entirely on some occasions. But they have bigger things to worry about than net neutrality anyway.

So, is the internet doomed? Not quite. It’s already well known that last mile ISPs suck and IPv4 exhaustion sucks and IP-based filtering sucks. But as consumers, we still need strong government regulation on residential internet service providers, just like we need regulation on any monopolistic market. People often say that the internet is an artifact, not an invention. So we all share a responsibility to make it better, but we should try to do so without idealistic platitudes and misleading slogans.

  1. I’m kidding, of course.
  2. This seems to be quite common with any kind of activism where people try to get involved via the internet.
  3. This is not the only way to sell networking capacity. Many ISPs charge based on total data transfer and allow you to send/receive as fast as technically possible. This is especially common with cellular ISPs and cloud hosting.
  4. “Business transactions”
  5. Read all about it on

Feel bad

The hardest truth I had to learn growing up is that not everyone is interested in the truth, and maybe that’s okay. I’m not talking about material truths like the mounting evidence for climate change or the incredible importance of space exploration. There’s nothing to debate about material truths, and while some people dispute them, you can hardly argue that ignoring them is okay. I’m talking about truths about people. I’m talking about whether our laws are fair to the poor. I’m talking about whether certain races or genders or social classes are predisposed to be better or worse at their jobs. I’m talking about whether regulated capitalism is the best we can ask for, whether crime is a racial issue, whether homosexuality is a mental illness, whether the magic sky man really does watch everything you do, whether abortion is baby murder, whether Steve Jobs would like the new iPhone, and whether we’re cold-hearted monsters for selling guns to people who can’t make up their minds about any of these issues in a country where people die from preventable gun violence every day. I want to know why these questions cause so many problems, and I want to know why, in many cases, it seems like the best thing to do is to ignore them.

Here’s some truth for you: I don’t particularly mind if nobody reads this blog, and yet I know that some people do. I know there’s spillover traffic from the number one most popular page on this site, and although I haven’t checked Google Analytics in a few months, I know that the spillover is enough so that every few hours, some poor sucker winds up on one of these blog posts and miraculously reads the whole thing. I could just as happily write to a quiet audience as I could to nobody at all, because ultimately I just like the idea of publishing my thoughts to the public record. Here’s another truth: I’m not wearing pants (I almost never wear pants at home). I finished my dinner and finished washing the dishes and wiping down the counter and now I’m sitting in the dark with no pants on, wondering whether it’s time for a new laptop. More truths: the deadlines at work are getting to my head, even though I won’t admit it. I don’t exactly like my work, but I do like my workplace, and the thought of leaving is too frightening to even consider right now. I feel like my list of things to do has been growing endlessly ever since I graduated from college, and I try not to think of what things will be like in a year, because I’m afraid I won’t like the answer.

When I was around 15 years old, I started getting all kinds of ideas in my head, and the biggest one was that the internet could teach me all kinds of truths that nobody would teach me before. At that moment, I felt like my brain had unlocked an extra set of cores[1]. I thought that, given enough time, a mind in isolation could discover all the grand truths of the universe. I wondered if that mind could be my mind, and I couldn’t wait to get started.

Some of the discovery was just disgusting stuff. I went on the internet searching for the nastiest, most shocking images I could find, because I thought they would help me better understand how people’s brains worked. After all, images were just a bunch of colorful pixels. I thought it was a form of intellectual impurity to think that some images, and the ideas they depicted, should be off-limits to a curious mind. So, forcible desensitization was the only way.

But I actually spent most of my time reading about science. I learned as much physics as Wikipedia and Flash web applets could teach a teenager…which turned out to be mostly unhelpful when the time came to learn actual physics. I learned about aviation accidents and nuclear proliferation and ancient humans. I learned about the history of American space exploration and how modern religions were developed. I learned that people in the United States lived quite differently than most of the world, in more ways than what wealth alone could explain. But most importantly, I learned (what I thought were) several uncomfortable truths about people, particularly truths that many people denied[2].

For many years, I went on carrying these ideas quietly in my head. They made me cynical and disdainful toward other people who didn’t think the way I did, because in my head, they weren’t just different. They were wrong. I can always identify other people that had the same experience growing up (or at least I imagine that I can). They all have the same dead fish look in their eye that says “you’re too afraid to face the truth, but I’m not”. And frankly, I very much still miss those days when all I needed was a computer and a deeply rooted misbelief to disprove.

I’m surprised that I ever grew out of that phase. I suppose everyone grows out of everything, sooner or later. But it’s not that my beliefs have changed or weakened. Rather, I realized that most of my cynicism was based on the false premise that nothing was more important than the truth.

It’s hard to look past a truth-centric view of the world when people spend so much time just trying to prove other people wrong. But as a matter of fact, the ultimate goal (the “objective function”) for a human isn’t how right you are[3]. People want simpler things. They want to feel good and they want the reassurance that they’ll continue to feel good in the future. They want to feel accepted and live productive purposeful lives. Finding the truth can help advance these objectives, but it is not itself a goal. At the end of the day, most of us are tired from working a full-time job and can’t be bothered to care so much. After all, if a couple of white lies make lots of people happier, who’s to say that that’s not okay?

Parking lot at Castle Rock State Park, California.

  1. An extra lobe of gray matter?
  2. Like how the government is mind controlling people with cell towers and chemtrails.
  3. Yes, even for people who really enjoy being right.

The serving path of Hubnext

I’ve always liked things straight and direct to the point. But I feel a bit silly writing that here, since most of my blog posts from high school used a bunch of long words for no reason. My writing did eventually become more concise, but it happened around the time I also sort of stopped blogging. I stopped writing altogether for a few years, but then in college, I started doing an entirely different kind of writing. I wrote homework assignments and class announcements for the computer science classes as a teaching assistant. Technical writing is different from blogging, but writing for a student audience also has its own unique challenges. The gist of it is that students don’t like reading, which is naturally at odds with the long detailed project specs that some of our projects require. It’s the instructor’s responsibility to make sure the important details stand out from the boring reference material.

In college, I also started eating lunch and dinner on the go regularly, which is something I had never really done before. It was a combination of factors that made walk-eat meals so enticing. The campus was big, so getting from place to place usually involved a bit of walking. Street level restaurants sold fast food in convenient form factors. Erratic scheduling meant it was unusual to be with another person for more than a couple hours at a time, so sit-down meals didn’t always make sense. I got really good at the different styles of walk-eating, from gnawing on burritos without spilling to balancing take-out boxes between my fingers. Eating on my feet felt like the most distilled honest form of feeding. It didn’t make sense to add so many extra steps to accomplish essentially the same task.

For a long time, the serving path of this website was too complicated. The serving path is all the things directly involved in delivering web pages to visitors, which can be contrasted with auxiliary systems that might be important, but aren’t immediately required for the website to work. I made my first website more than 12 years ago. Lots of things have changed since then, both with me and with the ecosystem of web development in general. So when I set out to reimplement RogerHub, I wanted to eliminate or replace as many parts of the serving path as I could.

Mission beach in San Francisco.

Let’s start with the basics. Most of the underlying infrastructure for RogerHub hasn’t changed since I moved the website from Linode to Google Cloud Platform in January 2016. I use Google’s Cloud DNS and HTTP(S) Load Balancing to initially capture inbound traffic. Google’s layer 7 load balancing provides global anycast, geographic load balancing, TLS termination, health checking, load scaling, and HTTP caching, among other things. It’s the best offering available at its low price point, so there’s little reason to consider alternatives1. However, HTTP(S) Load Balancing did have a 2 hour partial outage last October. I don’t get much traffic in October, but the incident made me start thinking of ways I could mitigate a similar outage in the future.

Behind the load balancer, RogerHub is served by a number of functionally identical backend servers (currently 2). These servers are fully capable of serving user traffic directly, but they currently only accept traffic from Google’s designated load balancer IP ranges. During peak seasons, I usually scale this number up to 3, but the extra server costs me about $50 a month, so I usually prefer to run just 2. These extra servers exist entirely for redundancy, not capacity. A single server could serve a hundred times peak load with no problem2.

Each server is a plain old Ubuntu host running an instance of PostgreSQL and an instance of the Hubnext application. That’s it. There’s no reverse proxy, no memcached, no mail server, and not even a provisioner (Hubnext knows how to install and configure its own dependencies). Hubnext itself can run in multiple modes, most of which are for maintenance and debugging. But my web servers start Hubnext in “application” mode. When Hubnext runs in application mode, it starts an ensemble of concurrent tasks, only one of which is the web server. The others are things like a unique ID allocator, peer to peer data syncing, system health checkers, maintenance daemons, and half a dozen kinds of in-memory caches. Hubnext can renew its own TLS certificate, create backups of its data, and even page me on my iPhone when things go wrong. Before Hubnext, these tasks were handled by haphazard cron jobs and independent services, which were set up and configured by a Puppet provisioner. Keeping all these tasks in a single application has made developing new features a lot simpler.

So why does Hubnext still need PostgreSQL? It’s true that Hubnext could have simply kept track of its own data, along with maintaining the various indexes that make common operations fast. But it’s an awful lot of work and unneeded complexity to implement something that a database already does for free. Of all the components of a traditional website’s architecture, I chose to keep the database, because I think PostgreSQL pulls its weight more than any of the other systems that Hubnext does supplant. That being said, Hubnext intentionally doesn’t use the transactions or replication that PostgreSQL provides (i.e. the parts of a database most sensitive to failure). Instead, Hubnext’s data model is designed to work without multi-statement transactions and Hubnext performs its own application level data replication, which is probably easier to configure and troubleshoot than database replication3.

When a request arrives at a Hubnext server, it gets wrapped in a tracker (for logging and metrics) and then enters the web request router. Instead of matching by request path, the Hubnext request router gives each registered web handler a chance to accept responsibility for a request before it falls through to the next handler. This allows for more complex and free-form routing logic. The web router starts with basic things like an OPTIONS handler and DoS rejection. The vast majority of requests are handled by the Hubnext blob cache and page cache. These handlers keep bounded in-memory caches of popular blobs (images, JavaScript, and CSS) and pre-rendered versions of popular pages. But even on a cache hit, the server still runs some code to process things like cache headers, ETags, and compression.
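The handler chain described above can be sketched roughly as follows. Hubnext is written in Go and its real interfaces aren’t shown in this post, so this is only a Python illustration of the idea: `Request`, `Response`, `Router`, and both handlers are made-up names, and the real router certainly does more.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Request:
    method: str
    path: str


@dataclass
class Response:
    status: int
    body: str = ""


# A handler returns a Response to accept a request, or None to pass on it.
Handler = Callable[[Request], Optional[Response]]


class Router:
    def __init__(self) -> None:
        self.handlers: list[Handler] = []

    def register(self, handler: Handler) -> None:
        self.handlers.append(handler)

    def serve(self, request: Request) -> Response:
        # Offer the request to each handler in registration order.
        for handler in self.handlers:
            response = handler(request)
            if response is not None:
                return response
        # Nobody accepted the request, so give up with a 404.
        return Response(404, "not found")


def options_handler(request: Request) -> Optional[Response]:
    if request.method == "OPTIONS":
        return Response(204)
    return None


def make_page_cache(pages: dict[str, str]) -> Handler:
    # Hypothetical stand-in for a cache of pre-rendered pages.
    def page_cache_handler(request: Request) -> Optional[Response]:
        if request.method == "GET" and request.path in pages:
            return Response(200, pages[request.path])
        return None

    return page_cache_handler


router = Router()
router.register(options_handler)
router.register(make_page_cache({"/": "<h1>home</h1>"}))
```

Because each handler decides for itself whether to accept, the matching logic can look at anything in the request (method, headers, client address) rather than just the path.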

Blobs whose request path starts with /_version_ get long cache expiration times, which instructs Google Cloud CDN to cache them. Overall, about 40% of the requests to RogerHub are served from the CDN without ever reaching Hubnext. You can distinguish these cache hits by their Age header. ETags are generated from the SHA256 hash of the blob’s path and payload. Compression is applied to blobs based on a whitelist of compressible MIME types. The blob cache holds both the original and the gzipped versions of the payload, along with pre-computed ETags and other metadata, like the blob’s modification date.
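Here’s a rough Python sketch of what a blob cache entry along these lines might hold. It is not Hubnext’s code: the MIME list, the NUL separator in the hash input, and every name (`CachedBlob`, `make_etag`, `build_blob`, `select_body`) are assumptions made for illustration.

```python
import gzip
import hashlib
from dataclasses import dataclass
from typing import Optional

# Assumed list; the post doesn't say which MIME types count as compressible.
COMPRESSIBLE_TYPES = {"text/html", "text/css", "application/javascript", "image/svg+xml"}


@dataclass(frozen=True)
class CachedBlob:
    path: str
    mime_type: str
    payload: bytes
    gzipped: Optional[bytes]  # None when the MIME type is not compressible
    etag: str


def make_etag(path: str, payload: bytes) -> str:
    # Hash the path and payload together. The NUL separator is an assumption
    # that keeps ("ab", b"c") and ("a", b"bc") from producing the same input.
    digest = hashlib.sha256(path.encode("utf-8") + b"\x00" + payload).hexdigest()
    return f'"{digest}"'


def build_blob(path: str, mime_type: str, payload: bytes) -> CachedBlob:
    # Compress once at cache-fill time so that cache hits do no compression work.
    gzipped = gzip.compress(payload) if mime_type in COMPRESSIBLE_TYPES else None
    return CachedBlob(path, mime_type, payload, gzipped, make_etag(path, payload))


def select_body(blob: CachedBlob, accept_encoding: str) -> tuple[bytes, Optional[str]]:
    """Return (body, content_encoding) for a request's Accept-Encoding header."""
    # Simplified parsing: quality values such as "gzip;q=0" are ignored.
    encodings = {part.split(";")[0].strip() for part in accept_encoding.split(",")}
    if blob.gzipped is not None and "gzip" in encodings:
        return blob.gzipped, "gzip"
    return blob.payload, None
```

Including the path in the hash means two blobs with identical bytes at different paths still get different ETags.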

All of the remaining requests4 eventually make their way to PostgreSQL for content. As I said in a previous blog post, Hubnext stores all of its content in PostgreSQL, including binary blobs like images and PDFs. This isn’t a problem, because Hubnext communicates with its database over a local unix socket and most blobs are cached in memory anyway. This does prevent Hubnext from using sendfile to deliver blobs like filesystem-based web servers do, but there aren’t many large files hosted on RogerHub anyway.

At this point, nearly all requests have already been served by one of the previously mentioned handlers. But the majority of Hubnext’s web server code is dedicated to serving the remaining fraction. This includes archive pages, search results, new comment submissions, the RSS feed, and legacy redirect handlers. If all else fails, Hubnext gives up and returns a 404. The response is sent back to the load balancer over HTTP/1.1 with TLS5 and then forwarded on to the user. Thus completes the request.

Is the serving path shorter than you imagined? I think so6. More code sometimes just means additional maintenance burden and a greater attack surface. But altogether, Hubnext actually consists of far less code than the code it replaces. Furthermore, it’s nearly all code that I wrote and it’s all written in a single language as part of a single system. Say what you want about Go, but I think it’s the best language for implementing traditional CRUD services, especially those with a web component7. Hubnext is a distillation of what it means to deliver RogerHub reliably and efficiently to its audience, without the frills and patches of a traditional website. Anyway, I hope this post was a good distraction. Until next time!

  1. Plus, I happen to work for Google (but I didn’t when I first signed up with GCP), so I hear a lot about these services at company meetings. ↩︎
  2. If required, Hubnext could theoretically run on 10-20 instances with no problem. But the peer to peer syncing protocol is meant to be a fully connected graph, so at some point, it might run into issues with too much network traffic. ↩︎
  3. Hubnext also doesn’t use PostgreSQL’s backup tools, because Hubnext can create application-specific backups that are more convenient and understandable than traditional database backups. ↩︎
  4. Some pages simply aren’t cached, like search result pages and 404s. ↩︎
  5. Google HTTP(S) Load Balancing uses HTTP/1.1 exclusively when forwarding requests to backend servers, although it prefers HTTP/2 on the front end. ↩︎
  6. But for several years, RogerHub just ran as PHP code on cheap shared web hosts, which makes even this setup seem overly complicated. But it’s actually quite a bit of work to run and maintain a modern PHP environment. You’ll probably need a dozen C extensions and a reverse proxy and perhaps a suite of cron jobs to back up all your files and databases. For a long time, I followed a bunch of announcement mailing lists to get news about security fixes in PHP and WordPress and MediaWiki and NGINX and MySQL and all the miscellaneous packages on ubuntu-security-announce. This new serving path means I really only need to keep an eye out on security announcements for Go (which tend to be rare and fairly mild) or OpenSSH or maybe PostgreSQL. ↩︎
  7. After all, that covers a lot of what Google does. ↩︎

Something else

I talk a lot about one day giving up computers and moving out, far, far away to a place where there aren’t any people, in order to become a lumberjack or a farmer. Well lately, it’s been not so much “talking” as instant messaging or just mumbling to myself. Plus, I don’t have the physique to do either of those things. My palms are soft and I’m prone to scraping my arms all the time. I like discipline as a virtue, but I also don’t really like working. And finally, my hobby is proposing stupid outlandish half-joking-half-serious expensive irresponsible plans. It’s fun, and I guess you can’t really say something’s a bad idea until you’ve thought it through yourself.

Joking aside, my motivation comes from truth. Computers are terrible. It becomes more and more apparent to me every year that passes. I want to get far, far away from them. They’re the source of most of my problems1. And lately, it has become clear that lots of other people feel the same way.

Terribleness isn’t the computer’s fault. It all comes from a fundamental disconnect between what people think their computers are and the terrible reality of it all. A massive fraction of digital data is at constant risk of being lost forever, because it’s stored without redundancy on consumer devices scraped from the bottom of the barrel. Critical security vulnerabilities lurk in every computer system. Most of the time, they aren’t discovered simply because nobody has bothered looking hard enough. But given the lightest push, so much software2 just falls apart. You can break all kinds of software just by telling it to do two things at once. Load an article and swipe back at the same time? Bam! Your reddit app UI is now unusable. Submit the same web form from two different tabs? Ta-da! Your session state is now corrupt.

At this moment, my Comcast account is fucked beyond repair after a series of service cancellations, reinstatements, and billing credits put it in an unusual state. I needed to order internet service at my next apartment. I couldn’t do it on the website, probably because of the previous issues. I also didn’t think it was worthwhile explaining to yet another customer service rep why my account looked the way it did. So, I just made a new account and ordered it there3.

Boats at Fisherman's Wharf in Monterey Bay.

You can get rid of all these problems if you try hard enough. Some people simply don’t own any data. Their email and photos and documents and (nonexistent) repositories of old code just come and go along with the devices they’re stored on. They don’t have data at risk of compromise, because they don’t have any digital secrets and important computer systems (like their bank) just have humans double-checking things every step of the way.

Alternatively, you could put expiration dates on your data. Email keeps for 2 years. Photos: five. Sort your data into annual folders, and when the expiration date passes, simply delete it. This strategy takes the focus off of maintaining perfect data integrity and opens up a new method to measure your success. At the end of the year, if you never had trouble finding an old photo and your credit score was doing alright, then declare success.

Maybe you could live somewhere where people don’t really need computers. Maybe they have a phone that only makes calls and a nice big TV and maybe some books and paper and stuff to do outside. Or maybe they have iPads too: little TVs that you can touch.

Or you could simply come to terms with the way things are. After all, your data only needs to last as long as your own body does.

A rusty old tractor.

I stayed on a farm over Memorial Day weekend. It was owned by a couple. They had a tractor and some trucks and dogs and a jacuzzi and two little huts to rent out on Airbnb. I wonder if they had computers at their farm, or if they just had iPads.

Do they enjoy having a lot of land? Or is it wearisome waking up to the same rusting trucks and dirt roads every day? They probably have a lot of privacy, on account of their lack of upstairs, downstairs, and adjacent neighbors in their non-apartment home. Although, that’s probably less true if internet strangers are renting out their cabins all the time.

They had rows of crops, like the ones you see along highway 5 in central California. I wasn’t sure exactly who they belonged to, since their nearest neighbor was a solid 10 minute walk down the road. I wonder if they’re all friends.

My Subaru Impreza parked outside a cabin.

I spend a lot of my weekends working on my personal infrastructure. I can’t quite explain what it is I’m working on or why it needs work at all. It’s a combination of data management, software sandboxing, and build systems. At some point, I have to wonder if I’m spending more time working on my tools than actually using those tools to do stuff.

I also have a bunch of to-do’s on my Todoist account. The length of those lists has grown considerably since I graduated college a year ago4. I can’t believe it’s already been more than a year since then. How soon will it be two years? Did I live this past year right? Did I accomplish enough things?

Fields in Salinas.

I guess the answer really depends on how you measure “enough”. A year ago, I told myself that I wanted to become the best computer programmer that I could be. I feel like I’ve made lots of progress on that front. I picked up lots of good habits at work and my personal infrastructure is better than it’s ever been. I did a lot of cool stuff with RogerHub, and it finally has infrastructure that I can be proud of. But somehow, I’m also more dissatisfied with my computing than ever before.

I’m still living in the same area, and I’ve just committed to living here for yet another year. My plan had always been to stay in the Bay Area for three or four years, and then move away. But where to, and what for? Right now, I can’t even imagine what it’d be like to leave California5.

I’ve gotten better at cooking. More specifically, I think I’ve gotten more familiar with what kinds of food I like. On most days, I cook dinner for myself at home, and then I watch YouTube and browse the internet until I have to go to sleep. If it’s May or December, I also have to answer a few math questions.

But I haven’t made any new friends in the last year, other than my direct coworkers. I’ve spent more free time at home than ever before. I spend a small fortune on rent, after all. I might as well try to get some value out of it.

Bread and strawberries on a table.

Three years ago, I remember thinking that I didn’t know any adults who were actually happy, from which I concluded that growing up just kind of sucked.

I rarely drive anymore. I used to think driving at night on empty roads was nice and relaxing. But the roads around here just feel crowded and dangerous.

I’ll continue to work on computers, because that’s all I’m good at. More often than not, the last thought in my head before I fall asleep is an infinite grid of computers. On the bright side, I fixed my chair today and I bought a new toilet seat. As long as there’s a trickle of new purchases arriving at my doorstep, maybe things will be alright.

This post was kind of a downer. I’ll return with more website infrastructure next week6.

And since I haven’t posted a photo of me in a while, here’s a recent one:

Roger sitting by the ocean.

The photo on my home page doesn’t even look like me anymore. But I like it as an abstract icon, so it’ll probably stay there for a while, until I change it to something else.

  1. Which, I suppose, is pretty fortunate compared to the alternative. ↩︎
  2. You might blame software instead of the computer itself, but to most people, they’re the same thing. ↩︎
  3. I’m grateful I have such easy access to additional email addresses and phone numbers. ↩︎
  4. In fact, I only recently discovered that Todoist imposes a 200 task limit per project. ↩︎
  5. In any case, I realized this past weekend that I have way too many physical belongings to haul around, if I ever want to move. ↩︎
  6. I backdated this to July 31st, since that’s when I wrote most of the content. ↩︎

Training angst

Have you ever used Incognito Mode because you wanted to search for something weird, but you didn’t want it showing up in targeted ads? Or have you ever withheld a Like from a YouTube video, because although you enjoyed watching it, you weren’t really interested in being recommended more of the same? I have. And since I can’t hear you, I’ll assume you probably have too. People have gotten accustomed to the idea of “training” their computers to behave how they want, much like you’d train a dog or your nephew. And whether you study computer science or psychology or ecology or dog stuff, the principles of reinforcement learning are all about the same.

The reason you don’t search weird stuff while logged in or thumbs-up everything indiscriminately is that you’re trying to avoid setting the wrong example. But occasional slip ups are a fact of life. To compensate, many machine learning models offer mechanisms to deal with erroneous (“noisy”) labels in training data. The constraints of a soft margin SVM include a slack term that represents a trade-off between classification accuracy and resilience against mislabeled examples. Because a computer can’t tell which of its training examples are mislabeled and which are simply unusual, it does the next best thing: each example can be rated based on how “believable” it is in comparison to other examples. Then, finding the optimal parameters is simply a matter of minimizing unbelievability.
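For reference, this is the textbook soft margin SVM in standard notation (nothing specific to YouTube or to this post). The training examples are $x_i$ with labels $y_i \in \{-1, +1\}$, the slack variables are $\xi_i$, and $C$ is a constant you pick:

```latex
\min_{w,\,b,\,\xi} \quad \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i
\qquad \text{subject to} \qquad
y_i \left( w^\top x_i + b \right) \ge 1 - \xi_i, \quad \xi_i \ge 0 \quad (i = 1, \dots, n)
```

Each $\xi_i$ measures how far example $i$ lands on the wrong side of its margin, so it plays the role of that example’s unbelievability. A large $C$ punishes slack heavily and trusts every label; a small $C$ tolerates a few odd examples in exchange for a wider margin.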

Avoiding bad examples is in your best interest if you want the algorithm to give you the best recommendations. So, your YouTube Liked Videos list is probably only a rough approximation of the videos that you actually like1. Now, a computer algorithm won’t mind if you lie (a little) to it. But the real tragedy is that as a side effect, YouTube has effectively trained you, the user, to give Likes not to the videos you actually like, but to the kinds of videos you want recommendations for. In fact, most kinds of behavioral training induce this kind of reverse effect. The trainer lies a little bit to the trainee, in order to push the training in the right direction. And in return, the trainer drifts a little farther from the truth.

Parents do this to their children. Friends do it to their friends. Even if you try to be honest, the words you say and the reactions you make end up deviating ever so slightly from the truth, because you can’t help but think that your actions will end up in somebody’s brain as subconscious behavioral training examples2. If your friend invites you to do something you don’t want to do, maybe you’ll say yes, or else they might not even ask next time. And if they say something you don’t like, maybe you’ll act angrier than you really are, so they won’t mention it ever again. Every decision starts with “how do I feel about this?”, but is quickly followed up with “how will others feel about my feeling about this?”. This isn’t plain old empathy. It’s true that human-to-human behavioral training helps people get along with each other. But when our words and actions are influenced by how we think they’ll affect someone else’s behavior, they end up being fundamentally just another form of lying. And unlike a computer recommendation algorithm, people might actually hate you for lying to them.

This has been weighing on my mind for a long time. I think it’s unbearably hard for adults to be emotionally honest with each other, even for close friends or family. But the problem isn’t with the words we say. Of course, people want others to think of them a certain way, whether it’s about your money or your job or your passions or romance or mental health. And people lie about those things all the time. That isn’t what keeps me up at night. What bothers me is that even when you’re trying your best to be absolutely honest with someone, you can’t. You say the right words, but they don’t sound right. You feel the right feelings, but your face isn’t cooperating. Your eyes get hazy from years of emotional cruft, and you’re no longer able to really see the person right in front of you. And it’s all because we spend every day training each other with almost-truths.

A flower.

  1. The Like button is actually short for the “recommend me more videos Like this one” button. ↩︎
  2. Have you watched Bojack Horseman? ↩︎

Data loss and you

My laptop’s hard drive crashed in 2012. I was on campus walking by Evans Hall, when I took my recently-purchased ThinkPad X230 out of my backpack to look up a map (didn’t have a smartphone), only to realize it wouldn’t boot. This wasn’t a disaster by any means. It set me back $200 to rush-order a new 256GB Crucial M4 SSD. But since I regularly backed up my data to an old desktop running at my parents’ house, I was able to restore almost everything once I received it1.

I never figured out why my almost-new laptop’s hard drive stopped working out of the blue. The drive still spun up, yet the system didn’t detect it. But whether it was the connector or the circuit board, that isn’t the point. Hardware fails all the time for no reason2, and you should be prepared for when it happens.

Data management has changed a lot in the last ten years, primarily driven by the growing popularity of SaaS (“cloud”) storage and greatly improved network capacity. But one thing that hasn’t changed is that most people are still unprepared for hardware failure when it comes to their personal data. Humans start manufacturing data from the moment they’re born. Kids should really be taught data husbandry, just like they’re taught about taxes and college admissions and health stuff. But anyway, here are a few things I’ve learned about managing data that I want to share:

Identify what’s important

Data management doesn’t work if you don’t know what you’re managing. In other words, what data would make you sad if you lost access to it? Every day, your computer handles massive amounts of garbage data: website assets, Netflix videos, application logs, PDFs of academic research, etc. There’s also the data that you produce, but don’t intend to keep long-term: dash cam and surveillance footage (it’s too big), your computer settings (it’s easy to re-create), or your phone’s location history (it’s too much of a hassle to extract).

For most people, important data is the data that’s irreplaceable. It’s your photos, your notes and documents, your email, your tax forms, and (if you’re a programmer) your enormous collection of personal source code.

Consider the threats

It’s impossible to predict every possible bad thing that could happen to your data. But fortunately, you don’t have to! You can safely ignore all the potential data disasters that are significantly less likely to occur than your own untimely death3. That leaves behind a few possibilities, roughly in order of decreasing likelihood:

  • Hardware failure
  • Malicious data loss (somebody deletes your shit)
  • Accidental data loss (you delete your shit)
  • Data breach (somebody leaks your shit)
  • Undetected data degradation

Hardware failures are the easiest to understand. Hard drives (external hard drives included), solid state drives, USB thumb drives, and memory cards all have an approximate “lifespan”, after which they tend to fail catastrophically4. The rule of thumb is 3 years for external hard drives, 5 years for internal hard drives, and perhaps 10 years for enterprise-grade hard drives.

Malicious data loss has become much more common these days, with the rise of a digital extortion scheme known as “ransomware”. Ransomware encrypts user files on an infected machine, usually using public-key cryptography in at least one of the steps. The encryption is designed so that the infected computer can encrypt files easily, but is unable to reverse the encryption without the attacker’s cooperation (which is usually made available in exchange for a fee). Fortunately, ransomware is easily detectable, because the infected computer prompts you for money once the data loss is complete.

On the other hand, accidental data loss can occur without anybody noticing. If you’ve ever accidentally overwritten or deleted a file, you’ve experienced accidental data loss. Because it can take months or years before accidental data loss is noticed, simple backups are sometimes ineffective against it.

Data breaches are a unique kind of data loss, because they don’t necessarily mean you’ve lost access to the data yourself. Some kinds of data (passwords, tax documents, government identification cards) lose their value when they become available to attackers. So, your data management strategy should also identify whether some of your data is confidential.

Undetected data degradation (or “bit rot”) occurs when your data becomes corrupted (either by software bugs or by forces of nature) without you noticing. Modern disk controllers and file systems can provide some defense against bit rot (for example, in the case of bad sectors on a hard disk). But the possibility remains, and any good backup strategy needs a way to detect errors in the data (and also to fix them).

Things you can’t back up

Backups and redundancy are generally the solutions to data loss. But you should be aware that there are some things you simply can’t back up. For example:

  • Data you interact with, but can’t export. For example, your comments on social media would be difficult to back up.
  • Data that’s useless (or less useful) outside of the context of a SaaS application. For example, you can export your Google Docs as PDFs or Microsoft Word files, but then they’re no longer Google Docs.

Redundancy vs backup

Redundancy is buying 2 external hard drives, then saving your data to both. If either hard drive experiences a mechanical failure, you’ll still have a 2nd copy. But this isn’t a backup.

If you mistakenly overwrite or delete an important file on one hard drive, you’ll probably do the same on the other hard drive. In a sense, backups require the extra dimension of time. There needs to be either a time delay in when your data propagates to the backup copy, or better yet, your backup needs to maintain multiple versions of your data over time.
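To make the time dimension concrete, here’s a minimal Python sketch of versioned snapshots. The function name `take_snapshot`, the directory layout, and the date labels are all invented for the example, and real backup tools are much smarter about not storing unchanged files twice.

```python
import shutil
import tempfile
from pathlib import Path


def take_snapshot(source: Path, backup_root: Path, label: str) -> Path:
    """Copy `source` into a new, separately labeled directory under `backup_root`."""
    destination = backup_root / label
    # copytree refuses to overwrite an existing directory, so an older
    # snapshot can never be silently replaced by a newer one.
    shutil.copytree(source, destination)
    return destination


# Demonstration in a throwaway directory.
workdir = Path(tempfile.mkdtemp())
documents = workdir / "documents"
documents.mkdir()
(documents / "notes.txt").write_text("first draft")

monday = take_snapshot(documents, workdir / "backups", "2018-06-04")

# A mistaken overwrite, which a plain mirror would faithfully copy.
(documents / "notes.txt").write_text("oops")
tuesday = take_snapshot(documents, workdir / "backups", "2018-06-05")
```

After the mistake, Tuesday’s snapshot contains the damaged file, but Monday’s still holds the original. A mirrored second hard drive would only have the damaged one.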

RAID and erasure coding both offer redundancy, but do not count as a backup.

Backups vs archives

Backups are easier if you have less data. You can create archives of old data (simple ZIP archives will do) and back them up separately from your “live” data. Archives make your daily backups faster and also make it easier to perform data scrubbing.

When you’re archiving data, you should pick an archive format that will still be readable in 30 to 50 years. Proprietary and non-standard archive tools might fall out of popularity and become totally unusable in just 10 or 15 years.
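As one possible way to build such an archive, here’s a small sketch using Python’s standard `zipfile` module. The function name `create_archive` and the choice of deflate compression are assumptions for the example; the only point is that plain ZIP is easy to produce and easy to verify.

```python
import tempfile
import zipfile
from pathlib import Path


def create_archive(source_dir: Path, archive_path: Path) -> list[str]:
    """Write every file under `source_dir` into a ZIP and return the stored names.

    `archive_path` should live outside `source_dir`, or the archive would
    try to include itself.
    """
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(source_dir.rglob("*")):
            if path.is_file():
                # Store paths relative to the source so the archive is portable.
                archive.write(path, arcname=path.relative_to(source_dir).as_posix())
    # Reopen the archive and check each member's CRC before trusting it.
    with zipfile.ZipFile(archive_path) as archive:
        bad_member = archive.testzip()
        if bad_member is not None:
            raise OSError(f"corrupt member in archive: {bad_member}")
        return archive.namelist()
```

Note that a ZIP made this way keeps file names, contents, and timestamps, but not things like ownership or extended attributes.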

Data scrubbing

One way to protect against bit rot is to check your data periodically against known-good versions. For example, if you store cryptographic checksums with your files (and also digitally sign the checksums), you can verify the checksums at any time and detect bit rot. Make sure you have redundant copies of your data, so that you can restore corrupted files if you detect errors.

I generate SHA1 checksums for my archives and sign the checksums with my GPG key.
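A scrubbing pass could look something like this Python sketch. It is not the setup described above: it uses SHA-256 rather than SHA1, it leaves out the GPG signing step entirely, and the names `build_manifest` and `scrub` are made up.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path: Path) -> str:
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in 1 MiB chunks so large archives don't need to fit in memory.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to `root`) to its checksum."""
    return {
        path.relative_to(root).as_posix(): file_digest(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def scrub(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return the names of files that are missing or no longer match the manifest."""
    damaged = []
    for name, expected in manifest.items():
        path = root / name
        if not path.is_file() or file_digest(path) != expected:
            damaged.append(name)
    return damaged
```

You’d build the manifest once when the archive is created, store it (signed) next to the data, and run `scrub` on a schedule. Anything it reports gets restored from a redundant copy.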

Failure domain

If your backup solution is 2 copies on the same hard drive, or 2 hard drives in the same computer, or 2 computers in the same house, then you’re consolidating your failure domain. If your computer experiences an electrical fire or your house burns down, then you’ve just lost all copies of your data.

Onsite vs offsite backups

Most people keep all their data within a 20 meter radius of their primary desktop computer. If all of your backups are onsite (e.g. in your home), then a physical disaster could eliminate all of the copies. The solution is to use offsite backups, either by using cloud storage (easy) or by stashing your backups at a friend’s house (pain in the SaaS).

Online vs offline backups

If a malicious attacker gains access to your system, they can delete your data. But they can also delete any cloud backups5 and external hard drive backups that are accessible from your computer. It’s sometimes useful to keep backups of your data that aren’t immediately deletable, either because they’re powered off (like an unplugged external hard drive) or because they’re read-only media (like data backups on Blu-ray Discs).


Encryption

You can reduce your risk of data leaks by applying encryption to your data. Good encryption schemes are automatic (you shouldn’t need to encrypt each file manually) and thoroughly audited by the infosec community. And while you’re at it, you should make use of your operating system’s full disk encryption capabilities (FileVault on macOS, BitLocker on Windows, and LUKS or whatever on Linux).

Encrypting your backups also means that you could lose access to them if you lose your encryption credentials. So, make sure you understand how to recover your encryption credentials, even if your computer is destroyed.

Online account security

If you’re considering cloud backups, you should also take steps to strengthen the security of your account:

  • Use a long password, and don’t re-use a password you’ve used on a different website.
  • Consider using a passphrase (a regular English sentence containing at least 4-5 uncommon words). Don’t share similar passphrases for multiple services (like “my facebook password”), because an attacker with access to the plaintext can easily guess the scheme.
  • Turn on two-factor authentication. The most common 2FA scheme (TOTP) requires you to type in a 6-8 digit code whenever you log in. You should prefer to use a mobile app (I recommend Authy) to generate the code, rather than to receive the code via SMS. Don’t forget to generate backup codes and store them in a physically secure top-secret location (e.g. underneath the kitchen sink).
  • If you’re asked to set security questions, don’t use real answers (they’re too easy to guess). Make up gibberish answers and write them down somewhere (preferably a password manager).
  • If your account password can be recovered via email, make sure your email account is also secure.
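If you’re curious what a TOTP app is doing when it shows you a code, the algorithm is small enough to sketch in Python using only the standard library. This follows the published HOTP and TOTP specifications (RFC 4226 and RFC 6238) with the default HMAC-SHA1 and a 30 second step. It’s meant for understanding, not as a replacement for a real authenticator app, and it skips the base32 decoding that apps do when you scan a QR code.

```python
import hashlib
import hmac
import struct
import time
from typing import Optional


def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    """RFC 4226 HOTP: HMAC-SHA1 of a counter, truncated to a decimal code."""
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    # Dynamic truncation: the low 4 bits of the last byte pick a 4-byte window.
    offset = mac[-1] & 0x0F
    value = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(value % 10 ** digits).zfill(digits)


def totp(secret: bytes, now: Optional[float] = None, step: int = 30, digits: int = 6) -> str:
    """RFC 6238 TOTP: HOTP where the counter is the number of elapsed time steps."""
    if now is None:
        now = time.time()
    return hotp(secret, int(now // step), digits)
```

The server and your phone share the secret and a clock, so both can compute the same code independently. That’s also why the code never has to travel over SMS.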

Capacity vs throughput

One strong disadvantage of cloud backups is that transfers are limited to the speed of your home internet, especially for large uploads. Backups are less useful when they take days or weeks to restore, so be aware of how your backup throughput affects your data management strategy.

This problem also applies to high-capacity microSD cards and hard drives. It can take several days to fully read or write a 10TB data archival hard drive. Sometimes, smaller but faster solid state drives are well worth the investment.
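A quick back-of-the-envelope sketch makes the point concrete. The link speeds below are assumptions picked for illustration, not measurements, and real transfers rarely sustain their nominal rate.

```python
def transfer_days(size_bytes: float, bits_per_second: float) -> float:
    """Days needed to move size_bytes at a sustained rate of bits_per_second."""
    seconds = size_bytes * 8 / bits_per_second
    return seconds / 86_400  # seconds per day


if __name__ == "__main__":
    TB = 1e12
    # Assumed 10 Mbit/s residential upload speed.
    print(f"1 TB upload at 10 Mbit/s: {transfer_days(1 * TB, 10e6):.1f} days")
    # Assumed 50 MB/s (400 Mbit/s) sustained rate for a slow archival drive.
    print(f"10 TB drive at 50 MB/s:   {transfer_days(10 * TB, 400e6):.1f} days")
```

Under those assumptions, the upload takes a bit over 9 days and the drive takes a bit over 2 days, before accounting for any retries or slowdowns.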

File system features

Most people think of backups as “copies of their files”. But the precise definition of a “file” has evolved rapidly just as computers have. File systems have become very complex to meet the increasing demands of modern computer applications. But the truth remains that most programs (and most users) don’t care about most of those features.

For most people, “files” refers to (1) the directory-file tree and (2) the bytes contained in each file. Some people also care about file modification times. If you’re a computer programmer, you probably care about file permission bits (perhaps just the executable bit) and maybe symbolic links.

But consider this (non-exhaustive) list of filesystem features, and whether you think they need to be part of your data backups:

  • Capitalization of file and directory names
  • File owner (uid/gid) and permission bits, including SUID and sticky bits
  • File ACLs, especially in an enterprise environment
  • File access time, modification time, and creation time
  • Extended attributes (web quarantine, tags, “hidden”, and “locked”)
  • Resource forks, on macOS computers
  • Non-regular files (sockets, pipes, character/block devices)
  • Hard links (and lookalikes such as “aliases” or “junctions”)
  • Executable capabilities (maybe just CAP_NET_BIND_SERVICE?)

If your answer is no, no, no, no, no, what?, no, no, and no, then great! The majority of cloud storage tools will work just fine for you. But the unfortunate truth is that most computer programmers are completely unaware of many of these file system features. So, they write software that completely ignores them.
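If you want to see how much of this metadata one of your own files carries, here is a small sketch using Python’s standard os and stat modules. The describe helper is a made-up name, and it only covers a handful of the features listed above (no ACLs, extended attributes, or resource forks).

```python
import os
import stat


def describe(path: str) -> dict:
    """Collect a few pieces of metadata that a naive file copy may lose."""
    info = os.lstat(path)  # lstat: look at a symlink itself, not its target
    mode = info.st_mode
    if stat.S_ISLNK(mode):
        kind = "symlink"
    elif stat.S_ISDIR(mode):
        kind = "directory"
    elif stat.S_ISREG(mode):
        kind = "regular"
    else:
        kind = "other"  # sockets, pipes, devices
    return {
        "kind": kind,
        "permissions": stat.filemode(mode),  # e.g. "-rw-r--r--"
        "owner": (info.st_uid, info.st_gid),
        "modified": info.st_mtime,
        "hard_links": info.st_nlink,
        "setuid": bool(mode & stat.S_ISUID),
    }


if __name__ == "__main__":
    print(describe(os.path.expanduser("~")))
```

Comparing the output before and after a backup-and-restore round trip is a cheap way to find out which of these features your backup tool quietly drops.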

Programs and settings

Programs and settings are often left out of backup schemes. Most people don’t have a problem reconfiguring their computer once in a while, because catastrophic failures are unlikely. If you’re interested in creating backups of your programs, consider finding a package manager for your preferred operating system. Computer settings can usually be backed up with a combination of group policy magic for Windows and config files or /usr/bin/defaults for macOS.

Application-specific backup

If you’re backing up data for an application that uses a database or a complex file-system hierarchy, then you might be better served by a backup system that’s designed specifically for that application. For example, RogerHub runs on a PostgreSQL database, which comes with its own backup tools. But RogerHub uses an application-specific backup scheme that’s tailored to RogerHub specifically.


A backup isn’t a backup until you’ve tested the restoration process.


If you’ve just skipped to the end to read my recommendations, fantastic! You’re in great company. Here’s what I suggest for most people:

  • Use cloud services instead of files, to whatever extent you feel comfortable with. It’s most likely not worth your time to back up email or photos, since you could use Google Inbox or Google Photos instead.
  • Create backups of your files regularly, using the 3-2-1 rule: 3 copies of your data, on 2 different types of media, with at least 1 offsite backup. For example, keep your data on your computer. Then, back it up to an online cloud storage or cloud backup service. Finally, back up your data periodically to an external hard drive.
  • Don’t trust physical hardware. It doesn’t matter how much you paid for it. It doesn’t matter if it’s brand new or if you got the most advanced model. Hardware breaks all the time in the most unpredictable ways.
  • Don’t buy an external hard drive or a NAS as your primary backup destination. They’re probably no more reliable than your own computer.
  • Make sure to use full-disk encryption and encrypted backups.
  • Make sure nobody can maliciously (or accidentally) delete all of your backups, simply by compromising your primary computer.
  • Consider making archives of data that you use infrequently and no longer intend to modify.
  • Secure your online accounts (see section titled “Online account security”).
  • Pat yourself on the back and take a break once in a while. Data management is hard stuff!

If you find any mistakes on this page, let me know. I want to keep it somewhat updated.

And, here’s yet another photo:


  1. My laptop contained the only copy of my finished yet unsubmitted class project. But technically I had a project partner. We didn’t actually work together on projects. We both finished each project independently, then just picked one version to submit. ↩︎
  2. About four and a half years later, that m4 stopped working and I ordered a MX300 to replace it. ↩︎
  3. That is, unless you’re interested in leaving behind a postmortem legacy. ↩︎
  4. There are other modes of failure other than total catastrophic failure. ↩︎
  5. Technically, most reputable cloud storage companies will keep your data for some time even after you delete it. If you really wanted to, you could explain the situation to your cloud provider, and they’ll probably be able to recover your cloud backups. ↩︎

Life lessons from artificial intelligence

If you speak to enough software engineers, you’ll realize that many of them can’t understand some everyday ideas without using computer metaphors. They say “context switching” to explain why it’s hard to work with interruptions and distractions. Empathy is essentially machine virtualization, but applied to other people’s brains. Practicing a skill is basically feedback-directed optimization. Motion sickness is just your video processor overheating, and so on.

A few years ago, I thought I was the only one whose brain used “computer” as its native language. And at the time, I considered this a major problem. I remember one summer afternoon, I was playing scrabble with some friends at my parents’ house. At that time, I had just finished an internship, where day-to-night I didn’t have much to think about other than computers. And as I stared at my scrabble tiles, I realized the only word I could think of was EEPROM1.

It was time to fix things. I started reading more. I’ve carried a Kindle in my backpack since I got my first Kindle2 in high school, but I haven’t always used it regularly. It’s loaded with a bunch of novels. I don’t like non-fiction, especially the popular non-fiction about famous politicians and the economy and how to manage your time. It seems like a waste of time to read about reality, when make-believe is so much more interesting.

I also started watching more anime. I especially like the ones where the main character has a professional skill and that skill becomes an inextricable part of their personal identity3. During my last semester in college, I thought really hard about whether I really wanted to just be a computer programmer until I die, or whether I simply had no other choice, because I wasn’t good at anything else. And so, I watched Hibike! Euphonium obsessively, searching for answers.

Devoting your life to a skill can be frustrating. It makes you wonder if you’d be a completely different person if that part of you were suddenly ripped away. And then there’s the creeping realization that your childhood passion is slowly turning into yet another boring adult job. It’s like when you’re a kid and you want to be the strongest ninja in your village, but then you grow up and start working as a mercenary. You can still do ninja stuff all day, but it’s just not fun anymore.

But I like those shows because it’s inspiring and refreshing to watch characters who really care about being really good at something, as long as that something isn’t just “make a ton of money”. I think it’s important to have passion and a competitive spirit for at least one thing. It’s no fun being just mediocre at a bunch of things. Plus, being good at something gives you a unique perspective on the world, and that perspective comes with insights worth sharing.

I thought a lot about Q-learning during the months after my car accident. I think normal people are generally unprepared to respond rationally in crisis situations. And that’s at least partially because most of us haven’t spent enough time evaluating the relative cost of all the different terrible things that might happen to us on a day to day basis. Q-learning is a technique for decision-making that relies on predicting the expected value of taking an action in a particular state. In order for Q-learning to work, you need models for both the state transitions (what could happen if I take this action?) and a cost for each of the outcomes. If you understand the transitions, but all of your costs are just “really bad, don’t let that happen”, then in a pinch, it becomes difficult to decide which bad outcome is the least terrible.
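As a minimal sketch of the idea, here is the textbook tabular Q-learning update. The state and action names are invented for illustration, and costs are written as negative rewards. Each observed outcome nudges the estimate for (state, action) toward “reward now, plus the discounted value of the best action afterwards”.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, actions,
             alpha=0.5, gamma=0.9):
    """One Q-learning step: move Q(state, action) toward the observed target."""
    best_next = max(q[(next_state, a)] for a in actions)
    target = reward + gamma * best_next
    q[(state, action)] += alpha * (target - q[(state, action)])
    return q[(state, action)]


if __name__ == "__main__":
    q = defaultdict(float)  # unknown values start at 0
    actions = ["brake", "swerve"]
    # Hypothetical costs: a fender bender is bad, but much less bad than a rollover.
    q_update(q, "obstacle", "brake", -10.0, "stopped", actions)
    q_update(q, "obstacle", "swerve", -100.0, "stopped", actions)
    print(max(actions, key=lambda a: q[("obstacle", a)]))
```

With distinct costs, the table can rank the bad outcomes. If both rewards were the same “really bad” number, the two entries would be indistinguishable.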

There are little nuggets of philosophy embedded all over the fields of artificial intelligence and machine learning. I skipped a lot of class in college, but I never skipped my introductory AI and ML classes. It turns out that machine learning and human learning have a lot in common. Here are some more ideas, inspired by artificial intelligence:

I try to spend as little time as possible shopping around before buying something, and that’s partially because of what’s called the Optimizer’s Curse4. The idea goes like this: Before buying something, you usually look at all your options and pick the best one. Since people aren’t perfect, sometimes you overestimate or underestimate how good your options are. The more options you consider, the higher the probability that the perceived value of your best option will be much greater than its actual value. Then, you end up feeling disappointed, because you bought something that isn’t as good as you thought it’d be.

Now that doesn’t mean you should just buy the first thing you see, since your first option might turn out to be really shitty. But if you’re reasonably satisfied with your options, it’s probably best to stop looking and just make your choice.
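The effect is easy to reproduce with a small simulation. This is a sketch under a deliberately simple assumption: every option is truly worth 0, and your estimate of each one is off by some random Gaussian noise. The function name and parameters are made up for illustration.

```python
import random


def average_disappointment(num_options, trials=5000, noise=1.0, seed=0):
    """Average gap between the perceived and actual value of the chosen option."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Every option is truly worth 0; we only ever see noisy estimates.
        estimates = [rng.gauss(0.0, noise) for _ in range(num_options)]
        # We pick the best-looking option, whose actual value is still 0.
        total += max(estimates) - 0.0
    return total / trials


if __name__ == "__main__":
    for n in (1, 2, 5, 20):
        print(n, round(average_disappointment(n), 2))
```

With one option, the over- and underestimates cancel out on average. The more options you compare, the more the winner tends to be the one you overestimated the most.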

But artificial intelligence also tells us that it’s not smart to always pick the best option. Stochastic optimization methods are based on the idea that, sometimes, you should take suboptimal actions just to experience the possibilities. Humans call this “stepping out of your comfort zone”. Machines need to strike a balance between “exploration” (trying out different options to see what happens) and “exploitation” (using experience to make good decisions) in order to succeed in the long run. This balance is usually controlled by an “exploration rate”, and a good exploration rate decreases over time. In other words, young people are supposed to make poor decisions and try new things, but once you get old, you should settle down5.

The difference in cumulative value resulting from sub-optimal decisions is known as “regret”. In the long run, machines should learn the optimal policy for decision-making. But machines should also try to reach this optimum with as little regret as possible. This is accomplished by adjusting the exploration rate.

So is it wrong for parents to make all of their children’s decisions? A little guidance is probably valuable, but a too conservative exploration rate converges to a suboptimal long-term policy6. I suppose kids should act like kids, and if they scrape their knees and do stupid stuff and get in trouble, that’s probably okay.
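The exploration/exploitation balance described above is commonly implemented as an “epsilon-greedy” rule: with probability epsilon, try a random option, and otherwise pick the option that currently looks best. Here is a minimal sketch. The decay schedule and its constants are assumptions chosen for illustration, not a recommendation.

```python
import random


def epsilon(step, start=1.0, decay=0.01):
    """Probability of exploring; shrinks as experience (step) accumulates."""
    return start / (1.0 + decay * step)


def choose(values, step, rng=random):
    """Pick an option index given current value estimates."""
    if rng.random() < epsilon(step):
        return rng.randrange(len(values))  # explore: try anything
    return max(range(len(values)), key=values.__getitem__)  # exploit: best known


if __name__ == "__main__":
    estimates = [1.0, 5.0, 2.0]
    for step in (0, 100, 10_000):
        print(step, round(epsilon(step), 3), choose(estimates, step))
```

At step 0 the choice is completely random, and by step 10,000 it explores only about 1% of the time.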

Anyway, there’s one more artificial intelligence technique that I don’t understand too well, but it comes with interesting implications for humans. It’s a technique for path planning applied to finite LQR problems, which are a type of problem where the system mechanics can be described linearly and the cost function is quadratic in the state. These restrictions yield a formulation that lets us compute a policy that is independent of the state of the system. In other words, the machine plans a path by starting at the goal, then working backward to determine what leads up to that goal.

The same policy can be applied no matter your goal (“terminal condition”), because all the mechanics of the system are encoded in the policy. For example, if your goal is to build rockets at NASA, then it’s useful to consider what needs to happen one day, one month, or even one year before your dream comes true. The policy becomes less and less useful when the distance to your goal increases, but by working backward far enough, you can figure out what to do tomorrow to take the first step.

And if your plans don’t work out, well don’t worry, because the policy is independent of the state of the system. You can reevaluate your trajectory at any point to put yourself back on the right track7.
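For a concrete picture of “working backward”, here is a sketch of the finite-horizon LQR recursion in the simplest possible setting: a single scalar state x with dynamics x[t+1] = a*x[t] + b*u[t], and cost equal to the sum of q*x² + r*u² at each step plus q_final*x² at the end. The scalar setup and all the names are assumptions made for illustration; real problems use matrices.

```python
def lqr_gains(a, b, q, r, q_final, horizon):
    """Backward Riccati recursion for x[t+1] = a*x[t] + b*u[t] (scalar case)."""
    p = q_final  # cost-to-go at the final step
    gains = []
    for _ in range(horizon):  # walk backward from the goal
        k = (a * b * p) / (r + b * b * p)
        p = q + a * a * p - a * b * p * k
        gains.append(k)
    gains.reverse()  # gains[t] is the gain to use at time step t
    return gains


def rollout(x0, gains, a, b):
    """Apply the policy u[t] = -gains[t] * x[t] starting from x0."""
    xs = [x0]
    for k in gains:
        u = -k * xs[-1]
        xs.append(a * xs[-1] + b * u)
    return xs


if __name__ == "__main__":
    gains = lqr_gains(a=1.0, b=1.0, q=1.0, r=1.0, q_final=1.0, horizon=2)
    print(gains)
    print(rollout(10.0, gains, 1.0, 1.0))
    print(rollout(-3.0, gains, 1.0, 1.0))  # same gains, different starting state
```

Notice that lqr_gains never looks at x. The gains depend only on the system and on how many steps remain, which is why the same plan can be reused from wherever you happen to end up.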

I miss learning signal processing and computer graphics and machine learning and all of these classes with a lot of math in them. I work on infrastructure and networking at work, which is supposedly my specialization. But I also feel like I’m missing out on a lot of great stuff that I used to be interested in. The math-heavy computer science courses always felt a little more legit. I always imagined college to be a lot of handwriting and equations and stuff. Maybe I’ll pick up another side project for this stuff soon.

And here’s a photo of the hard disk from my first laptop:

A hard disk lying on some leaves.

It died less than a month after I got the laptop. After that, I started backing up my data more religiously. Plus, I replaced the spinning rust with a new Crucial M4 and that lasted for about 4.5 years until it broke too. I still kept this hard drive chassis and platter, because it looks cool.

  1. Acronyms aren’t allowed anyway. ↩︎
  2. My first Kindle was a 3rd generation Kindle Keyboard. When I broke that one, I bought another Kindle Keyboard even though a newer model had been released. I didn’t want my parents to notice I had broken my Kindle so soon after I got it, so I hid the old Kindle in a manilla envelope and used its adopted brother instead. Three years later, I upgraded to the Paperwhite, and that’s still in my backpack today. ↩︎
  3. See this or this. ↩︎
  4. But also partially because I’m a lazy bastard. ↩︎
  5. And yet, I haven’t left my apartment all weekend. ↩︎
  6. PAaaS: parenting advice as a service. ↩︎
  7. On second thought, this doesn’t have much to do with artificial intelligence. ↩︎

The data model of Hubnext

I got my first computer when I was 8. It was made out of this beige-white plastic and ran a (possibly bootlegged) copy of Windows ME1. Since our house had recently gotten DSL installed, the internet could be on 24 hours a day without tying up the phone line. But I didn’t care about that. I was perfectly content browsing through each of the menus in Control Panel and rearranging the files in My Documents. As long as I was in front of a computer screen, I felt like I was in my element and everything was going to be alright.

Computers have come a long way. Today, you can rent jiggabytes of data storage for literally pennies per month (and yet iPhone users still constantly run out of space to save photos). For most people living in advanced capitalist societies, storage capacity has been permanently eliminated as a reason why you might consider deleting any data at all. For people working in tech, there’s a mindset known as “big data”, where businesses blindly hoard all of their data in the hope that some of it will become useful at some time in the future.

On the other hand, I’m a fan of “small data”. It’s the realization that, for many practical applications, the amount of useful data you have is dwarfed by the overwhelming computing and storage capacity of modern computers. It really doesn’t matter how inefficient or primitive your programs are, and that opens up a world of opportunities for most folks to do ridiculous, audacious things with their data2.

When RogerHub ran on WordPress, I set up master-slave database and filesystem replication for my primary and replica web backends. WordPress needs to support all kinds of ancient shared hosting environments, so WordPress core makes very few assumptions about its operating environment. But WordPress plugins, on the other hand, typically make a lot of assumptions about what kinds of things the web server is allowed to do3. So the only way to really run WordPress in a highly-available configuration is to treat it like a black box and try your best to synchronize the database and filesystem underneath it.

RogerHub has no need for all of that complexity. RogerHub is small data. Its 38,000 comments could fit in the system memory of my first cellphone4 and the blobs could easily fit in the included external microSD card. But perhaps more important than the size of the data is how simple RogerHub’s dataset is.

Database replication comes with its own complexities, because it assumes you actually need transaction semantics5. Filesystem replication is mostly a crapshoot with no meaningful conflict resolution strategy for applications that use disk like a lock server. But RogerHub really only collects one kind of data: comments. The nice thing about my comments is that they have no relationship to each other. You can’t reply directly to other comments. Adding a new comment is as simple as inserting it in chronological order. So theoretically, all of this conflict resolution mumbo jumbo should be completely unnecessary.

I call the new version of RogerHub “hubnext” internally6. Hubnext stores all kinds of data: comments, pages, templates7, blobs8, and even internal data, like custom redirects and web certificates. Altogether, these different kinds of data are just called “Things”.

One special feature of hubnext is that you can’t modify or delete a Thing, once it has been created (i.e. an append-only data store). This property makes it really easy to synchronize multiple sets of Things on different servers, since each replica of the hubnext software just needs to figure out which of its Things the other replicas don’t have. To make synchronization easier, each Thing is given a unique identifier, so hubnext replicas can talk about their Things by just using their IDs.

Each hubnext replica keeps a list of all known Thing IDs in memory. It also keeps a rolling set hash of the IDs. It needs to be a rolling hash, so that it’s fast to compute H(n_1, n_2, …, n_k, n_{k+1}), given H(n_1, n_2, …, n_k) and n_{k+1}. And it needs to be a set hash, so that the order of the elements doesn’t matter. When a new ID is added to the list of Thing IDs, the hubnext replica computes the updated hash, but it also remembers the old hash, as well as the ID that triggered the change. By remembering the last N old hashes and the corresponding Thing IDs, hubnext builds a “trail of breadcrumbs” of the most recently added IDs. When a hubnext replica wants to sync with a peer, it sends its latest N hashes through a secure channel. The peer searches for the most recent matching hash that’s in both the requester’s hashes and the peer’s own latest N hashes. If a match is found, then the peer can use its breadcrumbs to generate a “delta” of newly added IDs and return them back to the requester. And if a match isn’t found, the default behavior is to assume the delta should include the entire set of all Thing IDs.
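Here is a rough sketch of that scheme in Python. It is not hubnext’s actual code: the Replica class, the method names, and the choice of XOR-ing truncated SHA-256 digests as the set hash are all assumptions made for illustration. XOR is a convenient stand-in because it is both incremental and order-independent.

```python
import hashlib
from collections import deque


def id_hash(thing_id: int) -> int:
    """64-bit hash of one ID (truncated SHA-256; an illustrative choice)."""
    digest = hashlib.sha256(str(thing_id).encode()).digest()
    return int.from_bytes(digest[:8], "big")


class Replica:
    def __init__(self, trail_length: int = 8):
        self.ids = set()
        self.set_hash = 0  # hash of the empty set
        # Breadcrumbs: (hash before the add, ID that was added), oldest first.
        self.trail = deque(maxlen=trail_length)

    def add(self, thing_id: int) -> None:
        if thing_id in self.ids:
            return
        self.trail.append((self.set_hash, thing_id))
        self.ids.add(thing_id)
        self.set_hash ^= id_hash(thing_id)  # rolling update, order-independent

    def recent_hashes(self) -> list:
        """Latest hash first, followed by the remembered older hashes."""
        return [self.set_hash] + [h for h, _ in reversed(self.trail)]

    def delta_for(self, requester_hashes) -> list:
        """IDs added here since the most recent state the requester also had."""
        known = set(requester_hashes)
        if self.set_hash in known:
            return []  # already in sync
        delta = []
        for old_hash, thing_id in reversed(self.trail):
            delta.append(thing_id)
            if old_hash in known:
                delta.reverse()  # return in the order the IDs were added
                return delta
        return sorted(self.ids)  # no match: fall back to the full ID set


if __name__ == "__main__":
    a, b = Replica(), Replica()
    for i in (1, 2, 3):
        a.add(i)
    for i in (3, 1, 2, 4, 5):
        b.add(i)
    print(b.delta_for(a.recent_hashes()))
```

In the example, the two replicas agree once both hold {1, 2, 3}, even though they added those IDs in different orders, so the delta is just the two IDs that came afterwards.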

This algorithm runs periodically on all hubnext replicas. It’s optimized for the most common case, where all replicas have identical sets of Thing IDs, but it also works well for highly unusual cases (for example, when a new hubnext replica joins the cluster). But most of the time, this algorithm is completely unnecessary. Most writes (like new comments, new blog posts, etc) are synchronously pushed to all replicas simultaneously, so they become visible to all users globally without any delay. The synchronization algorithm is mostly for bootstrapping a new replica or catching up after some network/host downtime.

To make sure that every Thing has a unique ID, the cluster also runs a separate algorithm to allocate chunks of IDs to each hubnext replica. The ID allocation algorithm is an optimistic majority consensus one-phase commit with randomized exponential backoff. When a hubnext replica needs a chunk of new IDs, it proposes a desired ID range to each of its peers. If more than half of the peers accept the allocation, then hubnext adds the range to its pool of available IDs. If the peers reject the allocation, then hubnext just waits a while and tries again. Hubnext doesn’t make an attempt to release partially allocated IDs, because collisions are rare and we can afford to be wasteful. To decide whether to accept or reject an allocation, each peer only needs to keep track of one 64-bit ID, representing the largest known allocated ID. And to make the algorithm more efficient, rejections will include the largest known allocated ID as a “hint” for the requester.
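A simplified sketch of the allocation loop might look like the following. Everything here is hypothetical: the Peer class, the propose method, and the rule that a peer accepts any range starting above its largest known ID are assumptions, and plain in-process calls stand in for the network. The sleep function is injectable so the backoff can be skipped in tests.

```python
import random


class Peer:
    def __init__(self):
        self.largest_allocated = 0  # the only state a peer needs to keep

    def propose(self, start: int, end: int):
        """Return (accepted, hint), where hint is the largest known allocated ID."""
        if start > self.largest_allocated:
            self.largest_allocated = end
            return True, end
        return False, self.largest_allocated


def allocate_chunk(peers, start, chunk_size=100, max_attempts=5,
                   sleep=lambda seconds: None):
    """Try to reserve chunk_size IDs; returns (first, last) or None."""
    for attempt in range(max_attempts):
        end = start + chunk_size - 1
        replies = [peer.propose(start, end) for peer in peers]
        accepts = sum(1 for accepted, _ in replies if accepted)
        if accepts * 2 > len(peers):  # more than half accepted
            return (start, end)
        # Rejected: partially accepted ranges are simply wasted, never released.
        start = max(hint for _, hint in replies) + 1
        # Randomized exponential backoff before the next proposal.
        sleep(random.uniform(0, 0.1 * 2 ** attempt))
    return None


if __name__ == "__main__":
    peers = [Peer(), Peer(), Peer()]
    print(allocate_chunk(peers, start=1))
    print(allocate_chunk(peers, start=1))  # collides, then retries using the hint
```

The second call is rejected on its first attempt, picks up the hint from the rejections, and succeeds with the next free range on the retry.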

There are some obvious problems with using an append-only set to serve website content directly. To address these issues, each Thing type contains (1) a “last modified” timestamp and (2) some unique identifier that links together multiple versions of the same thing. For blobs and pages, the identifier is the canonicalized URL. For templates, it’s the template’s name. For comments, it’s the Thing ID of the first version of the comment. When the website needs to fetch some website content, it only considers the instance of the data with the latest “last modified” timestamp among multiple Things with the same identifier.
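In code, the “latest version wins” rule boils down to something like this sketch. The Thing fields and the current_versions helper are invented for illustration, and the real thing presumably does this with database indexes rather than a loop over every Thing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: a Thing is never modified after creation
class Thing:
    thing_id: int
    key: str  # e.g. canonical URL, template name, or first comment ID
    last_modified: float
    body: str


def current_versions(things):
    """Map each key to the Thing with the latest last_modified timestamp."""
    latest = {}
    for thing in things:
        seen = latest.get(thing.key)
        if seen is None or thing.last_modified > seen.last_modified:
            latest[thing.key] = thing
    return latest


if __name__ == "__main__":
    store = [
        Thing(1, "/about", 100.0, "first draft"),
        Thing(2, "/contact", 110.0, "email me"),
        Thing(3, "/about", 120.0, "second draft"),  # an "edit" is a new Thing
    ]
    print(current_versions(store)["/about"].body)
```

Editing a page never touches the old Thing. It just appends a newer one with the same key, and readers ignore everything but the newest.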

Overall, I’m really satisfied with how this data storage model turned out. It makes a lot of things easier, like website backups, importing/exporting data, and publishing new website content. I intentionally glossed over the database indexing magic that makes all of this somewhat efficient, but that’s nonetheless present. There’s also an in-memory caching layer for the most commonly-requested content (like static versions of popular web pages and assets). Plus, there’s some Google Cloud CDN magic in the mix too.

It’s somewhat unusual to store static assets (like images and javascript) in a relational database. The only reason why I can get away with it is because RogerHub is small data. The only user-produced content is plaintext comments, and I don’t upload nearly enough images to fill up even the smallest GCE instances.

Anyway, have a nice Friday. If I find another interesting topic about Hubnext, I’ll probably write another blog post like this one soon.

A bridge in Kamikochi, Japan.

  1. But not for long, because I found install disks for Windows 2000 and XP in the garage and decided to install those. ↩︎
  2. I once made a project grading system for a class I TA’ed in college. It ran on a SQLite database with a single global database lock, because that was plenty fast for everybody. ↩︎
  3. Things like writing to any location in the web root and assuming that filesystem locks are real global locks. ↩︎
  4. a Nokia 5300 with 32MB of internal flash ↩︎
  5. I’ve never actually seen any WordPress code try to use a transaction. ↩︎
  6. Does “internally” even mean anything if it’s just me? ↩︎
  7. Templates determine how different pages look and feel. ↩︎
  8. Images, stylesheets, etc. ↩︎

What’s “next” for RogerHub

Did I intentionally use 3 different smart quotes in the title? You bet I did! But did it require a few trips to and some Python to figure out what the proper octal escape sequences are? As a matter of fact, yes. Yes it did. And if you’re wondering1, they’re \342\200\231, \342\200\234, and \342\200\235.
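For reference, here is a short sketch of how such escape sequences can be derived in Python: encode the character as UTF-8 and print each byte in octal. The helper name octal_escape is made up for illustration.

```python
def octal_escape(text: str) -> str:
    """Render each UTF-8 byte of text as a backslash-octal escape."""
    return "".join(f"\\{byte:03o}" for byte in text.encode("utf-8"))


if __name__ == "__main__":
    # Right single quote, left double quote, right double quote.
    for char in ("\u2019", "\u201c", "\u201d"):
        print(char, octal_escape(char))
```

Each of these smart quotes takes three bytes in UTF-8, which is why every escape sequence has three parts.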

The last time I rewrote RogerHub was in November of 2010, more than 6 years ago. Before that, I was using this PHP/MySQL blogging software that I wrote myself. RogerHub ran on cheap shared hosting that cost $44 USD per year. I moved the site to WordPress because I was tired of writing basic features (RSS feeds, caching, comments, etc.) myself. The whole migration process took about a week. That includes translating my blog theme to WordPress, exporting all my articles2, and setting up WordPress via 2000s-era web control panels and FTP.

Maybe it’s that time again? The time when I’m unhappy with my website and need to do something drastic to change things up.

To be fair, my “personal blog” doesn’t really feel like a blog anymore. Since RogerHub now gets anywhere between 2^17 and 2^21 visitors per month, it demands a lot more of my attention than a personal blog really should. During final exam season, I log onto my website every night to collect my reward: a day’s worth of final exam questions and outdated memes3. Meanwhile, I wrote 3 blog posts last year and just 1 the year before that.

I want to take back my blog. And I want to strategically reduce the amount of time I spend managing the comments section without eliminating them altogether. Lately I’ve been too scared to make changes to my blog, because of how it might break other parts of the site. On top of that, I have to build everything within the framework of WordPress, an enormous piece of software written by strangers in a language that gives me no pleasure to use. I miss when it didn’t matter if I broke everything for a few hours, because I was editing my site directly in production over FTP. And every time WordPress announces a new vulnerability in some JSON API or media attachments (all features that I don’t use), I miss running a website where I owned all of the code.

So on nights and weekends over the last 5 months, I’ve been working on a complete rewrite of RogerHub from the ground up. And you’re looking at it right now.

Why does it look exactly the same as before? Well, I lied. I didn’t rewrite the frontend or any of the website’s pages. But all the stuff under the hood that’s responsible for delivering this website to your eyeballs has been replaced with entirely new code4.

The rewrite replaces WordPress, NGINX, HHVM, Puppet, MySQL, and all the miscellaneous Python and Bash scripts that I used to maintain the website. RogerHub is now just a single Go program, running on 3 GCE instances, each with a PostgreSQL database, fronted by Google Cloud Load Balancer.

Although this website looks the same, I’ve made a ton of improvements behind the scenes that’ll make it easier for me to add features with confidence and reduce the amount of toil I perform to maintain the site. I’ll probably write more about the specifics of what’s new, but one of the most important things is that I can now easily run a local version of RogerHub in my apartment to test out new changes before pushing them live5. I’ve also greatly improved my rollout and rollback processes for new website code and configuration.

Does this mean I’ll start writing blogs again? Sure, probably.

I’m not done with the changes. I’ve only just finished the features that I thought were mandatory before I could migrate the live site over to the new system. I performed the migration last night and I’ve been working on post-migration fixes and cleanup all day today. It’s getting late, so I should just finish this post and go to sleep. But I’ll leave you with this nice photo. I used to end these posts with funny comics and reddit screencaps.

Tree branches and flowers in the fog.

It’s a little wider than usual, because I’m adding new features, and this is the first one.

  1. Two TODOs for me: memorize those escape codes and add support for automatic smart quotes in post titles ↩︎
  2. I used Google Sheets to template a long list of SQL queries, based on a phpMyAdmin dump that I copied and pasted into a spreadsheet. Then, I copied those SQL queries back into phpMyAdmin to import everything into WordPress. ↩︎
  3. By my count, I’ve answered more than 5,000 questions so far. The same $44 annual website fee is enough to run 2017’s RogerHub for about 2 weeks. ↩︎
  4. And that’s a big deal, I swear! ↩︎
  5. Gee, it’s 2017. Who would have thought that I still tested new code in production? ↩︎

Child prodigy

I watched a YouTube video this morning about a 13-year-old boy who taught himself to make iPhone apps and got famous for it. He took an internship at Facebook and then started working there full-time. There were TV stations and news websites that interviewed him and wrote about how he’s helping his family financially and how any teenager can start making tons of money if they just learn to code. And the story was nice and inspiring and stuff, except there are tons of kids that do the same thing and nobody writes articles about any of them. He’s probably 18 or 19 now1 and still working at Facebook as a product manager. How’s he feeling now? On the other hand, I’m a college senior, dreading the day when I have to start working like a grown-up and wondering if I’ll miss college and confused why people can’t just stay in college forever. He never went to college. He had probably gotten accepted at lots of different schools (did he even get a chance to apply?), but he decided college wasn’t worth the opportunity to work at Facebook and pull his family out of their crappy financial situation. Cheers to him.

I felt exactly the same way in high school. But I didn’t have a compelling reason to start working or the balls to deviate from the Good Kid Story™. I started making websites when I was 10, and by the time I finished high school, I could churn out CRUD web applications like any other rank-and-file software developer. Part of me honestly thought that I could skip a few semesters of class once I got to Berkeley, because I already knew about for-loops and I could write Hello World in a handful of languages. I thought college was going to be the place where people learn about the less-useful theoretical parts of programming. They’d teach me what a tree was, even though I never had any reason to use anything but PHP’s ubiquitous ordered hash map. I thought it wouldn’t be anything that I wouldn’t have learned anyways, if I just kept writing more and more code. And I was partially right, but also very wrong.

Getting a proper CS education is really important, and I wouldn’t recommend that anybody drop out or skip college, just so they can start working, especially if there isn’t a strong financial reason to do so. However, there are two hard truths that people don’t like admitting about CS education: 1) most of the stuff taught to undergrads is also available on the Internet, and 2) most people who get a CS degree are still cruddy programmers. So, school isn’t irreplaceable and it’s not like attending school will magically transform you into a mature grown-up programmer. But that’s really not why getting a formal CS education is important.

After 7 semesters, it’s still hard to say exactly why people place a lot of value on getting a formal education in computer science. Most people need to be taught programming, because they have no experience and are in no shape to do anything productive with a computer. But for all the programming prodigies of the world, there needs to be another reason. I can say that I’m a much better programmer than I was four years ago. It always seems like the code I wrote the previous year is a pile of garbage2.

School forced me to learn things that I never would have learned on my own (because they were irrelevant to my own projects) nor would I have learned while working full-time (because they’d be irrelevant to the work I’d be doing). In high school, I had no idea people could write programs that did more than loading and saving data to a database. The classes I took actually expanded the range of what programs I thought were possible to write3.

When I taught myself things as a kid, I would enter a tight loop of learn-do-learn-do. Most of the code I wrote was an attempt to get the Thing working as easily as possible, which ended up leading to a lot of frustration and wasted time. It’s hard to piece together a system before you understand the fundamental concepts. And that sounds really obvious, but a lot of programming tutorials seem to take that approach. They’ll tell you how to do the Thing, but they don’t bother giving you any intuition about the method itself. On the other hand, college classes have the freedom to explain the Thing in the abstract. Then once you start doing it yourself, you’ll know exactly what to look for4.

It’s really unfair to make a teenager make their own decisions about work and college, because you really shouldn’t be punished for making stupid life choices as a kid. Teaching myself programming as a kid was useful, but frankly I was a terrible teacher. But I’ve gotten better at that as well. This is my 5th semester as a teaching assistant, and I’ve picked up all kinds of awesome skills, from public speaking to technical writing, not to mention actual pedagogy as well. I’ve spent literally a thousand hours working on my tooling, because college convinced me that it really does matter5.

They say that it takes 10 years to really master a skill. Well, this is going to be my 12th year as a computer programmer, and I still don’t feel like I’ve mastered anything. I guess everybody learns in a different way, but it really sucks that society has convinced teenagers that college is optional/outdated. It’s easy to lure teenagers away from education with money and praise, especially because it’s really hard to see the point of a formal education when your entire programming career is creating applications that are essentially pretty interfaces to a database6. It doesn’t help that college-educated programmers are sometimes embarrassed to admit that school doesn’t work for everyone.

I wonder if that iPhone kid is disappointed with the reality of working full-time in software development. The free food and absurd office perks lose their novelty quickly.

  1. I have no idea actually. ↩︎
  2. Some people say that’s a good thing? I’ve realized that code is the enemy. The more code you write, the more bugs you’ve introduced. It’s incredibly hard to write code that you won’t just want to throw out next year. Code is the source of complexity and security problems, so the goal of software engineers is to produce less code, not more. When you have a codebase with a lot of parts, it’s easy to break things if you’re not careful. Bad code is unintuitive. Good code should be resistant to bugs, even when bad programmers need to modify it. ↩︎
  3. Little kids always tell you that programmers need to be good at math, which actually doesn’t make that much sense when I think about it. You need some linear algebra and calculus for computer graphics and machine learning. Maybe you’ll need to know modular arithmetic and number systems. But math really isn’t very important. ↩︎
  4. A huge number of software bugs are caused by the programmer misunderstanding the fundamentals of the thing they’re interacting with. ↩︎
  5. My favorite programming tools in high school were Adobe Dreamweaver and Notepad. I started using Ubuntu full-time in 11th grade, but didn’t make any actual efforts to improve my tools until college. ↩︎
  6. Not to underestimate the usefulness of simple CRUD apps. ↩︎

Email surveillance

There’s a new article in the SF Chronicle that says the University of California, Office of the President (UCOP) has been monitoring emails going in and out of the UC system by using computer hardware. I wanted to give my personal opinion, as a computer programmer and somebody who has experience managing mail exchangers1. The quotes in the SF Cron article are very generous with technical details about the email surveillance system. Most of the time, articles about mass surveillance are dumbed down, but this one gives us at least a little something to chew on.

Email was not originally designed to be a secure protocol. Over the three (four?) decades that email systems have been used, computer people have created several extensions to the original SMTP2 and RFC 822 message format to make email “good enough” for modern use. Most email today is exchanged under the protection of STARTTLS, which is an extension for SMTP that upgrades a cleartext connection to an encrypted connection, if both parties support it. The goal of STARTTLS is to provide resistance against passive monitoring. It doesn’t provide any guarantees about the authenticity of the other party, because usually the certificates aren’t validated, so STARTTLS is still vulnerable to MITM attacks3. There are other email-security extensions. But they’re either designed for ensuring authenticity rather than privacy (like SPF, DKIM, and DMARC) or they’re not widely used (like GPG).
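To make the downgrade problem concrete, here is a minimal Python sketch of the decision an opportunistic sender makes after saying EHLO. The replies, the hostname mx.example.org, and the function names are all made up for illustration; a real client would read the reply off the wire instead of from a string.

```python
def advertised_extensions(ehlo_reply):
    """Return the extension keywords listed in a multi-line EHLO reply."""
    extensions = set()
    lines = ehlo_reply.strip().splitlines()
    # The first line is the server greeting; each later line names one extension.
    for line in lines[1:]:
        if not line.startswith("250"):
            continue
        # Skip the "250-" or "250 " prefix and keep the first word.
        parts = line[4:].split()
        if parts:
            extensions.add(parts[0].upper())
    return extensions


def choose_transport(ehlo_reply):
    """Opportunistic behavior: encrypt only if the peer offers STARTTLS."""
    if "STARTTLS" in advertised_extensions(ehlo_reply):
        return "TLS"
    return "cleartext"


# Hypothetical reply from a server that supports STARTTLS.
HONEST_REPLY = (
    "250-mx.example.org Hello\r\n"
    "250-SIZE 35882577\r\n"
    "250-STARTTLS\r\n"
    "250 8BITMIME\r\n"
)

# The same reply after an active attacker deletes one line in transit.
STRIPPED_REPLY = HONEST_REPLY.replace("250-STARTTLS\r\n", "")

print(choose_transport(HONEST_REPLY))
print(choose_transport(STRIPPED_REPLY))
```

A passive listener never gets the chance to edit the reply, which is why STARTTLS holds up against snooping but not against somebody sitting in the middle of the connection.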

The only protection we have against passive snooping of emails is STARTTLS. According to the SF Cron article, the “intrusive device” installed at UC campuses is intended to capture and analyze traffic, rather than intercepting and modifying it. So, I took a look at some of the emails I’ve received at my personal address over the last 3.5 years of living in Berkeley. I looked specifically at the advertising emails I get from, because I’ve been receiving them consistently for many years, and they always come from the same place (Amazon SES). All of my most recent emails from Amazon follow this path, according to the email headers:

  • Amazon SES
  • UC Berkeley Mail Server “ees-ppmaster-prod-01”
  • 3 local mail filters, called “pps.reinject”, “pps.reinject”, and “pps.filterd”
  • UC Berkeley Mail Server “ees-sentrion-ucb3”
  • Google Apps Mail Server

Before April 2015, another UC Berkeley Mail Server was part of this path, in between the “sentrion” server and the Google Apps server. Before December 2014, the path looked completely different. There was only a single server between SES and Google, which was labeled “”.

According to the email headers, each step along the path is encrypted using STARTTLS, except for some of the local mail filters. Those 3 local mail filters are programs that run on the UC Berkeley Mail Server which might do things like scanning for viruses or filtering spam. They don’t exactly need encryption, because they don’t communicate over the network. I also noticed that before May 2015, there was only 1 local mail filter (the “pps.filterd” one) instead of 3.
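If you want to repeat this on your own inbox, this is roughly how the hop-by-hop check can be done in Python. The sample headers, hostnames, and IDs below are invented (they are not my real headers). The sketch also assumes that servers record TLS by writing "ESMTPS" in the "with" clause, as RFC 3848 describes; a server that only mentions TLS in a comment would need a different check.

```python
import email
import re

# Hypothetical message: the newest Received header is at the top.
SAMPLE = """\
Received: from mx-b.example.edu (mx-b.example.edu [192.0.2.20])
\tby mx.google.example with ESMTPS id abc123;
\tMon, 01 Feb 2016 10:00:03 -0800 (PST)
Received: from filter.local (localhost [127.0.0.1])
\tby mx-b.example.edu with ESMTP id def456;
\tMon, 01 Feb 2016 10:00:02 -0800 (PST)
Received: from ses.example.com (ses.example.com [192.0.2.10])
\tby mx-a.example.edu with ESMTPS id ghi789;
\tMon, 01 Feb 2016 10:00:01 -0800 (PST)
Subject: Deals

body
"""

# "with" values that indicate the hop used TLS (per RFC 3848).
TLS_PROTOCOLS = {"ESMTPS", "ESMTPSA", "LMTPS", "LMTPSA"}


def _clause(keyword, header):
    """Return the word that follows 'from', 'by', or 'with', or '?'."""
    match = re.search(r"\b" + keyword + r"\s+(\S+)", header, re.IGNORECASE)
    return match.group(1) if match else "?"


def trace_hops(raw_message):
    """Return (from, by, protocol, encrypted) tuples, oldest hop first."""
    message = email.message_from_string(raw_message)
    hops = []
    for value in message.get_all("Received", []):
        header = " ".join(value.split())  # undo header line folding
        protocol = _clause("with", header).upper()
        hops.append((
            _clause("from", header),
            _clause("by", header),
            protocol,
            protocol in TLS_PROTOCOLS,
        ))
    hops.reverse()  # headers are prepended, so reverse for delivery order
    return hops


for sender, receiver, protocol, encrypted in trace_hops(SAMPLE):
    status = "encrypted" if encrypted else "not encrypted"
    print(sender, "->", receiver, protocol, status)
```

In the sample, the middle hop comes from localhost and shows up as unencrypted, which is the same pattern as the local mail filters described above.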

The SF Cron article mentions that email surveillance started after attacks on UCLA Medical Center, which occurred in July 2015. Unfortunately, nothing significant seems to have changed in the email headers between June and October of 2015. But the use of STARTTLS, even within UC Berkeley’s own networks, casts doubt on the idea that UCOP surveillance was implemented as passive network monitoring.

If the surveillance was implemented at the network level, it would have to proxy the SMTP connections between all of the “ppmaster” and “sentrion” servers, as well as spoof the source IP or routing tables or reverse DNS lookup tables of the entirety of Berkeley’s local email network. It’d be an unnecessarily sophisticated method, if they just wanted to hide the presence of surveillance hardware.

On the other hand, if surveillance was implemented with the cooperation of campus IT staff, it would be pretty simple to implement for all emails campus-wide. There are already plenty of unlabeled local mail filters in place. These could easily be configured to forward an unencrypted copy of all emails to a 3rd party vendor’s system, for monitoring and analysis. Additionally, “sentrion”, which probably refers to Sendmail’s Sentrion product, looks like it was expressly designed for the purpose of recording and analyzing large amounts of email.

There are a couple of problems with the idea that email monitoring was implemented on the mail servers themselves with the cooperation of campus IT staff. If that were the case, it would take a separate system to monitor web traffic, and the article doesn’t explain one. Or perhaps the claim that web traffic was being monitored is incorrect4.

I’ve always accepted that work email should be considered the property of your employer. Your personal stuff should stay on your personal cell phone and email accounts. However, students are not employees of the University5. I don’t know much about law, but I feel like FERPA was passed to address these kinds of privacy questions regarding students and academic institutions. Implementing mass email surveillance without consulting faculty and students, regardless of its legality, seems underhanded and embarrassing for what claims to be the number one public university in the world.

  1. I’m currently a student and (technically?) an employee of UC Berkeley. But these opinions are my own. ↩︎
  2. The Simple Mail Transfer Protocol, which is used to deliver all publicly-routed email. ↩︎
  3. Man-in-the-middle attacks ↩︎
  4. Most web traffic (including RogerHub) goes through HTTPS today anyway. Monitoring web traffic without a MITM proxy would be ineffective. ↩︎
  5. Unless you happen to be both. ↩︎