
Pretty good computer security

I don’t cut my own hair. I can’t change my own engine oil. At the grocery store, I always pick Tide and Skippy, even though the generic brand is probably cheaper and just as good. “Well, that’s dumb,” you say. “Don’t you know you could save a lot of money?”, you say. But that isn’t the point. Sure, my haircut is simple. I could buy some hair clippers, and I’d love to save a trip to the dealership every year. But there are people who are professionals at cutting hair and fixing cars, and I wouldn’t trust them to program their own computers. So why should they trust me to evaluate different brands of peanut butter?

Fortunately, they’ve made the choice really simple, even for amateurs like me. In fact, everyone relies on modern convenience to some extent. Even if you cut your own hair, and even if you’re not as much of a Skippy purist, you’d have to admit: people today do all kinds of crazy complicated tasks, and none of us has time to become an expert at all of them.

Take driving, for example. Cars are crazy complicated. Most people aren’t professional drivers. In fact, most people aren’t even halfway decent drivers. But thanks to car engineers, you can do a pretty good job at driving by remembering just a few simple rules1. There are lots of other things whose simplicity goes unappreciated: filing taxes and power outlets and irradiating food with freaking death rays (okay, yes, I’m just naming things I see in my immediate vicinity). We’re lucky that people long ago had the foresight to realize how important/dangerous these things could be, and so they’ve all been aggressively simplified to protect people from themselves.

Are we done then? Has everything been child-proofed already? It’s easy to believe so. But if you’re unfortunate enough to still own a computer in 2018, then you’ll know that we still have a ways to go. Computers are deceptively simple. Lots of people use them: adults and little kids and old folks. But despite their wide adoption, it’s still far too easy to use a computer the wrong way and end up in a lot of trouble. Here are a few things I’ve learned about computer security that I want to share. I won’t claim that following these steps will make you invulnerable, because that’s an unattainable goal. But I think these form a pretty good baseline, with which you can at least say you’ve done your due diligence.

Stop using a computer

I’m serious. How about an iPad? Not an option? Keep reading.

Use multi-factor authentication

This is the number one easiest and most effective thing you can do to improve your security posture right away. Many websites offer a feature where you’re required to type in a special code, in addition to your username and password, before you can log in. This code can either be sent to you via text message, or it can be generated using an app on your smartphone2. Multi-factor authentication is a powerful defense against account hijacking, but make sure to keep emergency codes as a backup in case you lose your MFA device.
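If you’re curious how the app-generated variety works: most authenticator apps implement the standard TOTP scheme (RFC 6238), which boils down to hashing the current 30-second interval with a shared secret. Here’s a rough sketch in TypeScript; the hard-coded secret is a placeholder for the one your provider hands you (usually as a QR code) when you enroll.

```typescript
import { createHmac } from "crypto";

// A rough sketch of TOTP (RFC 6238). The secret below is a placeholder; real secrets
// come from the QR code you scan when setting up the account.
function totpCode(secret: Buffer, nowMs: number = Date.now()): string {
  const counter = BigInt(Math.floor(nowMs / 1000 / 30)); // 30-second time step
  const message = Buffer.alloc(8);
  message.writeBigUInt64BE(counter);
  const mac = createHmac("sha1", secret).update(message).digest();
  const offset = mac[mac.length - 1] & 0x0f;             // "dynamic truncation"
  const binary = mac.readUInt32BE(offset) & 0x7fffffff;
  return (binary % 1_000_000).toString().padStart(6, "0");
}

console.log(totpCode(Buffer.from("placeholder-shared-secret")));
```

Your phone and the website both derive the same six digits from the shared secret and the current time, which is why the codes keep working even when your phone has no signal.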

Get a password manager

People are bad at memorizing passwords, and websites are bad at keeping your passwords safe. You should adopt some method that allows you to use a different, unpredictable password for each of your accounts, whether that’s a password manager or just a sheet of notebook paper with cryptic password hints (but see my post about data loss).

Try post-PC devices

Mobile operating systems (iOS and Android) and Chromebooks are the quintessential examples of trusted platforms with aggressively proactive protection (hardware trust anchors, verified boot, sandboxing, signed binaries, granular permissions). Most people already know this. But perhaps it’s not so obvious how poorly some of these protections translate to traditional programs.

Every program on your computer interacts with the file system, and while most well-behaved programs keep to themselves, every single one technically has access to all of your files. This puts everything on your filesystem, like your documents and even your browser cookies3, at risk. Programs have evolved with this level of flexibility, so filesystem access isn’t something that can easily be taken away. There have been attempts to sandbox traditional computer programs, like those found on the Mac App Store and Windows Store. But these features are strictly opt-in, and most computer programs will likely never adopt them.

There’s plenty of stuff to steal in the filesystem, but that’s not all. Modern operating systems offer debugging interfaces, which allow you to read and write the memory of other programs. They also offer mechanisms for you to read the contents of the clipboard, or take screenshots of other programs, or even control the user’s input devices. These all sound like terrifying powers in the hands of a malicious program. But don’t forget that poorly-written and insecure programs can just as easily undermine your security, by allowing remote or web-based attackers to take control of them.

Because mobile and web-based platforms are newer, they’ve been able to lock down these interfaces. These new platforms don’t need to support a long legacy of older programs that expect such broad and unfettered access. Each mobile app typically gets an isolated view of the filesystem, where it can only access the files relevant to it. Debugging is disabled, and screenshots are a privilege reserved for system apps.

Don’t type passwords in public

If you film somebody typing their password, it’s surprisingly easy to play it in slow motion and see each key as it’s being pressed. This is especially true if you have a high frame rate camera, or if the victim is typing their password into a smartphone keyboard (or is just really slow at typing). Even if the attacker can’t see your fingers, there’s lots of new research in using acoustic or electromagnetic side channels to record typing from a distance. When you have to type your password, try covering your fingers by closing your laptop halfway, or go to the bathroom and do it there.

Don’t type passwords at all

Lots of people use a simple 6-digit PIN or a pattern to unlock their smartphone. At a glance, this seems safe, because the number of possible PINs and patterns greatly exceeds how many times the phone lets you attempt to unlock it. But 10-digit keypads and pattern grids are also easy to make out from a distance. Plus, phones that use at-rest encryption typically derive encryption keys from the unlock passphrase. There isn’t enough entropy in your 6-digit PIN or pattern to properly encrypt your data.

If you care about locking your phone securely, then you should use its fingerprint sensor and set up a long alphanumeric passphrase. You’ll rarely need to actually type your passphrase, as long as your fingerprint sensor works correctly. Thieves can’t steal your fingerprint (unless they’re really motivated), and you can rest assured that your strong password will protect your phone’s contents from brute force attacks.

Full disk encryption

If you have a computer password, but you don’t use full disk encryption, then your password is practically useless against a motivated thief. In most computers, the storage medium can be easily removed and transplanted into a different computer4. This technique allows an adversary to read all your files without needing your computer password.

To prevent this, you should enable full disk encryption on your computer (FileVault on macOS, or BitLocker on Windows). Lots of modern computers and smartphones enable full disk encryption by default. Typical FDE implementations use your login password and encrypt/decrypt files automatically as they’re read from and written to disk. Once the computer is shut down, the key material is forgotten and your data becomes inaccessible. Since the encryption keys are derived from your login password, you should make sure that it’s strong enough to stand up to brute force attacks.
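To make that last point concrete, here’s a minimal sketch of the general idea, not FileVault’s or BitLocker’s actual design: a key-derivation function stretches your login password into an encryption key, so a weak password means a weak key.

```typescript
import { pbkdf2Sync, randomBytes } from "crypto";

// Illustrative only: real FDE implementations differ in the details, but the
// password-to-key step looks roughly like this.
const salt = randomBytes(16);        // stored alongside the encrypted volume
const iterations = 600_000;          // deliberately slow, to frustrate brute force
const key = pbkdf2Sync("my login password", salt, iterations, 32, "sha256");
console.log(key.toString("hex"));    // 256-bit key derived from the password
```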

Don’t use antivirus software

What is malware? Antivirus software uses a combination of heuristics (patterns of behavior) and signatures (file checksums, code patterns, C&C servers, collectively known as indicators of compromise) to detect bad programs. It’s really effective at stopping known threats from spreading on vulnerable computer networks (picture an enterprise office network). But nowadays, all that antivirus software buys you is some peace of mind, delivered via a friendly green check mark.

A doctor in the United States wouldn’t recommend you buy a mosquito net. Similarly, antivirus software is a superfluous and unsuitable countermeasure for modern-day threats. It doesn’t protect against OAuth worms. It doesn’t protect against network-exploitable RCEs. It doesn’t protect against phishing (and when it tries to, it usually just makes everything worse). Since you’re reading this blog post, I assume you’re at least making an effort to improve your security posture. That fact alone puts you squarely in the group of people who don’t need antivirus software. So get rid of it.

Get a gaming computer

Games need to be fun. They don’t need to be secure. Given the amount of C++ code found in typical high-performance 3D games, I think it’s fairly likely that most of them have at least one or two undiscovered network-exploitable memory corruption vulnerabilities. Plus, lots of games come bundled with cheat detection software. These programs typically use your operating system’s debugging interfaces to peek into the other programs running on your computer. I like computer games, but I’d be uncomfortable running them alongside the programs that I use to check my email, fiddle with money, and run my servers. So if you play games, you should consider getting a separate computer that’s specifically dedicated to running games and other untrusted programs.

Get an iPhone

You need a phone that receives aggressive security updates. You need a phone that doesn’t include obscure unvetted apps, preloaded by the manufacturer. You need a phone with million dollar bounties for security bugs. You need a phone made by people who take pride in their work, people who love presenting about their data encryption schemes, people with the courage to stand up against unreasonable demands from their government.

Now, that phone doesn’t need to be an iPhone, per se. But that’s the phone I’d feel most comfortable recommending to others. And if that’s not an option, maybe a Google Pixel?

Research your software

Who makes your software? Do you know? Have you met them? (If you live in Silicon Valley, that’s not such a rhetorical question.) Because security incidents happen all the time, it’s easy to think that they’re random and unpredictable, like getting struck by lightning. But if you pay attention, you’ll notice that a lot of incidents can be attributed to negligence, laziness, or apathy. Software companies that prioritize security are quite rare, because security is a never ending quest, it costs lots of money, and in most cases, it produces no tangible results.

If you’re a computer programmer, you should get in the habit of reading the source code of programs that you use. And if that’s unavailable, then you can use introspection tools5 to reverse engineer them. After all, how can you trust a piece of software to be secure if you’ve never even tried to attack it?

You should also research the software publishers, and decide which ones you do and don’t trust. Do they have a security incident response team? Do they have a security bug bounty? A history of negligent security practices? Are they known for methodical engineering? Or do they prefer getting things done the quick and easy way? If you only run programs written by publishers that you trust, then you can greatly limit your exposure against poor engineering.

Use secure communication protocols

The gold standard in secure communication is end-to-end encryption with managed keys. This is where your messages are only readable by you and the recipient, but you rely on a trusted third party to establish the identity of the recipient in the first place. You don’t always need to use end-to-end encryption, since other methods might be more widely adopted or more convenient.

Don’t use browser extensions

Browser extensions are so incredibly useful, but I don’t feel comfortable recommending any browser extension to anybody (except those that don’t require any special permissions). There have been so many recent incidents where popular browser extensions were purchased from the original developer, in order to force ads and spyware onto their large preexisting user base. Browser extensions represent one of the few security weaknesses of the web, as a platform. So, you should only trust browser extensions published by well-established tech companies or browser extensions that you’ve written (or audited) yourself.

Turn off JavaScript

I mentioned before that websites run in an isolated sandbox, away from the rest of your computer. That fact supposedly makes web applications more secure than traditional programs. But as a result, web browsers don’t hesitate to boldly execute whatever code they’re fed. To reduce your exposure, you should consider only allowing JavaScript for a whitelist of websites where you actually need it. For example, there are plenty of news websites that are perfectly legible without JavaScript. They’re also the sites that use sketchy ad networks full of drive-by malware, probing for vulnerabilities in your browser.

Browsers allow you to disable JavaScript by default and enable it on a site-by-site basis. When you notice that a site isn’t working without JavaScript, you can easily enable it and refresh the page.

Don’t get phished

The internet is full of liars. You can protect yourself by paying attention to privileged user interface elements (like the address bar), using bookmarks instead of typing URLs, and maintaining a healthy amount of skepticism.

Get ready to wipe

If you suspect that malware has compromised your computer, then you should wipe and reinstall your operating system. Virus removal is a fool’s errand. As long as you have recent data backups (and you’re certain that your backups haven’t also been infected), then you should be able to wipe your computer without hesitation.

  1. One pedal goes vroom. The other brakes. Clockwise goes to your right, and counter-clockwise, to your other right. If the dashboard lights up orange, pull over and check your tires. These simple rules are all you need to know in order to drive a car. The other parts don’t really matter. We expect cars to Just Work™, and for the most part, they do. ↩︎
  2. Or a U2F security key, if you really love spending money. ↩︎
  3. Encryption using kernel-managed keys makes this more difficult. ↩︎
  4. Soldered components can make this harder. ↩︎
  5. Look at its open files, its network traffic, and its system calls. ↩︎

The rules are unfair

It’s 11pm. I should go to sleep. But sometimes, when it’s 11pm and I know I should go to sleep, I don’t. Instead, I stay up and watch dash cam videos on the internet. Nowadays, you can pop open your web browser and watch what are perhaps the worst moments of someone’s life, on repeat, in glorious high definition. It’s tragic, but also viscerally entertaining and conveniently packaged into small bite sized clips for easy consumption. Of course, not all the videos involve car accidents. Some of them just show bad drivers doing stupid things that happened to be caught on camera. In any case, I’ll probably read some comments, have a chuckle, and then eventually feel guilty enough to go to bed.

Dash cams are impartial observers. They give us rare glimpses into the real world, without the taint of unreliable human narrators1. Unlike most internet videos, there aren’t any special effects or editing. Dash cams don’t jump around from one interesting angle to the next, and they don’t come with cinematic, mood-setting background music (unless the cammer’s radio happens to be on). Instead, dash cams just look straight forward, paying as much attention to the boring as to the interesting, recording a small slice of history with the utmost dedication, until some impact inevitably knocks them off their windshield mount.

You can analyze dashcam footage, frame by frame, to see exactly why things played out the way they did. Most of us drive every day, and most of us know the risks, but these videos don’t bother us. We convince ourselves that “that could never happen to me, because I don’t follow trucks that closely” or “I always look both ways before entering a fresh green light”. Even when an accident or a near miss isn’t the cammer’s fault, we can still usually find something to blame them for. Somebody else may have caused the accident, but perhaps it was the cammer’s own poor choices that created the opportunity to begin with. Who knows? If the cammer had driven more defensively, then maybe nothing would have happened, and there would be one fewer video delaying my bedtime this evening. It’s only natural for us to demand such an explanation. We want evidence that the cammer deserved what happened to them. Because why would bad things just happen to not-bad people for no reason? That makes no sense. We’re not-so-bad people ourselves, after all. And if we can’t find fault with the cammer, then why aren’t all those same terrible things happening to us?

A few weeks ago, some Canadian kid downloaded a bunch of unlisted documents2 from a government website, and so the police went to his house and took away all his computers. “Bah, that’s absurd!”, you say. “Just another handful of bureaucrats that don’t understand technology”, you say. And as a fan of computers myself, I’d have to agree. It’s obviously the web developer’s fault that the documents were accessible without needing to log in, and it’s the web developer’s fault that even a teenager could trivially guess their URLs. But I also realize that, as computer programmers, it’s our natural instinct to focus only on the technical side of an issue. After all, how many countless months have we spent working on access control and security for computer systems? It’s easy for us to see this as a technical failure, to indemnify the kid, and to place the blame entirely on whoever created the website. But alas, the police don’t arrest bad programmers for making mistakes3.

Plenty of people scrape all kinds of data from websites, sometimes in ways that don’t align with the web developer’s intentions. Should they all expect the police to raid their houses too? Well, there are actually several legitimate reasons why this kind of activity could be considered immoral or even illegal (copyright infringement, violating terms of service, or consuming a disproportionate amount of computer resources). But a lot of laws and user agreements also use phrases like “unauthorized use of computers”, and with such vague language, even something as innocent as downloading publicly available files could be classified as a violation. I mean, sure. It’d be difficult to prove that the Canadian kid had any kind of criminal intent. But it’s pretty clear that nobody authorized him, explicitly or implicitly (by way of hyperlinks), to download all of those files. Isn’t that “unauthorized use”?

Laws are complicated. I’m no lawyer. I don’t even understand most of the regulations that apply to computers and the internet in the United States. But recently, I’ve come to realize that there’s a fundamental disconnect between laws and morality. When people don’t understand their laws, they tend to just assume that the law basically says “do the right thing”. And if you have even a shred of self-respect, then you’ll probably reply “I’m doing that already!”. Naturally, we can’t be expected to know all the particulars of what laws do and don’t allow us to do. So, some amount of intuition is required. Fortunately, nobody bothers to prosecute minor infractions, and people tend to talk a lot about unintuitive or immoral laws (marriage, abortion, weed, etc.), so they quickly become common knowledge. But I’d argue that laws aren’t actually an approximation of morality. It’d be more accurate to say that laws are an approximation of maximum utility (much like everything else, if you believe in utilitarianism).

There are some laws that exist, not because they’re moral truths, but because people are just generally happier when everybody obeys them. Is it immoral to suggest that people are predisposed to certain careers on the basis of their protected classes? Hard to say, but that’s beside the point. We don’t permit those ideas in public discourse, because people are, as a whole, happier and more productive when we don’t talk about those things.

Laws also aren’t an approximation of fairness, but are only fair to the extent that perceived fairness contributes to overall utility. What makes a fair law? That it benefits everyone equally? Obviously, a fair healthcare system should benefit really sick people more than it benefits people who just want antibiotics for their cold. Maybe a fair law should apply equally to everyone, regardless of their protected classes. But under that definition, regressive income tax and compulsory sterilization would be classified as fair laws. Should wealth and fitness be protected classes? What about age? There are plenty of laws that apply only to children4, but then again, maybe that’s why they’re always complaining that the rules are unfair.

I’ve learned to reduce my trust in rules, and I’ve started to distinguish what’s right from what’s allowable. You could argue that, in some sense, utilitarianism is the ultimate form of morality. But there’s a bunch of pitfalls in that direction, so I’ll stop writing now and go to sleep.

The coast near Pacifica, California.

  1. Or so we’d like to think. I feel like most of the time, dash cam videos unfairly favor the cammer. Unlike humans, dash cams can’t turn their heads; they have a narrower field of vision and usually only show what’s in front of the vehicle. Actual drivers are expected to be more aware of their surroundings than what a dash cam allows. ↩︎
  2. Even though the documents were unlisted, they had predictable numeric URLs, and so it was trivial to guess the document addresses and download them all. ↩︎
  3. They blame architects when bridges collapse. Should computer programs that are relevant to public safety be held to the same standard? ↩︎
  4. I think that it’s impossible to believe in fairness unless you also believe in the idea of the individual. Are children individuals? What about conjoined twins? ↩︎

Computer programming

I think I like being a computer programmer. Really. It’s not just something I say because it sounds good in job interviews. I’ve written computer programs almost every day for the last five years. I spend so much time writing programs that I can’t imagine myself not being a computer programmer. Like, could you enjoy food without knowing how to cook? Or go to a concert, having never seen a piece of sheet music? Yet plenty of people use computers without knowing how to reprogram them. That’s completely foreign to me and makes me feel uncomfortable. Fortunately, watching uncomfortable things also happens to be my favorite pastime.

In fact, I spend a lot of time just watching regular people use their computers. If we’re on the bus and I’m staring over your shoulder at your phone, I promise it’s not because I have any interest in reading your text messages. I actually just want to know what app you’re using to access your files at home. After all, you probably have just as many leasing contract PDFs and scanned receipts as I do. Yet, you somehow manage to remotely access yours without any servers or programmable networking equipment in your apartment.

My own digital footprint has gotten bigger and more complicated over the years. Right now, the most important bits are split between my computers at home and a small handful of cloud services1. At home, I actually use two separate computers: my MacBook Pro and a small black and silver Intel NUC that sits mounted on my white plastic Christmas tree. The NUC runs Ubuntu and is responsible for building all the personal software I write, including Hubnext, which runs this website. I also use it as a staging area for backups and large uploads to the cloud that run overnight (thanks, U.S. residential ISPs). I use the MacBook Pro for everything else: web browsing, media consumption, and programming. It’s become fairly easy to write code on one machine and run it on the other, thanks to a collection of programs I slowly built up. Maybe I’ll write about those one day.

At some point after starting my full time job, I noticed myself spending less and less time programming at home. It soon became apparent that I was never going to be able to spend as much time on my personal infrastructure as I wanted to. In college, I ran disaster recovery drills every few months or so to test my backup restore and bootstrapping procedures. I could work on my personal tools and processes for as long as I wanted, only stopping when I ran out of ideas. Unfortunately, I no longer have unlimited amounts of free time. I find myself actually estimating the required effort for different tasks (can you believe it?), in order to weigh them against other priorities. My to-do list for technical infrastructure continues to grow without bound, so realistically, I know that I’ll never be able to complete most of those tasks. That makes me sad. But perhaps what’s even worse is that I’ve lost track of what I liked about computer programming in the first place.

The shore near Pacifica Municipal Pier.

Lately, I’ve been spending an inordinate amount of time working on Chrome extensions. There are a handful that I use at home and others that I use at work. But unlike most extensions, I don’t publish these anywhere. They’re basically user scripts that I use to customize websites to my liking. I have one that adds keyboard shortcuts to the Spotify web client. Another one calculates the like to dislike ratio and displays it as a percentage on YouTube videos. Lots of them just hide annoying UI elements or improve the CSS of poorly designed websites. Web browsers introduced all sorts of amazing technologies into mainstream use: sandboxes for untrusted code, modern ubiquitous secure transport, cross-platform UI kits, etc. But perhaps the most overlooked is just how easy they’ve made it for anybody to reverse engineer websites through the DOM2. Traditional reverse engineering is a rarefied talent and increasingly irrelevant for anybody who isn’t a security researcher. But browser extensions are approachable, far more useful, and also completely free, unlike a lot of commercial RE tools.
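For flavor, here’s the skeleton of that kind of content script in TypeScript. The selectors and the shortcut key are placeholders (real sites need their own), and the script gets attached to matching pages through a content_scripts entry in the extension’s manifest.json.

```typescript
// content-script.ts: a minimal user-script-style extension that hides an annoying
// element and binds a keyboard shortcut to an existing button. All selectors are placeholders.
const style = document.createElement("style");
style.textContent = ".promo-banner, .cookie-nag { display: none !important; }";
document.head.appendChild(style);

document.addEventListener("keydown", (event) => {
  if (event.key === "n" && !event.ctrlKey && !event.metaKey) {
    document.querySelector<HTMLButtonElement>("button[aria-label='Next']")?.click();
  }
});
```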

When I work on browser extensions, I don’t need a to-do list or any particular goal. I usually write extensions just to scratch an itch. I’ll notice some UI quirk, open developer tools, hop over to Atom, fix it, and get right back to browsing with my new modification in place. Transitioning smoothly between web browsing and extension programming is one of the most pleasant experiences in computer programming: the development cycle is quick, the tools are first class, and the payoff is highly visible and immediate. It made me remember that computer programming could itself be enjoyed as a pastime, rather than a means to an end.

The bluffs at Mori Point.

I’m kind of tired of sitting down every couple months to write about some supposedly fresh idea or realization I’ve had. At some point, I’ll inevitably run out of things to say. Until that happens, I guess I’ll just keep rambling.

The other major recent development in my personal programming work is that I’ve started merging all my code into a monorepo. The repo is just named “R”, because one advantage of working for yourself is that you don’t have to worry about naming anything the right way3. It started out with just my website stuff, but since then I’ve added my security camera software, my home router configuration, my data backup scripts, a bunch of docs, lots of experimental stuff, and a collection of tools to manage the repo itself. Sharing a single repo for all my projects comes with the usual benefits: I can reuse code between projects. It’s easier to get started on something new. When I work on the repo, I feel like I’m working on it as a whole, rather than any individual project. Plus, code in the repo feels far more permanent and easy to manage than a bunch of isolated projects. It’s admittedly become a bit harder to release code to open source, but given all the troublesome corporate licensing nonsense involved, I’m probably not planning to do that anyway.

I feel like I’m just now finishing a year-long transition process from spending most of my time at school to spending most of my time at work. It took a long time before I really understood what having a job would mean for my personal programming. My hobby had been stolen away by my job, and to combat that feeling, I dedicated lots of extra time to working on personal projects outside of work. But instead of finding a balance between hobbies and work, the extra work just left me burnt out. That situation has improved somewhat, partially because I’ve been focusing on projects that let me reduce how much time I spend maintaining personal infrastructure, but also because I’ve accomplished all the important tasks and I’ve learned to treat the remainder as a pastime instead of as chores. But there’s still room for improvement. So, if you’re wondering what I’m working on cooped up in my apartment every weekend, this is it.

At some point, I should probably write an actual guide about getting started with computer programming. People keep emailing me about how to get started, even though I’m not a teacher (at least not anymore) and I don’t even regularly interact with anyone who’s just starting to learn programming. And I should probably do it before I start to forget what it’s like to enjoy programming altogether.

  1. Things are supposedly set up with enough redundancy to lose any one piece, but that’s probably not entirely true. I’ve written about this before. In any case, one of the greatest perks of working at the big G is the premium customer support that’s implicitly offered to all employees, especially those in engineering. If things really went south, I guess I could rely on that. ↩︎
  2. I really appreciate knowing just enough JavaScript to customize web applications with extensions. It’s one of the reasons I gave up vim for Atom. ↩︎
  3. Lately, I’ve been fond of naming things after single letters or phonetic spellings of letters. I have other things named A, B, c2, Ef, gee, Jay, K, L, Am, and X. ↩︎

Net neutrality

I don’t know a single person who actually supports the FCC’s recent proposal to deregulate broadband internet access by reclassifying it as an information service. However, I also never did meet a single person who unironically supported last year’s President-elect, but that doesn’t seem to have made any difference. There were certainly a lot of memes in last year’s U.S. election cycle, but I remember first seeing memes about net neutrality when I was in high school. That was before all the Comcast-Verizon-YouTube-Netflix hubbub and before net neutrality was at the forefront of anyone’s mind. So naturally, net neutrality got tossed out and ignored along with other shocking but “purely theoretical” future problems like climate change and creepy invasive advertising1. But I’ve seen some of those exact same net neutrality infographics resurface in the last couple weeks, and in retrospect, I realize that many of them were clearly made by people who weren’t network engineers and weren’t at all familiar with how the business of being an ISP actually works. And so, the materials used in net neutrality activism were, and still are, sometimes inaccurate, misleading, or highly editorialized to scare their audience into paying attention2.

Now, just to be clear: Do I think the recent FCC proposals are in the best interest of the American public? Definitely not. And do I think that the general population is now merely an audience to political theater orchestrated by, or on behalf of, huge American residential ISPs? I think it’s probable. After all, why bother proposing something so overwhelmingly unpopular unless it’s already guaranteed to pass? Nevertheless, I feel like spreading misinformation about the realities of net neutrality is only hurting genuine efforts to preserve it.

Before I begin, I should mention that you can read the full 210-page proposal on fcc.gov yourself if you want the full picture. I hate that news websites almost never link to primary sources for the events they’re covering, and yet nobody really holds them accountable to do so. Anyway, I’m no lawyer, but I feel like in general, the legal mumbo-jumbo behind controversial proposals is usually far more subtle and less controversial-sounding than the news would have you believe. That’s very much the case here. And for what it’s worth, I think this particular report is very approachable for normal folks and does do a good job of explaining context and justifying its ideas.

Brush plant in San Francisco.

If you haven’t read the proposal, the simple version is that in mid-December 2017, the FCC will vote on whether to undo the net neutrality rules they set in 2015 and replace them with the requirement that ISPs must publish websites that explain how their networks work. The proposal goes on to explain how the existing rules stifle innovation by discouraging investment in heavily-regulated network infrastructure. In practice, the proposed changes would eliminate the “Title II” regulations that prevent ISPs from throttling or blocking legal content and delegate the policing of ISPs to a mix of the FTC, antitrust laws, and market forces.

It’s hard to predict the practical implications of these changes, but based on past examples, many people point to throttling of internet-based services that compete with ISPs (like VoIP) and high-bandwidth applications (like video streaming and BitTorrent), along with increased investment in ISP-owned alternatives to internet services for things like video and music streaming. As a result, a lot of net neutrality activism revolves around purposefully slowing down websites or presenting hypothetical internet subscription plans that charge extra fees for access to different kinds of websites (news, gaming, social networking, video streaming) in order to illustrate the consequences of a world without net neutrality. But in reality, neither of these scenarios is very realistic, and yet both of them already exist in some form today.

You probably don’t believe me when I say that throttling is unrealistic, and I understand. After all, we’ve already seen ISPs do exactly that to Netflix and YouTube. But first, you should understand a few things about residential ISPs.

Like most networks, residential ISPs are oversubscribed, meaning that the individual endpoints of the network are capable of generating more traffic than the network itself can handle. They cope with oversubscription by selling subscription plans with maximum throughput rates, measured in megabits per second3. Upstream traffic (sent from your home to the internet) is usually throttled down to these maximum rates by configuration on your home modem. But downstream traffic (sent from the internet to your home) is usually throttled at the ISP itself by delaying or dropping traffic that exceeds this maximum configured limit. So you see, the very act of throttling or blocking traffic isn’t a concern for net neutrality. In fact, most net neutrality regulations have exemptions that allow this kind of throttling when it’s for purely technical reasons, because some amount of throttling is an essential part of running a healthy network.
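If it helps to picture it, that ceiling is usually enforced by something like a token bucket: traffic spends tokens, tokens refill at the plan’s advertised rate, and anything beyond that gets delayed or dropped. Here’s a toy sketch (the numbers are made up, and real ISP shapers run in dedicated hardware).

```typescript
// A toy token-bucket rate limiter: packets spend tokens, tokens refill at a fixed rate,
// and packets that arrive when the bucket is empty get dropped (or queued).
class TokenBucket {
  private tokens: number;
  private lastRefill = Date.now();

  constructor(private bytesPerSec: number, private burstBytes: number) {
    this.tokens = burstBytes;
  }

  allow(packetBytes: number): boolean {
    const now = Date.now();
    const elapsedSec = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.burstBytes, this.tokens + elapsedSec * this.bytesPerSec);
    this.lastRefill = now;
    if (this.tokens < packetBytes) return false; // over the limit: delay or drop
    this.tokens -= packetBytes;
    return true;
  }
}

// A hypothetical 25 Mbit/s plan is roughly 3,125,000 bytes per second.
const downstream = new TokenBucket(3_125_000, 512 * 1024);
console.log(downstream.allow(1500)); // a typical full-sized packet: allowed
```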

Furthermore, all ISPs already discriminate (i.e., apply different throttling or blocking rules) against certain types of traffic by way of classification. At the most basic level, certain types of packets (like zero-length ACKs) and connections (like “mice flows”) are given priority over others (like full-sized data packets for video streaming) as part of a technique known as quality of service (QoS). Many ISPs also block or throttle traffic on certain well-known ports, such as port 25 for email and port 9100 for printing, because they’re commonly abused by malware and there’s usually no legitimate reason to route such traffic from residential networks onto the open internet. Meanwhile, certain kinds of traffic can be delivered more quickly and reliably simply because of networking arrangements made between your ISP and content providers (like Netflix Open Connect). In other cases, your ISP may be stuck in a disadvantageous peering agreement, in which it has to pay another ISP extra money to send or receive traffic on certain network links, in addition to just the costs of maintaining the network itself.

People generally agree that none of these count as net neutrality violations, because they’re practical aspects of running a real network and, in many cases, they justify themselves by providing real value to end users. It’s difficult to explain concisely what divides these kinds of blocking and throttling from the scandalous net neutrality kind. Supposedly, net neutrality violations typically involve blocking or throttling for “business” reasons, but “reducing network utilization by blocking our competitors” could arguably have technical benefits as well. In practice, most people call it a net neutrality violation when it’s bad for customers and call it “business as usual” when it’s either beneficial for customers or represents the way things have always worked. In any case, the elimination of all blocking and throttling is neither practical nor desirable. When discussing net neutrality, it’s important to acknowledge that many kinds of blocking and throttling are legitimate and to (try to) focus on the kinds that aren’t.

Leaves in Golden Gate Park.

Websites that purposefully slow themselves down paint a wildly inaccurate picture of a future without net neutrality, especially when they do so without justification. ISPs gain nothing from indiscriminate throttling, other than saving a couple bucks on power and networking equipment. Plus, ISPs can (and do) get the same usage reduction benefits by imposing monthly bandwidth quotas, which have nothing at all to do with net neutrality. I think a more likely outcome is that ISPs will start pushing for the adoption of new heterogeneous internet and TV combo subscription plans. These plans will impose monthly bandwidth quotas on all internet traffic except for a small list of partner content providers, which will complement a larger emphasis on ISP-provided TV and video on demand services. After all, usage of traditional notebook and desktop computers is on the decline in favor of post-PC era devices like smartphones and tablets. A number of U.S. households would probably be perfectly happy to trade boring unmetered internet for a 10GB/month residential broadband internet plan combined with a TV subscription and unlimited access to the ISP’s first-party video on demand service along with a handful of other top content providers. Such a plan could eliminate the need for third-party video streaming subscriptions like Netflix, thereby providing more content for less money. Naturally, a monthly bandwidth quota would make it difficult for non-partner video streaming services to compete effectively, but fuck them, right?

I should point out that no matter what happens to net neutrality, we’ll still have antitrust laws (the proposal mentions this explicitly) and an aggressive DoJ to chase down offenders. Most ISPs operate as a local monopoly or duopoly. So, using their monopoly position in the internet access market to hinder competition in internet content services sounds like an antitrust problem to me. But it’s possible that the FCC’s reclassification of internet access as an “information service” may change this.

The other example commonly used by net neutrality activists is the a-la-carte internet subscription. In this model, different categories of content (news, gaming, social networking, video streaming) each require an extra fee or service tier, sort of like how HBO and Showtime packages work for TV today. For this to work, ISPs need to be able to block access to content that subscribers haven’t paid for. In the past, this might have been implemented with a combination of protocol-based traffic classification (like RTMP for video streaming), destination-based traffic classification (well known IP ranges used by online games), and plain old traffic sniffing (reconstructing plaintext HTTP flows). But such a design would be completely infeasible from a technical standpoint in today’s internet.

First, nearly all modern internet applications use some variant of HTTP to carry the bulk of their internet traffic. Even applications that traditionally featured custom designed protocols (video conferencing, gaming, and media content delivery) now almost exclusively use HTTP or HTTP-based protocols (HLS, WebSockets, WebRTC, REST, gRPC, etc). This is largely because HTTP is the only protocol that has both ubiquitous compatibility with corporate firewalls and widespread infrastructural support in terms of load balancing, caching, and instrumentation. As a result, it’s far more difficult today to categorize internet traffic with any degree of certainty based on the protocol alone.

Additionally, most of the aforementioned HTTP traffic is encrypted (newer variants like SPDY and HTTP/2 virtually require encryption to work). For the a-la-carte plan to work, you need to first categorize all internet traffic. We can get some hints from SNI and DNS, but that’s not always enough and also easily subverted.
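Here’s a toy version of that hostname-based guessing, and why it’s flimsy: it only works when a recognizable hostname is visible at all, which increasingly it isn’t. The domain list is just an example, not anything an ISP actually ships.

```typescript
// Toy traffic classifier based on the SNI hostname (or a DNS lookup seen earlier).
const CATEGORY_BY_SUFFIX: Record<string, string> = {
  "netflix.com": "video",     // example entries only
  "nflxvideo.net": "video",
  "xboxlive.com": "gaming",
};

function classify(sniHostname: string | undefined): string {
  if (!sniHostname) return "unknown"; // encrypted SNI, a bare IP, or a VPN: no hint at all
  const match = Object.keys(CATEGORY_BY_SUFFIX).find(
    (suffix) => sniHostname === suffix || sniHostname.endsWith("." + suffix),
  );
  return match ? CATEGORY_BY_SUFFIX[match] : "unknown";
}

console.log(classify("cdn-host.nflxvideo.net")); // "video" (hypothetical hostname)
console.log(classify(undefined));                // "unknown": the common case behind CDNs and VPNs
```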

Internet applications with well-known IP ranges are also a thing of the past. Colocation has given way to cloud hosting, and it’s virtually impossible to tell what’s inside yet another encrypted HTTPS stream to some AWS or GCP load balancer.

Essentially, there can’t truly exist a “gaming” internet package without the cooperation of every online game developer on the planet.

A-la-carte models work well with TV subscriptions because there are far fewer parties involved. If ISPs ever turn their attention to gamers, it’s most likely that they’ll partner with a few large “game networks” that can handle both the business transactions4 and the technical aspects of identifying and classifying their game traffic on behalf of residential ISPs. So, you probably won’t be buying an “unlimited gaming” internet package anytime soon. Instead, you’ll be buying unlimited access to just Xbox Live and PSN. From that point on, indie game developers will simply have to fit in your monthly 10GB quota for “everything else”.

Reeds near the San Francisco Bay.

Net neutrality activists say that net neutrality will keep the internet free and open. But the very idea of a “free and open” internet is sort of a myth. To many people, the ideal internet is a globally distributed system of connected devices, where every device can communicate freely and equally with every other device. In a more practical sense, virtually anybody should have the power to publish content on the internet, and virtually anybody should be able to consume it. No entity should have control over the entire network, and no connection to the internet should be more special than any other, because being connected to the internet should mean that you’re connected to all of it.

In reality, people have stopped using the internet to communicate directly with one another. Instead, most internet communication today is mediated by the large global internet corporations that run our social networks, our instant messaging apps, our blogs, our news, and our media streaming sites. Sure, you’re “free” to post whatever garbage you’d like to your Tumblr or Facebook page, but only as long as those companies allow you to do so.

But the other half of a “free and open” internet means that anybody can start their own Facebook competitor and give control back to the people, right? Well, if you wanted to create your own internet company today, your best bet would be to piggyback off of an existing internet hosting provider like AWS or GCP, because they’ve already taken care of the prerequisites for creating globally accessible (except China) internet services. At the physical level, the internet is owned by the corporations, governments, and organizations that operate the networking infrastructure and endpoints that constitute the internet. The internet only works because those operators are incentivized to bury thousands of miles of optical fiber beneath the oceans and hang great big antennas on towers in the sky. It only works because, at some point in the past, those operators made agreements to exchange traffic with each other5 so that any device on one network could send data that would eventually make it to any other network. There’s nothing inherent about the internet that ensures the network is fully connected (in fact, this full connectivity breaks all the time during BGP leaks). It’s the responsibility of each individual content publisher to make enough peering arrangements to ensure they’re reachable by anyone who cares to reach them.

Not all internet connections are equal. Many mobile networks and residential ISPs use technologies like CGNAT and IPv6 tunneling to cope with IPv4 address exhaustion. But as a result, devices on those networks must initiate all of their traffic and cannot act as servers, which must be able to accept traffic initiated by other devices. In practice, this isn’t an issue, because mobile devices and (most) residential devices initiate all of their traffic anyway. But it does mean that such devices are effectively second class citizens on the internet.

It’s also an increasingly common practice to classify internet connections by country. Having a privileged IP address, like an address assigned to a residential ISP in the United States, means having greater access and trust when it comes to country-restricted media (like Netflix) and trust-based filtering (like firewalls), compared to an address assigned to a third world country or an address belonging to a shared pool used by a public cloud provider. This is especially the case with email spam filtering, which usually involves extensive IP blacklists in addition to rejecting all traffic from residential ISPs and shared public cloud IP pools.

Finally, let’s not forget those countries whose governments choose to filter or turn off the internet entirely on some occasions. But they have bigger things to worry about than net neutrality anyway.

So, is the internet doomed? Not quite. It’s already well known that last mile ISPs suck and IPv4 exhaustion sucks and IP-based filtering sucks. But as consumers, we still need strong government regulation on residential internet service providers, just like we need regulation on any monopolistic market. People often say that the internet is an artifact, not an invention. So we all share a responsibility to make it better, but we should try to do so without idealistic platitudes and misleading slogans.

  1. I’m kidding, of course. ↩︎
  2. This seems to be quite common with any kind of activism where people try to get involved via the internet. ↩︎
  3. This is not the only way to sell networking capacity. Many ISPs charge based on total data transfer and allow you to send/receive as fast as technically possible. This is especially common with cellular ISPs and cloud hosting. ↩︎
  4. “Business transactions” ↩︎
  5. Read all about it on peeringdb.com. ↩︎

Feel bad

The hardest truth I had to learn growing up is that not everyone is interested in the truth, and maybe that’s okay. I’m not talking about material truths like the mounting evidence for climate change or the incredible importance of space exploration. There’s nothing to debate about material truths, and while some people dispute them, you can hardly argue that ignoring them is okay. I’m talking about truths about people. I’m talking about whether our laws are fair to the poor. I’m talking about whether certain races or genders or social classes are predisposed to be better or worse at their jobs. I’m talking about whether regulated capitalism is the best we can ask for, whether crime is a racial issue, whether homosexuality is a mental illness, whether the magic sky man really does watch everything you do, whether abortion is baby murder, whether Steve Jobs would like the new iPhone, and whether we’re cold-hearted monsters for selling guns to people who can’t make up their minds about any of these issues in a country where people die from preventable gun violence every day. I want to know why these questions cause so many problems, and I want to know why, in many cases, it seems like the best thing to do is to ignore them.

Here’s some truth for you: I don’t particularly mind if nobody reads this blog, and yet I know that some people do. I know there’s spillover traffic from the number one most popular page on this site, and although I haven’t checked Google Analytics in a few months, I know that the spillover is enough so that every few hours, some poor sucker winds up on one of these blog posts and miraculously reads the whole thing. I could just as happily write to a quiet audience as I could to nobody at all, because ultimately I just like the idea of publishing my thoughts to the public record. Here’s another truth: I’m not wearing pants (I almost never wear pants at home). I finished my dinner and finished washing the dishes and wiping down the counter and now I’m sitting in the dark with no pants on, wondering whether it’s time for a new laptop. More truths: the deadlines at work are getting to my head, even though I won’t admit it. I don’t exactly like my work, but I do like my workplace, and the thought of leaving is too frightening to even consider right now. I feel like my list of things to do has been growing endlessly ever since I graduated from college, and I try not to think of what things will be like in a year, because I’m afraid I won’t like the answer.

When I was around 15 years old, I started getting all kinds of ideas in my head, and the biggest one was that the internet could teach me all kinds of truths that nobody would teach me before. At that moment, I felt like my brain had unlocked an extra set of cores1. I thought that, given enough time, a mind in isolation could discover all the grand truths of the universe. I wondered if that mind could be my mind, and I couldn’t wait to get started.

Some of the discovery was just disgusting stuff. I went on the internet searching for the nastiest, most shocking images I could find, because I thought they would help me better understand how people’s brains worked. After all, images were just a bunch of colorful pixels. I thought it was a form of intellectual impurity to think that some images, and the ideas they depicted, should be off-limits to a curious mind. So, forcible desensitization was the only way.

But I actually spent most of my time reading about science. I learned as much physics as Wikipedia and Flash web applets could teach a teenager…which turned out to be mostly unhelpful when the time came to learn actual physics. I learned about aviation accidents and nuclear proliferation and ancient humans. I learned about the history of American space exploration and how modern religions were developed. I learned that people in the United States lived quite differently than most of the world, in more ways than what wealth alone could explain. But most importantly, I learned (what I thought were) several uncomfortable truths about people, particularly truths that many people denied2.

For many years, I went on carrying these ideas quietly in my head. They made me cynical and disdainful toward other people who didn’t think the way I did, because in my head, they weren’t just different. They were wrong. I can always identify other people who had the same experience growing up (or at least I imagine that I can). They all have the same dead-fish look in their eyes that says “you’re too afraid to face the truth, but I’m not”. And frankly, I very much still miss those days when all I needed was a computer and a deeply rooted misbelief to disprove.

I’m surprised that I ever grew out of that phase. I suppose everyone grows out of everything, sooner or later. But it’s not that my beliefs have changed or weakened. Rather, I realized that most of my cynicism was based on the false premise that nothing was more important than the truth.

It’s hard to look past a truth-centric view of the world when people spend so much time just trying to prove other people wrong. But as a matter of fact, the ultimate goal (the “objective function”) for a human isn’t how right you are3. People want simpler things. They want to feel good and they want the reassurance that they’ll continue to feel good in the future. They want to feel accepted and live productive purposeful lives. Finding the truth can help advance these objectives, but it is not itself a goal. At the end of the day, most of us are tired from working a full-time job and can’t be bothered to care so much. After all, if a couple of white lies make lots of people happier, who’s to say that that’s not okay?

Parking lot at Castle Rock State Park, California.

  1. An extra lobe of gray matter? ↩︎
  2. Like how the government is mind controlling people with cell towers and chemtrails. ↩︎
  3. Yes, even for people who really enjoy being right. ↩︎

The serving path of Hubnext

I’ve always liked things straight and to the point. But I feel a bit silly writing that here, since most of my blog posts from high school used a bunch of long words for no reason. My writing did eventually become more concise, but it happened around the time I also sort of stopped blogging. I stopped writing altogether for a few years, but then in college, I started doing an entirely different kind of writing. As a teaching assistant, I wrote homework assignments and class announcements for computer science classes. Technical writing is different from blogging, but writing for a student audience also has its own unique challenges. The gist of it is that students don’t like reading, which is naturally at odds with the long, detailed specs that some of our projects require. It’s the instructor’s responsibility to make sure the important details stand out from the boring reference material.

In college, I also started eating lunch and dinner on the go regularly, which is something I had never really done before. It was a combination of factors that made walk-eat meals so enticing. The campus was big, so getting from place to place usually involved a bit of walking. Street level restaurants sold fast food in convenient form factors. Erratic scheduling meant it was unusual to be with another person for more than a couple hours at a time, so sit-down meals didn’t always make sense. I got really good at the different styles of walk-eating, from gnawing on burritos without spilling to balancing take-out boxes between my fingers. Eating on my feet felt like the most distilled honest form of feeding. It didn’t make sense to add so many extra steps to accomplish essentially the same task.

For a long time, the serving path of this website was too complicated. The serving path is all the things directly involved in delivering web pages to visitors, which can be contrasted with auxiliary systems that might be important, but aren’t immediately required for the website to work. I made my first website more than 12 years ago. Lots of things have changed since then, both with me and with the ecosystem of web development in general. So when I set out to reimplement RogerHub, I wanted to eliminate or replace as many parts of the serving path as I could.

Mission beach in San Francisco.

Let’s start with the basics. Most of the underlying infrastructure for RogerHub hasn’t changed since I moved the website from Linode to Google Cloud Platform in January 2016. I use Google’s Cloud DNS and HTTP(S) Load Balancing to initially capture inbound traffic. Google’s layer 7 load balancing provides global anycast, geographic load balancing, TLS termination, health checking, load scaling, and HTTP caching, among other things. It’s the best offering available at its low price point, so there’s little reason to consider alternatives1. However, HTTP(S) Load Balancing did have a 2-hour partial outage last October. I don’t get much traffic in October, but the incident made me start thinking of ways I could mitigate a similar outage in the future.

Behind the load balancer, RogerHub is served by a number of functionally identical backend servers (currently 2). These servers are fully capable of serving user traffic directly, but they currently only accept traffic from Google’s designated load balancer IP ranges. During peak seasons, I usually scale this number up to 3, but the extra server costs me about $50 a month, so I prefer to run just 2 the rest of the time. These extra servers exist entirely for redundancy, not capacity. A single server could serve a hundred times peak load with no problem2.

Each server is a plain old Ubuntu host running an instance of PostgreSQL and an instance of the Hubnext application. That’s it. There’s no reverse proxy, no memcached, no mail server, and not even a provisioner (Hubnext knows how to install and configure its own dependencies). Hubnext itself can run in multiple modes, most of which are for maintenance and debugging. But my web servers start Hubnext in “application” mode. When Hubnext runs in application mode, it starts an ensemble of concurrent tasks, only one of which is the web server. The others are things like a unique ID allocator, peer-to-peer data syncing, system health checkers, maintenance daemons, and half a dozen kinds of in-memory caches. Hubnext can renew its own TLS certificate, create backups of its data, and even page me on my iPhone when things go wrong. Before Hubnext, these tasks were handled by haphazard cron jobs and independent services, which were set up and configured by a Puppet provisioner. Keeping all these tasks in a single application has made developing new features a lot simpler.
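
Here’s a rough sketch of what an ensemble like that can look like in Go. This is a simplified illustration, not the actual Hubnext code: each task is just a long-lived function, and the process dies if any of them does.

    package main

    import (
        "context"
        "log"
        "net/http"
        "time"
    )

    // webServer is one task in the ensemble: the actual HTTP server.
    func webServer(ctx context.Context) error {
        srv := &http.Server{Addr: ":8080"}
        go func() {
            <-ctx.Done()
            srv.Shutdown(context.Background())
        }()
        return srv.ListenAndServe()
    }

    // healthChecker is another task: a periodic background daemon.
    func healthChecker(ctx context.Context) error {
        t := time.NewTicker(time.Minute)
        defer t.Stop()
        for {
            select {
            case <-ctx.Done():
                return ctx.Err()
            case <-t.C:
                log.Print("health check ok")
            }
        }
    }

    func main() {
        ctx, cancel := context.WithCancel(context.Background())
        defer cancel()
        tasks := []func(context.Context) error{webServer, healthChecker}
        errc := make(chan error, len(tasks))
        for _, task := range tasks {
            go func(run func(context.Context) error) { errc <- run(ctx) }(task)
        }
        // If any task dies, take the whole process down with it.
        log.Fatal(<-errc)
    }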

So why does Hubnext still need PostgreSQL? It’s true that Hubnext could have simply kept track of its own data, along with maintaining the various indexes that make common operations fast. But it’s an awful lot of work and unneeded complexity to implement something that a database already does for free. Of all the components of a traditional website’s architecture, I chose to keep the database, because I think PostgreSQL pulls its weight more than any of the other systems that Hubnext supplants. That being said, Hubnext intentionally doesn’t use the transactions or replication that PostgreSQL provides (i.e. the parts of a database most sensitive to failure). Instead, Hubnext’s data model is designed to work without multi-statement transactions, and Hubnext performs its own application-level data replication, which is probably easier to configure and troubleshoot than database replication3.

When a request arrives at a Hubnext server, it gets wrapped in a tracker (for logging and metrics) and then enters the web request router. Instead of matching by request path, the Hubnext request router gives each registered web handler a chance to accept responsibility for a request before it falls through to the next handler. This allows for more complex and free-form routing logic. The web router starts with basic things like an OPTIONS handler and DoS rejection. The vast majority of requests are handled by the Hubnext blob cache and page cache. These handlers keep bounded in-memory caches of popular blobs (images, JavaScript, and CSS) and pre-rendered versions of popular pages. But even on a cache hit, the server still runs some code to process things like cache headers, ETags, and compression.
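
Here’s a minimal sketch of that routing style (simplified, with made-up names): each handler reports whether it accepted responsibility for the request, and the router falls through until one of them does.

    package main

    import "net/http"

    // claimingHandler reports whether it handled the request.
    type claimingHandler interface {
        TryServe(w http.ResponseWriter, r *http.Request) bool
    }

    // router offers each request to its handlers in order.
    type router struct{ handlers []claimingHandler }

    func (rt *router) ServeHTTP(w http.ResponseWriter, r *http.Request) {
        for _, h := range rt.handlers {
            if h.TryServe(w, r) {
                return
            }
        }
        http.NotFound(w, r) // nobody claimed it
    }

    // optionsHandler only claims OPTIONS requests.
    type optionsHandler struct{}

    func (optionsHandler) TryServe(w http.ResponseWriter, r *http.Request) bool {
        if r.Method != http.MethodOptions {
            return false
        }
        w.Header().Set("Allow", "GET, HEAD, OPTIONS")
        w.WriteHeader(http.StatusNoContent)
        return true
    }

    func main() {
        http.ListenAndServe(":8080", &router{handlers: []claimingHandler{optionsHandler{}}})
    }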

Blobs whose request path starts with /_version_ get long cache expiration times, which instructs Google Cloud CDN to cache them. Overall, about 40% of the requests to RogerHub are served from the CDN without ever reaching Hubnext. You can distinguish these cache hits by their Age header. ETags are generated from the SHA256 hash of the blob’s path and payload. Compression is applied to blobs based on a whitelist of compressible MIME types. The blob cache holds both the original and the gzipped versions of the payload, along with pre-computed ETags and other metadata, like the blob’s modification date.
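
As an illustration, deriving an ETag from a blob’s path and payload and pre-gzipping the payload might look something like this (simplified; the example path is made up and the exact hash construction in production differs in the details):

    package main

    import (
        "bytes"
        "compress/gzip"
        "crypto/sha256"
        "encoding/hex"
        "fmt"
    )

    // etagFor derives a strong ETag from a blob's path and payload.
    func etagFor(path string, payload []byte) string {
        h := sha256.New()
        h.Write([]byte(path))
        h.Write(payload)
        return `"` + hex.EncodeToString(h.Sum(nil)) + `"`
    }

    // gzipped pre-compresses a payload so the compressed copy can be
    // cached in memory alongside the original.
    func gzipped(payload []byte) []byte {
        var buf bytes.Buffer
        zw := gzip.NewWriter(&buf)
        zw.Write(payload)
        zw.Close()
        return buf.Bytes()
    }

    func main() {
        blob := []byte("body { color: #333; }") // an imaginary stylesheet
        fmt.Println(etagFor("/_version_1234/site.css", blob))
        fmt.Println(len(blob), "bytes raw,", len(gzipped(blob)), "bytes gzipped")
    }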

All of the remaining requests4 eventually make their way to PostgreSQL for content. As I said in a previous blog post, Hubnext stores all of its content in PostgreSQL, including binary blobs like images and PDFs. This isn’t a problem, because Hubnext communicates with its database over a local unix socket and most blobs are cached in memory anyway. This does prevent Hubnext from using sendfile to deliver blobs like filesystem-based web servers do, but there aren’t many large files hosted on RogerHub anyway.

At this point, nearly all requests have already been served by one of the previously mentioned handlers. But the majority of Hubnext’s web server code is dedicated to serving the remaining fraction. This includes archive pages, search results, new comment submissions, the RSS feed, and legacy redirect handlers. If all else fails, Hubnext gives up and returns a 404. The response is sent back to the load balancer over HTTP/1.1 with TLS5 and then forwarded on to the user. Thus completes the request.

Is the serving path shorter than you imagined? I think so6. More code sometimes just means additional maintenance burden and a greater attack surface. But altogether, Hubnext actually consists of far less code than the code it replaces. Furthermore, it’s nearly all code that I wrote and it’s all written in a single language as part of a single system. Say what you want about Go, but I think it’s the best language for implementing traditional CRUD services, especially those with a web component7. Hubnext is a distillation of what it means to deliver RogerHub reliably and efficiently to its audience, without the frills and patches of a traditional website. Anyway, I hope this post was a good distraction. Until next time!

  1. Plus, I happen to work for Google (but I didn’t when I first signed up with GCP), so I hear a lot about these services at company meetings. ↩︎
  2. If required, Hubnext could theoretically run on 10-20 instances with no problem. But the peer-to-peer syncing protocol is meant to be a fully connected graph, so at some point, it might run into issues with too much network traffic. ↩︎
  3. Hubnext also doesn’t use PostgreSQL’s backup tools, because Hubnext can create application-specific backups that are more convenient and understandable than traditional database backups. ↩︎
  4. Some pages simply aren’t cached, like search result pages and 404s. ↩︎
  5. Google HTTP(S) Load Balancing uses HTTP/1.1 exclusively when forwarding requests to backend servers, although it prefers HTTP/2 on the front end. ↩︎
  6. But for several years, RogerHub just ran as PHP code on cheap shared web hosts, which makes even this setup seem overly complicated. But it’s actually quite a bit of work to run and maintain a modern PHP environment. You’ll probably need a dozen C extensions and a reverse proxy and perhaps a suite of cron jobs to back up all your files and databases. For a long time, I followed a bunch of announcement mailing lists to get news about security fixes in PHP and WordPress and MediaWiki and NGINX and MySQL and all the miscellaneous packages on ubuntu-security-announce. This new serving path means I really only need to keep an eye out for security announcements for Go (which tend to be rare and fairly mild) or OpenSSH or maybe PostgreSQL. ↩︎
  7. After all, that covers a lot of what Google does. ↩︎

Something else

I talk a lot about one day giving up computers and moving out, far, far away to a place where there aren’t any people, in order to become a lumberjack or a farmer. Well lately, it’s been not so much “talking” as instant messaging or just mumbling to myself. Plus, I don’t have the physique to do either of those things. My palms are soft and I’m prone to scraping my arms all the time. I like discipline as a virtue, but I also don’t really like working. And finally, my hobby is proposing stupid outlandish half-joking-half-serious expensive irresponsible plans. It’s fun, and I guess you can’t really say something’s a bad idea until you’ve thought it through yourself.

Joking aside, my motivation comes from truth. Computers are terrible. It becomes more and more apparent to me every year that passes. I want to get far, far away from them. They’re the source of most of my problems1. And lately, it has become clear that lots of other people feel the same way.

Terribleness isn’t the computer’s fault. It all comes from a fundamental disconnect between what people think their computers are and the terrible reality of it all. A massive fraction of digital data is at constant risk of being lost forever, because it’s stored without redundancy on consumer devices scraped from the bottom of the barrel. Critical security vulnerabilities lurk in every computer system. Most of the time, they aren’t discovered simply because nobody has bothered looking hard enough. But given the lightest push, so much software2 just falls apart. You can break all kinds of software just by telling it to do two things at once. Load an article and swipe back at the same time? Bam! Your reddit app UI is now unusable. Submit the same web form from two different tabs? Ta-da! Your session state is now corrupt.

At this moment, my Comcast account is fucked beyond repair after a series of service cancellations, reinstatements, and billing credits put it in an unusual state. I needed to order internet service at my next apartment. I couldn’t do it on the website, probably because of the previous issues. I also didn’t think it was worthwhile explaining to yet another customer service rep why my account looked the way it did. So, I just made a new account and ordered it there3.

Boats at Fisherman's Wharf in Monterey Bay.

You can get rid of all these problems if you try hard enough. Some people simply don’t own any data. Their email and photos and documents and (nonexistent) repositories of old code just come and go along with the devices they’re stored on. They don’t have data at risk of compromise, because they don’t have any digital secrets and important computer systems (like their bank) just have humans double-checking things every step of the way.

Alternatively, you could put expiration dates on your data. Email keeps for 2 years. Photos: five. Sort your data into annual folders, and when the expiration date passes, simply delete it. This strategy takes the focus off of maintaining perfect data integrity and opens up a new method to measure your success. At the end of the year, if you never had trouble finding an old photo and your credit score was doing alright, then declare success.

Maybe you could live somewhere where people don’t really need computers. Maybe they have a phone that only makes calls and a nice big TV and maybe some books and paper and stuff to do outside. Or maybe they have iPads too—little TVs that you can touch.

Or you could simply come to terms with the way things are. After all, your data only needs to last as long as your own body does.

A rusty old tractor.

I stayed on a farm over Memorial Day weekend. It was owned by a couple. They had a tractor and some trucks and dogs and a jacuzzi and two little huts to rent out on Airbnb. I wonder if they had computers at their farm, or if they just had iPads.

Do they enjoy having a lot of land? Or is it wearisome waking up to the same rusting trucks and dirt roads every day? They probably have a lot of privacy, on account of their lack of upstairs, downstairs, and adjacent neighbors in their non-apartment home. Although, that’s probably less true if internet strangers are renting their cabins all the time.

They had rows of crops, like the ones you see along Highway 5 in central California. I wasn’t sure exactly who they belonged to, since their nearest neighbor was a solid 10-minute walk down the road. I wonder if they’re all friends.

My Subaru Impreza parked outside a cabin.

I spend a lot of my weekends working on my personal infrastructure. I can’t quite explain what it is I’m working on or why it needs work at all. It’s a combination of data management, software sandboxing, and build systems. At some point, I have to wonder if I’m spending more time working on my tools than actually using those tools to do stuff.

I also have a bunch of to-do’s on my Todoist account. The length of those lists has grown considerably since I graduated college a year ago4. I can’t believe it’s already been more than a year since then. How soon will it be two years? Did I live this past year right? Did I accomplish enough things?

Fields in Salinas.

I guess the answer really depends on how you measure “enough”. A year ago, I told myself that I wanted to become the best computer programmer that I could be. I feel like I’ve made lots of progress on that front. I picked up lots of good habits at work and my personal infrastructure is better than it’s ever been. I did a lot of cool stuff with RogerHub, and it finally has infrastructure that I can be proud of. But somehow, I’m also more dissatisfied with my computing than ever before.

I’m still living in the same area, and I’ve just committed to living here for yet another year. My plan had always been to stay in the Bay Area for three or four years, and then move away. But where to, and what for? Right now, I can’t even imagine what it’d be like to leave California5.

I’ve gotten better at cooking. More specifically, I think I’ve gotten more familiar with what kinds of food I like. On most days, I cook dinner for myself at home, and then I watch YouTube and browse the internet until I have to go to sleep. If it’s May or December, I also have to answer a few math questions.

But I haven’t made any new friends in the last year, other than my direct coworkers. I’ve spent more free time at home than ever before. I spend a small fortune on rent, after all. I might as well try to get some value out of it.

Bread and strawberries on a table.

Three years ago, I remember thinking that I didn’t know any adults who were actually happy, from which I concluded that growing up just kind of sucked.

I rarely drive anymore. I used to think driving at night on empty roads was nice and relaxing. But the roads around here just feel crowded and dangerous.

I’ll continue to work on computers, because that’s all I’m good at. More often than not, the last thought in my head before I fall asleep is an infinite grid of computers. On the bright side, I fixed my chair today and I bought a new toilet seat. As long as there’s a trickle of new purchases arriving at my doorstep, maybe things will be alright.

This post was kind of a downer. I’ll return with more website infrastructure next week6.

And since I haven’t posted a photo of me in a while, here’s a recent one:

Roger sitting by the ocean.

The photo on my home page doesn’t even look like me anymore. But I like it as an abstract icon, so it’ll probably stay there for a while, until I change it to something else.

  1. Which, I suppose, is pretty fortunate compared to the alternative. ↩︎
  2. You might blame software instead of the computer itself, but to most people, they’re the same thing. ↩︎
  3. I’m grateful I have such easy access to additional email addresses and phone numbers. ↩︎
  4. In fact, I only recently discovered that Todoist imposes a 200 task limit per project. ↩︎
  5. In any case, I realized this past weekend that I have way too many physical belongings to haul around, if I ever want to move. ↩︎
  6. I backdated this to July 31st, since that’s when I wrote most of the content. ↩︎

Training angst

Have you ever used Incognito Mode because you wanted to search for something weird, but you didn’t want it showing up in targeted ads? Or have you ever withheld a Like from a YouTube video, because although you enjoyed watching it, you weren’t really interested in being recommended more of the same? I have. And since I can’t hear you, I’ll assume you probably have too. People have gotten accustomed to the idea of “training” their computers to behave how they want, much like you’d train a dog or your nephew. And whether you study computer science or psychology or ecology or dog stuff, the principles of reinforcement learning are all about the same.

The reason you don’t search weird stuff while logged in or thumbs-up everything indiscriminately is that you’re trying to avoid setting the wrong example. But occasional slip-ups are a fact of life. To compensate, many machine learning models offer mechanisms to deal with erroneous (“noisy”) labels in training data. The constraints of a soft margin SVM include a slack term that represents a trade-off between classification accuracy and resilience against mislabeled examples. Because a computer can’t tell which of its training examples are mislabeled and which are simply unusual, it does the next best thing: each example can be rated based on how “believable” it is in comparison to other examples. Then, finding the optimal parameters is simply a matter of minimizing unbelievability.
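
For the curious, the textbook soft-margin SVM objective looks something like this, where each slack variable ξ_i measures how badly example i violates the margin and C sets the trade-off mentioned above:

    \min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^2 + C\sum_{i=1}^{n}\xi_i
    \quad\text{subject to}\quad y_i\,(w\cdot x_i + b) \ge 1-\xi_i,\quad \xi_i \ge 0

A large C says “trust the labels”, while a small C is more willing to write off unbelievable examples as noise.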

Avoiding bad examples is in your best interest if you want the algorithm to give you the best recommendations. So, your YouTube Liked Videos list is probably only a rough approximation of the videos that you actually like1. Now, a computer algorithm won’t mind if you lie (a little) to it. But the real tragedy is that as a side effect, YouTube has effectively trained you, the user, to give Likes not to the videos you actually like, but to the kinds of videos you want recommendations for. In fact, most kinds of behavioral training induce this kind of reverse effect. The trainer lies a little bit to the trainee, in order to push the training in the right direction. And in return, the trainer drifts a little farther from the truth.

Parents do this to their children. Friends do it to their friends. Even if you try to be honest, the words you say and the reactions you make end up deviating ever so slightly from the truth, because you can’t help but think that your actions will end up in somebody’s brain as subconscious behavioral training examples2. If your friend invites you to do something you don’t want to do, maybe you’ll say yes, or else they might not even ask next time. And if they say something you don’t like, maybe you’ll act angrier than you really are, so they won’t mention it ever again. Every decision starts with “how do I feel about this?”, but is quickly followed up with “how will others feel about my feeling about this?”. This isn’t plain old empathy. It’s true that human-to-human behavioral training helps people get along with each other. But when our words and actions are influenced by how we think they’ll affect someone else’s behavior, they end up being fundamentally just another form of lying. And unlike a computer recommendation algorithm, people might actually hate you for lying to them.

This has been weighing on my mind for a long time. I think it’s unbearably hard for adults to be emotionally honest with each other, even for close friends or family. But the problem isn’t with the words we say. Of course, people want others to think of them a certain way, whether it’s about your money or your job or your passions or romance or mental health. And people lie about those things all the time. That isn’t what keeps me up at night. What bothers me is that even when you’re trying your best to be absolutely honest with someone, you can’t. You say the right words, but they don’t sound right. You feel the right feelings, but your face isn’t cooperating. Your eyes get hazy from years of emotional cruft, and you’re no longer able to really see the person right in front of you. And it’s all because we spend every day training each other with almost-truths.

A flower.

  1. The Like button is actually short for the “recommend me more videos Like this one” button. ↩︎
  2. Have you watched Bojack Horseman? ↩︎

Data loss and you

My laptop’s hard drive crashed in 2012. I was on campus walking by Evans Hall when I took my recently-purchased Thinkpad x230 out of my backpack to look up a map (didn’t have a smartphone), only to realize it wouldn’t boot. This wasn’t a disaster by any means. It set me back $200 to rush-order a new 256GB Crucial M4 SSD. But since I regularly backed up my data to an old desktop running at my parents’ house, I was able to restore almost everything once I received it1.

I never figured out why my almost-new laptop’s hard drive stopped working out of the blue. The drive still spun up, yet the system didn’t detect it. But whether it was the connector or the circuit board, that isn’t the point. Hardware fails all the time for no reason2, and you should be prepared for when it happens.

Data management has changed a lot in the last ten years, primarily driven by the growing popularity of SaaS (”cloud”) storage and greatly improved network capacity. But one thing that hasn’t changed is that most people are still unprepared for hardware failure when it comes to their personal data. Humans start manufacturing data from the moment they’re born. Kids should really be taught data husbandry, just like they’re taught about taxes and college admissions and health stuff. But anyway, here are a few things I’ve learned about managing data that I want to share:

Identify what’s important

Data management doesn’t work if you don’t know what you’re managing. In other words, what data would make you sad if you lost access to it? Every day, your computer handles massive amounts of garbage data: website assets, Netflix videos, application logs, PDFs of academic research, etc. There’s also the data that you produce, but don’t intend to keep long-term: dash cam and surveillance footage (it’s too big), your computer settings (it’s easy to re-create), or your phone’s location history (it’s too much of a hassle to extract).

For most people, important data is the data that’s irreplaceable. It’s your photos, your notes and documents, your email, your tax forms, and (if you’re a programmer) your enormous collection of personal source code.

Consider the threats

It’s impossible to predict every possible bad thing that could happen to your data. But fortunately, you don’t have to! You can safely ignore all the potential data disasters that are significantly less likely to occur than your own untimely death3. That leaves behind a few possibilities, roughly in order of decreasing likelihood:

  • Hardware failure
  • Malicious data loss (somebody deletes your shit)
  • Accidental data loss (you delete your shit)
  • Data breach (somebody leaks your shit)
  • Undetected data degradation

Hardware failures are the easiest to understand. Hard drives (external hard drives included), solid state drives, USB thumb drives, and memory cards all have an approximate “lifespan”, after which they tend to fail catastrophically4. The rule of thumb is 3 years for external hard drives, 5 years for internal hard drives, and perhaps 10 years for enterprise-grade hard drives.

Malicious data loss has become much more common these days, with the rise of a digital extortion scheme known as “ransomware”. Ransomware encrypts user files on an infected machine, usually using public-key cryptography in at least one of the steps. The encryption is designed so that the infected computer can encrypt files easily, but is unable to reverse the encryption without the attacker’s cooperation (which is usually made available in exchange for a fee). Fortunately, ransomware is easily detectable, because the infected computer prompts you for money once the data loss is complete.

On the other hand, accidental data loss can occur without anybody noticing. If you’ve ever accidentally overwritten or deleted a file, you’ve experienced accidental data loss. Because it can take months or years before accidental data loss is noticed, simple backups are sometimes ineffective against it.

Data breaches are a unique kind of data loss, because they don’t necessarily mean you’ve lost access to the data yourself. Some kinds of data (passwords, tax documents, government identification cards) lose their value when they become available to attackers. So, your data management strategy should also identify whether some of your data is confidential.

Undetected data degradation (or “bit rot”) occurs when your data becomes corrupted (either by software bugs or by forces of nature) without you noticing. Modern disk controllers and file systems can provide some defense against bit rot (for example, in the case of bad sectors on a hard disk). But the possibility remains, and any good backup strategy needs a way to detect errors in the data (and also to fix them).

Things you can’t back up

Backups and redundancy are generally the solutions to data loss. But you should be aware that there are some things you simply can’t back up. For example:

  • Data you interact with, but can’t export. For example, your comments on social media would be difficult to back up.
  • Data that’s useless (or less useful) outside of the context of a SaaS application. For example, you can export your Google Docs as PDFs or Microsoft Word files, but then they’re no longer Google Docs.

Redundancy vs backup

Redundancy is buying 2 external hard drives, then saving your data to both. If either hard drive experiences a mechanical failure, you’ll still have a 2nd copy. But this isn’t a backup.

If you mistakenly overwrite or delete an important file on one hard drive, you’ll probably do the same on the other hard drive. In a sense, backups require the extra dimension of time. There needs to be either a time delay in when your data propagates to the backup copy, or better yet, your backup needs to maintain multiple versions of your data over time.

RAID and erasure coding both offer redundancy, but neither counts as a backup.

Backups vs archives

Backups are easier if you have less data. You can create archives of old data (simple ZIP archives will do) and back them up separately from your “live” data. Archives make your daily backups faster and also make it easier to perform data scrubbing.

When you’re archiving data, you should pick an archive format that will still be readable in 30 to 50 years. Proprietary and non-standard archive tools might fall out of popularity and become totally unusable in just 10 or 15 years.

Data scrubbing

One way to protect against bit rot is to periodically check your data against known-good checksums. For example, if you store cryptographic checksums with your files (and also digitally sign the checksums), you can verify the checksums at any time and detect bit rot. Make sure you have redundant copies of your data, so that you can restore corrupted files if you detect errors.

I generate SHA1 checksums for my archives and sign the checksums with my GPG key.
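
If you want to roll something like this yourself, a checksum manifest is just a directory walk and a hash away. Here’s a rough Go sketch (not my actual tooling) that prints sha1sum-style lines you can save, sign, and diff against a future run. SHA-1 is fine for catching accidental corruption, and the signature is what guards against tampering.

    package main

    import (
        "bufio"
        "crypto/sha1"
        "fmt"
        "io"
        "os"
        "path/filepath"
    )

    // writeManifest walks dir and emits "<sha1>  <path>" lines,
    // similar in spirit to sha1sum's output format.
    func writeManifest(dir string, out io.Writer) error {
        return filepath.Walk(dir, func(path string, info os.FileInfo, err error) error {
            if err != nil || info.IsDir() {
                return err
            }
            f, err := os.Open(path)
            if err != nil {
                return err
            }
            defer f.Close()
            h := sha1.New()
            if _, err := io.Copy(h, f); err != nil {
                return err
            }
            fmt.Fprintf(out, "%x  %s\n", h.Sum(nil), path)
            return nil
        })
    }

    func main() {
        w := bufio.NewWriter(os.Stdout)
        err := writeManifest(".", w)
        w.Flush()
        if err != nil {
            fmt.Fprintln(os.Stderr, err)
            os.Exit(1)
        }
    }

Then something like gpg --detach-sign on the manifest gives you a signature to verify later.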

Failure domain

If your backup solution is 2 copies on the same hard drive, or 2 hard drives in the same computer, or 2 computers in the same house, then you’re consolidating your failure domain. If your computer experiences an electrical fire or your house burns down, then you’ve just lost all copies of your data.

Onsite vs offsite backups

Most people keep all their data within a 20 meter radius of their primary desktop computer. If all of your backups are onsite (e.g. in your home), then a physical disaster could eliminate all of the copies. The solution is to use offsite backups, either by using cloud storage (easy) or by stashing your backups at a friend’s house (pain in the SaaS).

Online vs offline backups

If a malicious attacker gains access to your system, they can delete your data. But they can also delete any cloud backups5 and external hard drive backups that are accessible from your computer. It’s sometimes useful to keep backups of your data that aren’t immediately deletable, either because they’re powered off (like an unplugged external hard drive) or because they’re read-only media (like data backups on Blu-ray Discs).

Encryption

You can reduce your risk of data leaks by applying encryption to your data. Good encryption schemes are automatic (you shouldn’t need to encrypt each file manually) and thoroughly audited by the infosec community. And while you’re at it, you should make use of your operating system’s full disk encryption capabilities (FileVault on macOS, BitLocker on Windows, and LUKS or whatever on Linux).

Encrypting your backups also means that you could lose access to them if you lose your encryption credentials. So, make sure you understand how to recover your encryption credentials, even if your computer is destroyed.

Online account security

If you’re considering cloud backups, you should also take steps to strengthen the security of your account:

  • Use a long password, and don’t re-use a password you’ve used on a different website.
  • Consider using a passphrase (a regular English sentence containing at least 4-5 uncommon words). Don’t use similar passphrases for multiple services (like “my facebook password”), because an attacker with access to the plaintext can easily guess the scheme.
  • Turn on two-factor authentication. The most common 2FA scheme (TOTP) requires you to type in a 6-8 digit code whenever you log in. You should prefer to use a mobile app (I recommend Authy) to generate the code, rather than to receive the code via SMS. Don’t forget to generate backup codes and store them in a physically secure top-secret location (e.g. underneath the kitchen sink).
  • If you’re asked to set security questions, don’t use real answers (they’re too easy to guess). Make up gibberish answers and write them down somewhere (preferably a password manager).
  • If your account password can be recovered via email, make sure your email account is also secure.

Capacity vs throughput

One strong disadvantage of cloud backups is that transfers are limited to the speed of your home internet, especially for large uploads. Backups are less useful when they take days or weeks to restore, so be aware of how your backup throughput affects your data management strategy.

This problem also applies to high-capacity microSD cards and hard drives. It can take several days to fully read or write a 10TB data archival hard drive. Sometimes, smaller but faster solid state drives are well worth the investment.

File system features

Most people think of backups as “copies of their files”. But the precise definition of a “file” has evolved rapidly just as computers have. File systems have become very complex to meet the increasing demands of modern computer applications. But the truth remains that most programs (and most users) don’t care about most of those features.

For most people, “files” refers to (1) the directory-file tree and (2) the bytes contained in each file. Some people also care about file modification times. If you’re a computer programmer, you probably care about file permission bits (perhaps just the executable bit) and maybe symbolic links.

But consider this (non-exhaustive) list of filesystem features, and whether you think they need to be part of your data backups:

  • Capitalization of file and directory names
  • File owner (uid/gid) and permission bits, including SUID and sticky bits
  • File ACLs, especially in an enterprise environment
  • File access time, modification time, and creation time
  • Extended attributes (web quarantine, Finder.app tags, “hidden”, and “locked”)
  • Resource forks, on macOS computers
  • Non-regular files (sockets, pipes, character/block devices)
  • Hard links (also “aliases” or “junctions”)
  • Executable capabilities (maybe just CAP_NET_BIND_SERVICE?)

If your answer is no, no, no, no, no, what?, no, no, and no, then great! The majority of cloud storage tools will work just fine for you. But the unfortunate truth is that most computer programmers are completely unaware of many of these file system features. So, they write software that completely ignores them.

Programs and settings

Programs and settings are often left out of backup schemes. Most people don’t have a problem reconfiguring their computer once in a while, because catastrophic failures are unlikely. If you’re interested in creating backups of your programs, consider finding a package manager for your preferred operating system. Computer settings can usually be backed up with a combination of group policy magic for Windows and config files or /usr/bin/defaults for macOS.

Application-specific backup

If you’re backing up data for an application that uses a database or a complex file-system hierarchy, then you might be better served by a backup system that’s designed specifically for that application. For example, RogerHub runs on a PostgreSQL database, which comes with its own backup tools. But RogerHub uses an application-specific backup scheme tailored to its own data instead.

Testing

A backup isn’t a backup until you’ve tested the restoration process.

Recommendations

If you’ve just skipped to the end to read my recommendations, fantastic! You’re in great company. Here’s what I suggest for most people:

  • Use cloud services instead of files, to whatever extent you feel comfortable with. It’s most likely not worth your time to back up email or photos, since you could use Google Inbox or Google Photos instead.
  • Create backups of your files regularly, using the 3-2-1 rule: 3 copies of your data, on 2 different types of media, with at least 1 offsite backup. For example, keep your data on your computer. Then, back it up to an online cloud storage or cloud backup service. Finally, back up your data periodically to an external hard drive.
  • Don’t trust physical hardware. It doesn’t matter how much you paid for it. It doesn’t matter if it’s brand new or if you got the most advanced model. Hardware breaks all the time in the most unpredictable ways.
  • Don’t buy an external hard drive or a NAS as your primary backup destination. They’re probably no more reliable than your own computer.
  • Make sure to use full-disk encryption and encrypted backups.
  • Make sure nobody can maliciously (or accidentally) delete all of your backups, simply by compromising your primary computer.
  • Consider making archives of data that you use infrequently and no longer intend to modify.
  • Secure your online accounts (see the section titled “Online account security”).
  • Pat yourself on the back and take a break once in a while. Data management is hard stuff!

If you find any mistakes on this page, let me know. I want to keep it somewhat updated.

And, here’s yet another photo:

Branches.

  1. My laptop contained the only copy of my finished yet unsubmitted class project. But technically I had a project partner. We didn’t actually work together on projects. We both finished each project independently, then just picked one version to submit. ↩︎
  2. About four and a half years later, that M4 stopped working and I ordered an MX300 to replace it. ↩︎
  3. That is, unless you’re interested in leaving behind a postmortem legacy. ↩︎
  4. There are other modes of failure other than total catastrophic failure. ↩︎
  5. Technically, most reputable cloud storage companies will keep your data for some time even after you delete it. If you really wanted to, you could explain the situation to your cloud provider, and they’ll probably be able to recover your cloud backups. ↩︎

Life lessons from artificial intelligence

If you speak to enough software engineers, you’ll realize that many of them can’t understand some everyday ideas without using computer metaphors. They say “context switching” to explain why it’s hard to work with interruptions and distractions. Empathy is essentially machine virtualization, but applied to other people’s brains. Practicing a skill is basically feedback-directed optimization. Motion sickness is just your video processor overheating, and so on.

A few years ago, I thought I was the only one whose brain used “computer” as its native language. And at the time, I considered this a major problem. I remember one summer afternoon, I was playing scrabble with some friends at my parents’ house. At that time, I had just finished an internship, where day-to-night I didn’t have much to think about other than computers. And as I stared at my scrabble tiles, I realized the only word I could think of was EEPROM1.

It was time to fix things. I started reading more. I’ve carried a Kindle in my backpack since I got my first Kindle2 in high school, but I haven’t always used it regularly. It’s loaded with a bunch of novels. I don’t like non-fiction, especially the popular non-fiction about famous politicians and the economy and how to manage your time. It seems like a waste of time to read about reality, when make-believe is so much more interesting.

I also started watching more anime. I especially like the ones where the main character has a professional skill and that skill becomes an inextricable part of their personal identity3. During my last semester in college, I thought really hard about whether I really wanted to just be a computer programmer until I die, or whether I simply had no other choice, because I wasn’t good at anything else. And so, I watched Hibike! Euphonium obsessively, searching for answers.

Devoting your life to a skill can be frustrating. It makes you wonder if you’d be a completely different person if that part of you were suddenly ripped away. And then there’s the creeping realization that your childhood passion is slowly turning into yet another boring adult job. It’s like when you’re a kid and you want to be the strongest ninja in your village, but then you grow up and start working as a mercenary. You can still do ninja stuff all day, but it’s just not fun anymore.

But I like those shows because it’s inspiring and refreshing to watch characters who really care about being really good at something, as long as that something isn’t just “make a ton of money”. I think it’s important to have passion and a competitive spirit for at least one thing. It’s no fun being just mediocre at a bunch of things. Plus, being good at something gives you a unique perspective on the world, and that perspective comes with insights worth sharing.

I thought a lot about Q-learning during the months after my car accident. I think normal people are generally unprepared to respond rationally in crisis situations. And that’s at least partially because most of us haven’t spent enough time evaluating the relative cost of all the different terrible things that might happen to us on a day-to-day basis. Q-learning is a technique for decision-making that works by estimating the expected value of taking an action in a particular state. For those estimates to mean anything, you need a sense of both the state transitions (what could happen if I take this action?) and a cost for each of the outcomes. If you understand the transitions, but all of your costs are just “really bad, don’t let that happen”, then in a pinch, it becomes difficult to decide which bad outcome is the least terrible.
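
For reference, the classic Q-learning update looks like this, where r is the reward (or negative cost) you just experienced, α is the learning rate, and γ discounts the future:

    Q(s, a) \leftarrow Q(s, a) + \alpha\left[\, r + \gamma \max_{a'} Q(s', a') - Q(s, a) \,\right]

Acting well is then just a matter of picking the action with the best Q value. But if every bad outcome is scored identically as “really bad”, the max can’t tell a dented bumper from a hospital visit, which is the whole problem.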

There are little nuggets of philosophy embedded all over the fields of artificial intelligence and machine learning. I skipped a lot of class in college, but I never skipped my introductory AI and ML classes. It turns out that machine learning and human learning have a lot in common. Here are some more ideas, inspired by artificial intelligence:

I try to spend as little time as possible shopping around before buying something, and that’s partially because of what’s called the Optimizer’s Curse4. The idea goes like this: Before buying something, you usually look at all your options and pick the best one. Since people aren’t perfect, sometimes you overestimate or underestimate how good your options are. The more options you consider, the higher the probability that the perceived value of your best option will be much greater than its actual value. Then, you end up feeling disappointed, because you bought something that isn’t as good as you thought it’d be.

Now that doesn’t mean you should just buy the first thing you see, since your first option might turn out to be really shitty. But if you’re reasonably satisfied with your options, it’s probably best to stop looking and just make your choice.

But artificial intelligence also tells us that it’s not smart to always pick the best option. Stochastic optimization methods are based on the idea that, sometimes, you should take suboptimal actions just to experience the possibilities. Humans call this “stepping out of your comfort zone”. Machines need to strike a balance between “exploration” (trying out different options to see what happens) and “exploitation” (using experience to make good decisions) in order to succeed in the long run. This balance is controlled by the “exploration rate”, and a good exploration rate decreases over time. In other words, young people are supposed to make poor decisions and try new things, but once you get old, you should settle down5.

The difference in cumulative value resulting from sub-optimal decisions is known as “regret”. In the long run, machines should learn the optimal policy for decision-making. But machines should also try to reach this optimum with as little regret as possible. This is accomplished by carefully tuning how quickly the exploration rate decays.

So is it wrong for parents to make all of their children’s decisions? A little guidance is probably valuable, but too little exploration converges to a suboptimal long-term policy6. I suppose kids should act like kids, and if they scrape their knees and do stupid stuff and get in trouble, that’s probably okay.

Anyway, there’s one more artificial intelligence technique that I don’t understand too well, but it comes with interesting implications for humans. It’s a technique for path planning applied to finite-horizon LQR problems, which are a type of problem where the system mechanics can be described linearly and the cost function is quadratic in the state. These restrictions yield a formulation that lets us compute a policy that is independent of the initial state of the system. In other words, the machine plans a path by starting at the goal, then working backward to determine what leads up to that goal.
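
Concretely (and loosely, since I’m hardly an expert either), the finite-horizon LQR setup has linear dynamics and a quadratic cost:

    x_{t+1} = A x_t + B u_t, \qquad
    J = x_N^\top Q_f\, x_N + \sum_{t=0}^{N-1} \left( x_t^\top Q\, x_t + u_t^\top R\, u_t \right)

and the optimal feedback gains come from a backward sweep that starts at the end of the horizon:

    P_N = Q_f, \qquad
    K_t = (R + B^\top P_{t+1} B)^{-1} B^\top P_{t+1} A, \qquad
    P_t = Q + A^\top P_{t+1} (A - B K_t)

The optimal action is u_t = -K_t x_t. The gains are computed entirely from the goal backward, and your current state only enters when it’s time to act.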

The same policy can be applied no matter your goal (”terminal condition”), because all the mechanics of the system are encoded in the policy. For example, if your goal is to build rockets at NASA, then it’s useful to consider what needs to happen one day, one month, or even one year before your dream comes true. The policy becomes less and less useful when the distance to your goal increases, but by working backward far enough, you can figure out what to do tomorrow to take the first step.

And if your plans don’t work out, well don’t worry, because the policy is independent of the state of the system. You can reevaluate your trajectory at any point to put yourself back on the right track7.

I miss learning signal processing and computer graphics and machine learning and all of these classes with a lot of math in them. I work on infrastructure and networking at work, which is supposedly my specialization. But I also feel like I’m missing out on a lot of great stuff that I used to be interested in. The math-heavy computer science courses always felt a little more legit. I always imagined college to be a lot of handwriting and equations and stuff. Maybe I’ll pick up another side project for this stuff soon.

And here’s a photo of the hard disk from my first laptop:

A hard disk lying on some leaves.

It died less than a month after I got the laptop. After that, I started backing up my data more religiously. Plus, I replaced the spinning rust with a new Crucial M4 and that lasted for about 4.5 years until it broke too. I still kept this hard drive chassis and platter, because it looks cool.

  1. Acronyms aren’t allowed anyway. ↩︎
  2. My first Kindle was a 3rd generation Kindle Keyboard. When I broke that one, I bought another Kindle Keyboard even though a newer model had been released. I didn’t want my parents to notice I had broken my Kindle so soon after I got it, so I hid the old Kindle in a manila envelope and used its adopted brother instead. Three years later, I upgraded to the Paperwhite, and that’s still in my backpack today. ↩︎
  3. See this or this. ↩︎
  4. But also partially because I’m a lazy bastard. ↩︎
  5. And yet, I haven’t left my apartment all weekend. ↩︎
  6. PAaaS: parenting advice as a service. ↩︎
  7. On second thought, this doesn’t have much to do with artificial intelligence. ↩︎

The data model of Hubnext

I got my first computer when I was 8. It was made out of this beige-white plastic and ran a (possibly bootlegged) copy of Windows ME1. Since our house had recently gotten DSL installed, the internet could be on 24 hours a day without tying up the phone line. But I didn’t care about that. I was perfectly content browsing through each of the menus in Control Panel and rearranging the files in My Documents. As long as I was in front of a computer screen, I felt like I was in my element and everything was going to be alright.

Computers have come a long way. Today, you can rent jiggabytes of data storage for literally pennies per month (and yet iPhone users still constantly run out of space to save photos). For most people living in advanced capitalist societies, storage capacity has been permanently eliminated as a reason why you might consider deleting any data at all. For people working in tech, there’s a mindset known as “big data”, where businesses blindly hoard all of their data in the hope that some of it will become useful at some time in the future.

On the other hand, I’m a fan of “small data”. It’s the realization that, for many practical applications, the amount of useful data you have is dwarfed by the overwhelming computing and storage capacity of modern computers. It really doesn’t matter how inefficient or primitive your programs are, and that opens up a world of opportunities for most folks to do ridiculous, audacious things with their data2.

When RogerHub ran on WordPress, I set up master-slave database and filesystem replication for my primary and replica web backends. WordPress needs to support all kinds of ancient shared hosting environments, so WordPress core makes very few assumptions about its operating environment. But WordPress plugins, on the other hand, typically make a lot of assumptions about what kinds of things the web server is allowed to do3. So the only way to really run WordPress in a highly-available configuration is to treat it like a black box and try your best to synchronize the database and filesystem underneath it.

RogerHub has no need for all of that complexity. RogerHub is small data. Its 38,000 comments could fit in the system memory of my first cellphone4 and the blobs could easily fit in the included external microSD card. But perhaps more important than the size of the data is how simple RogerHub’s dataset is.

Database replication comes with its own complexities, because it assumes you actually need transaction semantics5. Filesystem replication is mostly a crapshoot with no meaningful conflict resolution strategy for applications that use disk like a lock server. But RogerHub really only collects one kind of data: comments. The nice thing about my comments is that they have no relationship to each other. You can’t reply directly to other comments. Adding a new comment is as simple as inserting it in chronological order. So theoretically, all of this conflict resolution mumbo jumbo should be completely unnecessary.

I call the new version of RogerHub “hubnext” internally6. Hubnext stores all kinds of data: comments, pages, templates7, blobs8, and even internal data, like custom redirects and web certificates. Altogether, these different kinds of data are just called “Things”.

One special feature of hubnext is that you can’t modify or delete a Thing once it has been created (i.e. it’s an append-only data store). This property makes it really easy to synchronize multiple sets of Things on different servers, since each replica of the hubnext software just needs to figure out which of its Things the other replicas don’t have. To make synchronization easier, each Thing is given a unique identifier, so hubnext replicas can talk about their Things by just using their IDs.

Each hubnext replica keeps a list of all known Thing IDs in memory. It also keeps a rolling set hash of the IDs. It needs to be a rolling hash, so that it’s fast to compute H(n_1, n_2, …, n_k, n_{k+1}), given H(n_1, n_2, …, n_k) and n_{k+1}. And it needs to be a set hash, so that the order of the elements doesn’t matter. When a new ID is added to the list of Thing IDs, the hubnext replica computes the updated hash, but it also remembers the old hash, as well as the ID that triggered the change. By remembering the last N old hashes and the corresponding Thing IDs, hubnext builds a “trail of breadcrumbs” of the most recently added IDs. When a hubnext replica wants to sync with a peer, it sends its latest N hashes through a secure channel. The peer searches for the most recent matching hash that’s in both the requester’s hashes and the peer’s own latest N hashes. If a match is found, then the peer can use its breadcrumbs to generate a “delta” of newly added IDs and return it to the requester. And if a match isn’t found, the default behavior is to assume the delta should include the entire set of all Thing IDs.
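
For illustration, one simple construction with both properties (not necessarily the exact one in production, but the same idea) is to hash each ID individually and fold the results together with XOR. A sketch in Go:

    package main

    import (
        "crypto/sha256"
        "encoding/binary"
        "fmt"
    )

    // setHash is an order-independent digest of a set of Thing IDs that
    // can be updated one element at a time.
    type setHash [sha256.Size]byte

    // add folds one ID into the digest. XOR is commutative, so the
    // insertion order doesn't matter.
    func (s *setHash) add(id uint64) {
        var buf [8]byte
        binary.BigEndian.PutUint64(buf[:], id)
        h := sha256.Sum256(buf[:])
        for i := range s {
            s[i] ^= h[i]
        }
    }

    func main() {
        var a, b setHash
        for _, id := range []uint64{1, 2, 3} {
            a.add(id)
        }
        for _, id := range []uint64{3, 1, 2} { // same set, different order
            b.add(id)
        }
        fmt.Println(a == b) // true
    }

Each time an ID is folded in, a replica records the new hash alongside that ID as a breadcrumb, so a peer that presents a recent hash can be answered with just the IDs added since then.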

This algorithm runs periodically on all hubnext replicas. It’s optimized for the most common case, where all replicas have identical sets of Thing IDs, but it also works well for highly unusual cases (for example, when a new hubnext replica joins the cluster). But most of the time, this algorithm is completely unnecessary. Most writes (like new comments, new blog posts, etc) are synchronously pushed to all replicas simultaneously, so they become visible to all users globally without any delay. The synchronization algorithm is mostly for bootstrapping a new replica or catching up after some network/host downtime.

To make sure that every Thing has a unique ID, the cluster also runs a separate algorithm to allocate chunks of IDs to each hubnext replica. The ID allocation algorithm is an optimistic majority consensus one-phase commit with randomized exponential backoff. When a hubnext replica needs a chunk of new IDs, it proposes a desired ID range to each of its peers. If more than half of the peers accept the allocation, then hubnext adds the range to its pool of available IDs. If the peers reject the allocation, then hubnext just waits a while and tries again. Hubnext doesn’t make an attempt to release partially allocated IDs, because collisions are rare and we can afford to be wasteful. To decide whether to accept or reject an allocation, each peer only needs to keep track of one 64-bit ID, representing the largest known allocated ID. And to make the algorithm more efficient, rejections will include the largest known allocated ID as a “hint” for the requester.
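
Stripped of the network, the retries, and the randomized backoff, the core of an allocation round might look something like this (a toy sketch of the idea, not the real protocol):

    package main

    import "fmt"

    // peer only needs to remember the largest allocated ID it has accepted.
    type peer struct{ maxAllocated uint64 }

    // propose asks a peer to accept the range [start, end). The peer accepts
    // only ranges above its high-water mark, and returns that mark as a hint.
    func (p *peer) propose(start, end uint64) (ok bool, hint uint64) {
        if start >= p.maxAllocated {
            p.maxAllocated = end
            return true, p.maxAllocated
        }
        return false, p.maxAllocated
    }

    // allocate claims [start, start+size) if more than half the peers accept.
    // Partially accepted ranges are simply wasted, which is fine because
    // 64-bit IDs are cheap.
    func allocate(peers []*peer, start, size uint64) (hint uint64, ok bool) {
        end := start + size
        accepted := 0
        for _, p := range peers {
            pok, phint := p.propose(start, end)
            if pok {
                accepted++
            }
            if phint > hint {
                hint = phint
            }
        }
        return hint, accepted*2 > len(peers)
    }

    func main() {
        peers := []*peer{{}, {}, {}}
        hint, ok := allocate(peers, 0, 1000)
        fmt.Println(ok, "next proposal should start at or above", hint)
    }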

There are some obvious problems with using an append-only set to serve website content directly. To address these issues, each Thing type contains (1) a “last modified” timestamp and (2) some unique identifier that links together multiple versions of the same thing. For blobs and pages, the identifier is the canonicalized URL. For templates, it’s the template’s name. For comments, it’s the Thing ID of the first version of the comment. When the website needs to fetch some content, it only considers the Thing with the latest “last modified” timestamp among all the Things that share the same identifier.
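
In code, resolving the current version of something is just a scan for the newest Thing with a matching identifier (again, a simplified sketch with a made-up example; the real thing leans on database indexes):

    package main

    import (
        "fmt"
        "time"
    )

    // thing is a trimmed-down stand-in for a Hubnext record.
    type thing struct {
        id         uint64
        identifier string // canonical URL, template name, or first version's Thing ID
        modified   time.Time
        payload    string
    }

    // latest picks the Thing with the newest "last modified" timestamp
    // among all Things that share the given identifier.
    func latest(things []thing, identifier string) (thing, bool) {
        var best thing
        found := false
        for _, t := range things {
            if t.identifier != identifier {
                continue
            }
            if !found || t.modified.After(best.modified) {
                best, found = t, true
            }
        }
        return best, found
    }

    func main() {
        now := time.Now()
        things := []thing{
            {id: 1, identifier: "/about/", modified: now.Add(-time.Hour), payload: "old"},
            {id: 7, identifier: "/about/", modified: now, payload: "new"},
        }
        cur, _ := latest(things, "/about/")
        fmt.Println(cur.payload) // prints "new"
    }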

Overall, I’m really satisfied with how this data storage model turned out. It makes a lot of things easier, like website backups, importing/exporting data, and publishing new website content. I intentionally glossed over the database indexing magic that makes all of this somewhat efficient, but that’s nonetheless present. There’s also an in-memory caching layer for the most commonly-requested content (like static versions of popular web pages and assets). Plus, there’s some Google Cloud CDN magic in the mix too.

It’s somewhat unusual to store static assets (like images and javascript) in a relational database. The only reason why I can get away with it is because RogerHub is small data. The only user-produced content is plaintext comments, and I don’t upload nearly enough images to fill up even the smallest GCE instances.

Anyway, have a nice Friday. If I find another interesting topic about Hubnext, I’ll probably write another blog post like this one soon.

A bridge in Kamikochi, Japan.

  1. But not for long, because I found install disks for Windows 2000 and XP in the garage and decided to install those. ↩︎
  2. I once made a project grading system for a class I TA’ed in college. It ran on a SQLite database with a single global database lock, because that was plenty fast for everybody. ↩︎
  3. Things like writing to any location in the web root and assuming that filesystem locks are real global locks. ↩︎
  4. a Nokia 5300 with 32MB of internal flash ↩︎
  5. I’ve never actually seen any WordPress code try to use a transaction. ↩︎
  6. Does “internally” even mean anything if it’s just me? ↩︎
  7. Templates determine how different pages look and feel. ↩︎
  8. Images, stylesheets, etc. ↩︎

What’s “next” for RogerHub

Did I intentionally use 3 different smart quotes in the title? You bet I did! But did it require a few trips to fileformat.info and some Python to figure out what the proper octal escape sequences are? As a matter of fact, yes. Yes it did. And if you’re wondering, they’re \342\200\231, \342\200\234, and \342\200\2351.

The last time I rewrote RogerHub.com was in November of 2010, more than 6 years ago. Before that, I was using this PHP/MySQL blogging software that I wrote myself. RogerHub ran on cheap shared hosting that cost $44 USD per year. I moved the site to WordPress because I was tired of writing basic features (RSS feeds, caching, comments, etc.) myself. The whole migration process took about a week. That includes translating my blog theme to WordPress, exporting all my articles2, and setting up WordPress via 2000s-era web control panels and FTP.

Maybe it’s that time again? The time when I’m unhappy with my website and need to do something drastic to change things up.

To be fair, my “personal blog” doesn’t really feel like a blog anymore. Since RogerHub now gets anywhere between 2^17 and 2^21 visitors per month, it demands a lot more of my attention than a personal blog really should. During final exam season, I log onto my website every night to collect my reward: a day’s worth of final exam questions and outdated memes3. Meanwhile, I wrote 3 blog posts last year and just 1 the year before that.

I want to take back my blog. And I want to strategically reduce the amount of time I spend managing the comments section without eliminating them altogether. Lately I’ve been too scared to make changes to my blog, because of how it might break other parts of the site. On top of that, I have to build everything within the framework of WordPress, an enormous piece of software written by strangers in a language that gives me no pleasure to use. I miss when it didn’t matter if I broke everything for a few hours, because I was editing my site directly in production over FTP. And every time WordPress announces a new vulnerability in some JSON API or media attachments (all features that I don’t use), I miss running a website where I owned all of the code.

So on nights and weekends over the last 5 months, I’ve been working on a complete rewrite of RogerHub from the ground up. And you’re looking at it right now.

Why does it look exactly the same as before? Well, I lied. I didn’t rewrite the frontend or any of the website’s pages. But all the stuff under the hood that’s responsible for delivering this website to your eyeballs has been replaced with entirely new code4.

The rewrite replaces WordPress, NGINX, HHVM, Puppet, MySQL, and all the miscellaneous Python and Bash scripts that I used to maintain the website. RogerHub is now just a single Go program, running on 3 GCE instances, each with a PostgreSQL database, fronted by Google Cloud Load Balancer.

Although this website looks the same, I’ve made a ton of improvements behind the scenes that’ll make it easier for me to add features with confidence and reduce the amount of toil I perform to maintain the site. I’ll probably write more about the specifics of what’s new, but one of the most important things is that I can now easily run a local version of RogerHub in my apartment to test out new changes before pushing them live5. I’ve also greatly improved my rollout and rollback processes for new website code and configuration.

Does this mean I’ll start writing blogs again? Sure, probably.

I’m not done with the changes. I’ve only just finished the features that I thought were mandatory before I could migrate the live site over to the new system. I performed the migration last night and I’ve been working on post-migration fixes and cleanup all day today. It’s getting late, so I should just finish this post and go to sleep. But I’ll leave you with this nice photo. I used to end these posts with funny comics and reddit screencaps.

Tree branches and flowers in the fog.

It’s a little wider than usual, because I’m adding new features, and this is the first one.

  1. Two TODOs for me: memorize those escape codes and add support for automatic smart quotes in post titles ↩︎
  2. I used Google Sheets to template a long list of SQL queries, based on a phpMyAdmin dump that I copied and pasted into a spreadsheet. Then, I copied those SQL queries back into phpMyAdmin to import everything into WordPress. ↩︎
  3. By my count, I’ve answered more than 5,000 questions so far. The same $44 annual website fee is enough to run 2017’s RogerHub.com for about 2 weeks. ↩︎
  4. And that’s a big deal, I swear! ↩︎
  5. Gee, it’s 2017. Who would have thought that I still tested new code in production? ↩︎