The serving path of Hubnext

I’ve always liked things straight and direct to the point. But I feel a bit silly writing that here, since most of my blog posts from high school used a bunch of long words for no reason. My writing did eventually become more concise, but it happened around the time I also sort of stopped blogging. I stopped writing altogether for a few years, but then in college, I started doing an entirely different kind of writing. I wrote homework assignments and class announcements for the computer science classes as a teaching assistant. Technical writing is different from blogging, but writing for a student audience also has its own unique challenges. The gist of it is that students don’t like reading, which is naturally at odds with the long detailed project specs that some of our projects require. It’s the instructor’s responsibility to make sure the important details stand out from the boring reference material.

In college, I also started eating lunch and dinner on the go regularly, which is something I had never really done before. It was a combination of factors that made walk-eat meals so enticing. The campus was big, so getting from place to place usually involved a bit of walking. Street level restaurants sold fast food in convenient form factors. Erratic scheduling meant it was unusual to be with another person for more than a couple hours at a time, so sit-down meals didn’t always make sense. I got really good at the different styles of walk-eating, from gnawing on burritos without spilling to balancing take-out boxes between my fingers. Eating on my feet felt like the most distilled honest form of feeding. It didn’t make sense to add so many extra steps to accomplish essentially the same task.

For a long time, the serving path of this website was too complicated. The serving path is all the things directly involved in delivering web pages to visitors, which can be contrasted with auxiliary systems that might be important, but aren’t immediately required for the website to work. I made my first website more than 12 years ago. Lots of things have changed since then, both with me and with the ecosystem of web development in general. So when I set out to reimplement RogerHub, I wanted to eliminate or replace as many parts of the serving path as I could.

Mission beach in San Francisco.

Let’s start with the basics. Most of the underlying infrastructure for RogerHub hasn’t changed since I moved the website from Linode to Google Cloud Platform in January 2016. I use Google’s Cloud DNS and HTTP(S) Load Balancing to initially capture inbound traffic. Google’s layer 7 load balancing provides global anycast, geographic load balancing, TLS termination, health checking, load scaling, and HTTP caching, among other things. It’s the best offering available at its low price point, so there’s little reason to consider alternatives1. However, HTTP(S) Load Balancing did have a 2 hour partial outage last October. I don’t get much traffic in October, but the incident made me start thinking of ways I could mitigate a similar outage in the future.

Behind the load balancer, RogerHub is served by a number of functionally identical backend servers (currently 2). These servers are fully capable of serving user traffic directly, but they currently only accept traffic from Google’s designated load balancer IP ranges. During peak seasons, I usually scale this number up to 3, but the extra server costs me about $50 a month, so I usually prefer to run just 2. These extra servers exist entirely for redundancy, not capacity. A single server could serve a hundred times peak load with no problem2.

Each server is a plain old Ubuntu host running an instance of PostgreSQL and an instance of the Hubnext application. That’s it. There’s no reverse proxy, no memcached, no mail server, and not even a provisioner (Hubnext knows how to install and configure its own dependencies). Hubnext itself can run in multiple modes, most of which are for maintenance and debugging. But my web servers start Hubnext in “application” mode. When Hubnext runs in application mode, it starts an ensemble of concurrent tasks, only one of which is the web server. The others are things like a unique ID allocator, peer to peer data syncing, system health checkers, maintenance daemons, and half a dozen kinds of in-memory caches. Hubnext can renew its own TLS certificate, create backups of its data, and even page me on my iPhone when things go wrong. Before Hubnext, these tasks were handled by haphazard cron jobs and independent services, which were set up and configured by a Puppet provisioner. Keeping all these tasks in a single application has made developing new features a lot simpler.

So why does Hubnext still need PostgreSQL? It’s true that Hubnext could have simply kept track of its own data, along with maintaining the various indexes that make common operations fast. But it’s an awful lot of work and unneeded complexity to implement something that a database already does for free. Of all the components of a traditional website’s architecture, I chose to keep the database, because I think PostgreSQL pulls its weight more than any of the other systems that Hubnext does supplant. That being said, Hubnext intentionally doesn’t use the transactions or replication that PostgreSQL provides (i.e. the parts of a database most sensitive to failure). Instead, Hubnext’s data model is designed to work without multi-statement transactions and Hubnext performs its own application level data replication, which is probably easier to configure and troubleshoot than database replication3.

When a request arrives at a Hubnext server, it gets wrapped in a tracker (for logging and metrics) and then enters the web request router. Instead of matching by request path, the Hubnext request router gives each registered web handler a chance to accept responsibility for a request before it falls through to the next handler. This allows for more complex and free-form routing logic. The web router starts with basic things like an OPTIONS handler and DoS rejection. The vast majority of requests are handled by the Hubnext blob cache and page cache. These handlers keep bounded in-memory caches of popular blobs (images, JavaScript, and CSS) and pre-rendered versions of popular pages. But even on a cache hit, the server still runs some code to process things like cache headers, ETags, and compression.
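To make the "each handler gets a chance" idea concrete, here is a minimal sketch in Go. It is not Hubnext's actual code: the `Handler` interface, `TryServe`, `Router`, and the two example handlers are all names made up for illustration, and the real router surely does more (tracking, DoS rejection, caches).

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// Handler is a hypothetical interface. TryServe returns true if the handler
// accepted responsibility for the request and wrote a response.
type Handler interface {
	TryServe(w http.ResponseWriter, r *http.Request) bool
}

// HandlerFunc adapts a plain function to the Handler interface.
type HandlerFunc func(w http.ResponseWriter, r *http.Request) bool

func (f HandlerFunc) TryServe(w http.ResponseWriter, r *http.Request) bool {
	return f(w, r)
}

// Router offers each request to its handlers in registration order.
type Router struct {
	handlers []Handler
}

func (rt *Router) Register(h Handler) { rt.handlers = append(rt.handlers, h) }

func (rt *Router) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	for _, h := range rt.handlers {
		if h.TryServe(w, r) {
			return // the first handler to accept the request wins
		}
	}
	http.NotFound(w, r) // nobody wanted it: give up with a 404
}

// newRouter builds a router with two illustrative handlers.
func newRouter() *Router {
	rt := &Router{}
	// Accepts every OPTIONS request, whatever the path.
	rt.Register(HandlerFunc(func(w http.ResponseWriter, r *http.Request) bool {
		if r.Method != http.MethodOptions {
			return false
		}
		w.Header().Set("Allow", "GET, HEAD, POST, OPTIONS")
		w.WriteHeader(http.StatusNoContent)
		return true
	}))
	// Accepts only one specific path.
	rt.Register(HandlerFunc(func(w http.ResponseWriter, r *http.Request) bool {
		if r.URL.Path != "/hello" {
			return false
		}
		fmt.Fprint(w, "hello")
		return true
	}))
	return rt
}

// statusFor runs one in-memory request through the router and returns the
// response status code.
func statusFor(rt *Router, method, path string) int {
	rec := httptest.NewRecorder()
	rt.ServeHTTP(rec, httptest.NewRequest(method, path, nil))
	return rec.Code
}

func main() {
	rt := newRouter()
	fmt.Println(statusFor(rt, "OPTIONS", "/anything")) // 204
	fmt.Println(statusFor(rt, "GET", "/hello"))        // 200
	fmt.Println(statusFor(rt, "GET", "/missing"))      // 404
}
```

Because a handler decides for itself whether to accept, it can look at the method, headers, or anything else about the request, not just the path.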

Blobs whose request path starts with /_version_ get long cache expiration times, which instructs Google Cloud CDN to cache them. Overall, about 40% of the requests to RogerHub are served from the CDN without ever reaching Hubnext. You can distinguish these cache hits by their Age header. ETags are generated from the SHA256 hash of the blob’s path and payload. Compression is applied to blobs based on a whitelist of compressible MIME types. The blob cache holds both the original and the gzipped versions of the payload, along with pre-computed ETags and other metadata, like the blob’s modification date.
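Here is a rough sketch of what preparing a blob cache entry along those lines could look like. The struct, the function names, the MIME whitelist, and the zero-byte separator between path and payload are all my own assumptions for illustration; the post doesn't say exactly how Hubnext combines the path and payload before hashing.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// compressible is an illustrative whitelist, not Hubnext's real list.
var compressible = map[string]bool{
	"text/html":              true,
	"text/css":               true,
	"application/javascript": true,
}

// cachedBlob holds everything that can be computed once, ahead of time.
type cachedBlob struct {
	ETag    string
	Raw     []byte
	Gzipped []byte // nil when the MIME type is not whitelisted
}

// etagFor hashes the path and payload together with SHA-256. The zero byte
// separator is an assumption made here so the boundary is unambiguous.
func etagFor(path string, payload []byte) string {
	h := sha256.New()
	h.Write([]byte(path))
	h.Write([]byte{0})
	h.Write(payload)
	return `"` + hex.EncodeToString(h.Sum(nil)) + `"`
}

// newCachedBlob pre-computes the ETag and, for whitelisted MIME types,
// the gzipped payload.
func newCachedBlob(path, mimeType string, payload []byte) (cachedBlob, error) {
	b := cachedBlob{ETag: etagFor(path, payload), Raw: payload}
	if !compressible[mimeType] {
		return b, nil
	}
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(payload); err != nil {
		return b, err
	}
	if err := zw.Close(); err != nil {
		return b, err
	}
	b.Gzipped = buf.Bytes()
	return b, nil
}

func main() {
	css := bytes.Repeat([]byte("body { margin: 0; }\n"), 50)
	blob, err := newCachedBlob("/_version_/site.css", "text/css", css)
	if err != nil {
		panic(err)
	}
	fmt.Println("etag:", blob.ETag)
	fmt.Println("raw bytes:", len(blob.Raw), "gzipped bytes:", len(blob.Gzipped))
}
```

Doing this work when the blob enters the cache means a cache hit only has to compare the request's `If-None-Match` header and pick the right payload.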

All of the remaining requests4 eventually make their way to PostgreSQL for content. As I said in a previous blog post, Hubnext stores all of its content in PostgreSQL, including binary blobs like images and PDFs. This isn’t a problem, because Hubnext communicates with its database over a local unix socket and most blobs are cached in memory anyway. This does prevent Hubnext from using sendfile to deliver blobs like filesystem-based web servers do, but there aren’t many large files hosted on RogerHub anyway.

At this point, nearly all requests have already been served by one of the previously mentioned handlers. But the majority of Hubnext’s web server code is dedicated to serving the remaining fraction. This includes archive pages, search results, new comment submissions, the RSS feed, and legacy redirect handlers. If all else fails, Hubnext gives up and returns a 404. The response is sent back to the load balancer over HTTP/1.1 with TLS5 and then forwarded on to the user. Thus completes the request.

Is the serving path shorter than you imagined? I think so6. More code sometimes just means additional maintenance burden and a greater attack surface. But altogether, Hubnext actually consists of far less code than the code it replaces. Furthermore, it’s nearly all code that I wrote and it’s all written in a single language as part of a single system. Say what you want about Go, but I think it’s the best language for implementing traditional CRUD services, especially those with a web component7. Hubnext is a distillation of what it means to deliver RogerHub reliably and efficiently to its audience, without the frills and patches of a traditional website. Anyway, I hope this post was a good distraction. Until next time!

  1. Plus, I happen to work for Google (but I didn’t when I first signed up with GCP), so I hear a lot about these services at company meetings. ↩︎
  2. If required, Hubnext could theoretically run on 10-20 instances with no problem. But the peer to peer syncing protocol is meant to be a fully connected graph, so at some point, it might run into issues with too much network traffic. ↩︎
  3. Hubnext also doesn’t use PostgreSQL’s backup tools, because Hubnext can create application-specific backups that are more convenient and understandable than traditional database backups. ↩︎
  4. Some pages simply aren’t cached, like search result pages and 404s. ↩︎
  5. Google HTTP(S) Load Balancing uses HTTP/1.1 exclusively when forwarding requests to backend servers, although it prefers HTTP/2 on the front end. ↩︎
  6. But for several years, RogerHub just ran as PHP code on cheap shared web hosts, which makes even this setup seem overly complicated. But it’s actually quite a bit of work to run and maintain a modern PHP environment. You’ll probably need a dozen C extensions and a reverse proxy and perhaps a suite of cron jobs to back up all your files and databases. For a long time, I followed a bunch of announcement mailing lists to get news about security fixes in PHP and WordPress and MediaWiki and NGINX and MySQL and all the miscellaneous packages on ubuntu-security-announce. This new serving path means I really only need to keep an eye out on security announcements for Go (which tend to be rare and fairly mild) or OpenSSH or maybe PostgreSQL. ↩︎
  7. After all, that covers a lot of what Google does. ↩︎

Something else

I talk a lot about one day giving up computers and moving out, far, far away to a place where there aren’t any people, in order to become a lumberjack or a farmer. Well lately, it’s been not so much “talking” as instant messaging or just mumbling to myself. Plus, I don’t have the physique to do either of those things. My palms are soft and I’m prone to scraping my arms all the time. I like discipline as a virtue, but I also don’t really like working. And finally, my hobby is proposing stupid outlandish half-joking-half-serious expensive irresponsible plans. It’s fun, and I guess you can’t really say something’s a bad idea until you’ve thought it through yourself.

Joking aside, my motivation comes from truth. Computers are terrible. It becomes more and more apparent to me every year that passes. I want to get far, far away from them. They’re the source of most of my problems1. And lately, it has become clear that lots of other people feel the same way.

Terribleness isn’t the computer’s fault. It all comes from a fundamental disconnect between what people think their computers are and the terrible reality of it all. A massive fraction of digital data is at constant risk of being lost forever, because it’s stored without redundancy on consumer devices scraped from the bottom of the barrel. Critical security vulnerabilities lurk in every computer system. Most of the time, they aren’t discovered simply because nobody has bothered looking hard enough. But given the lightest push, so much software2 just falls apart. You can break all kinds of software just by telling it to do two things at once. Load an article and swipe back at the same time? Bam! Your reddit app UI is now unusable. Submit the same web form from two different tabs? Ta-da! Your session state is now corrupt.

At this moment, my Comcast account is fucked beyond repair after a series of service cancellations, reinstatements, and billing credits put it in an unusual state. I needed to order internet service at my next apartment. I couldn’t do it on the website, probably because of the previous issues. I also didn’t think it was worthwhile explaining to yet another customer service rep why my account looked the way it did. So, I just made a new account and ordered it there3.

Boats at Fisherman's Wharf in Monterey Bay.

You can get rid of all these problems if you try hard enough. Some people simply don’t own any data. Their email and photos and documents and (nonexistent) depositories of old code just come and go along with the devices they’re stored on. They don’t have data at risk of compromise, because they don’t have any digital secrets and important computer systems (like their bank) just have humans double-checking things every step of the way.

Alternatively, you could put expiration dates on your data. Email keeps for 2 years. Photos: five. Sort your data into annual folders, and when the expiration date passes, simply delete it. This strategy takes the focus off of maintaining perfect data integrity and opens up a new method to measure your success. At the end of the year, if you never had trouble finding an old photo and your credit score was doing alright, then declare success.

Maybe you could live somewhere where people don’t really need computers. Maybe they have a phone that only makes calls and a nice big TV and maybe some books and paper and stuff to do outside. Or maybe they have iPads too—little TV’s that you can touch.

Or you could simply come to terms with the way things are. After all, your data only needs to last as long as your own body does.

A rusty old tractor.

I stayed on a farm over Memorial Day weekend. It was owned by a couple. They had a tractor and some trucks and dogs and a jacuzzi and two little huts to rent out on Airbnb. I wonder if they had computers at their farm, or if they just had iPads.

Do they enjoy having a lot of land? Or is it wearisome waking up to the same rusting trucks and dirt roads every day? They probably have a lot of privacy, on account of their lack of upstairs, downstairs, and adjacent neighbors in their non-apartment home. Although, that’s probably less true if internet strangers are renting out their cabins all the time.

They had rows of crops, like the ones you see along highway 5 in central California. I wasn’t sure exactly who they belonged to, since their nearest neighbor was a solid 10 minute walk down the road. I wonder if they’re all friends.

My Subaru Impreza parked outside a cabin.

I spend a lot of my weekends working on my personal infrastructure. I can’t quite explain what it is I’m working on or why it needs work at all. It’s a combination of data management, software sandboxing, and build systems. At some point, I have to wonder if I’m spending more time working on my tools than actually using those tools to do stuff.

I also have a bunch of to-do’s on my Todoist account. The length of those lists has grown considerably since I graduated college a year ago4. I can’t believe it’s already been more than a year since then. How soon will it be two years? Did I live this past year right? Did I accomplish enough things?

Fields in Salinas.

I guess the answer really depends on how you measure “enough”. A year ago, I told myself that I wanted to become the best computer programmer that I could be. I feel like I’ve made lots of progress on that front. I picked up lots of good habits at work and my personal infrastructure is better than it’s ever been. I did a lot of cool stuff with RogerHub, and it finally has infrastructure that I can be proud of. But somehow, I’m also more dissatisfied with my computing than ever before.

I’m still living in the same area, and I’ve just committed to living here for yet another year. My plan had always been to stay in the Bay Area for three or four years, and then move away. But where to, and what for? Right now, I can’t even imagine what it’d be like to leave California5.

I’ve gotten better at cooking. More specifically, I think I’ve gotten more familiar with what kinds of food I like. On most days, I cook dinner for myself at home, and then I watch YouTube and browse the internet until I have to go to sleep. If it’s May or December, I also have to answer a few math questions.

But I haven’t made any new friends in the last year, other than my direct coworkers. I’ve spent more free time at home than ever before. I spend a small fortune on rent, after all. I might as well try to get some value out of it.

Bread and strawberries on a table.

Three years ago, I remember thinking that I didn’t know any adults who were actually happy, from which I concluded that growing up just kind of sucked.

I rarely drive anymore. I used to think driving at night on empty roads was nice and relaxing. But the roads around here just feel crowded and dangerous.

I’ll continue to work on computers, because that’s all I’m good at. More often than not, the last thought in my head before I fall asleep is an infinite grid of computers. On the bright side, I fixed my chair today and I bought a new toilet seat. As long as there’s a trickle of new purchases arriving at my doorstep, maybe things will be alright.

This post was kind of a downer. I’ll return with more website infrastructure next week6.

And since I haven’t posted a photo of me in a while, here’s a recent one:

Roger sitting by the ocean.

The photo on my home page doesn’t even look like me anymore. But I like it as an abstract icon, so it’ll probably stay there for a while, until I change it to something else.

  1. Which, I suppose, is pretty fortunate compared to the alternative. ↩︎
  2. You might blame software instead of the computer itself, but to most people, they’re the same thing. ↩︎
  3. I’m grateful I have such easy access to additional email addresses and phone numbers. ↩︎
  4. In fact, I only recently discovered that Todoist imposes a 200 task limit per project. ↩︎
  5. In any case, I realized this past weekend that I have way too many physical belongings to haul around, if I ever want to move. ↩︎
  6. I backdated this to July 31st, since that’s when I wrote most of the content. ↩︎

Training angst

Have you ever used Incognito Mode because you wanted to search for something weird, but you didn’t want it showing up in targeted ads? Or have you ever withheld a Like from a YouTube video, because although you enjoyed watching it, you weren’t really interested in being recommended more of the same? I have. And since I can’t hear you, I’ll assume you probably have too. People have gotten accustomed to the idea of “training” their computers to behave how they want, much like you’d train a dog or your nephew. And whether you study computer science or psychology or ecology or dog stuff, the principles of reinforcement learning are all about the same.

The reason you don’t search weird stuff while logged in or thumbs-up everything indiscriminately is that you’re trying to avoid setting the wrong example. But occasional slip ups are a fact of life. To compensate, many machine learning models offer mechanisms to deal with erroneous (“noisy”) labels in training data. The constraints of a soft margin SVM include a slack term that represents a trade-off between classification accuracy and resilience against mislabeled examples. Because a computer can’t tell which of its training examples are mislabeled and which are simply unusual, it does the next best thing: each example can be rated based on how “believable” it is in comparison to other examples. Then, finding the optimal parameters is simply a matter of minimizing unbelievability.
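For the curious, the usual textbook way to write the soft margin SVM looks like this. The symbols are the standard ones rather than anything defined in this post: $x_i$ and $y_i \in \{-1, +1\}$ are the training examples and their labels, $w$ and $b$ are the parameters being learned, $\xi_i$ is the slack (the "unbelievability") of example $i$, and $C$ sets how harshly slack is punished.

```latex
\begin{aligned}
\min_{w,\, b,\, \xi} \quad & \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i \\
\text{subject to} \quad & y_i \left( w^\top x_i + b \right) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \qquad i = 1, \dots, n
\end{aligned}
```

A large $C$ means the model tries hard to agree with every label, even the suspicious ones. A small $C$ lets it shrug off a few examples as noise in exchange for a wider margin.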

Avoiding bad examples is in your best interest if you want the algorithm to give you the best recommendations. So, your YouTube Liked Videos list is probably only a rough approximation of the videos that you actually like1. Now, a computer algorithm won’t mind if you lie (a little) to it. But the real tragedy is that as a side effect, YouTube has effectively trained you, the user, to give Likes not to the videos you actually like, but to the kinds of videos you want recommendations for. In fact, most kinds of behavioral training induce this kind of reverse effect. The trainer lies a little bit to the trainee, in order to push the training in the right direction. And in return, the trainer drifts a little farther from the truth.

Parents do this to their children. Friends do it to their friends. Even if you try to be honest, the words you say and the reactions you make end up deviating ever so slightly from the truth, because you can’t help but think that your actions will end up in somebody’s brain as subconscious behavioral training examples2. If your friend invites you to do something you don’t want to do, maybe you’ll say yes, or else they might not even ask next time. And if they say something you don’t like, maybe you’ll act angrier than you really are, so they won’t mention it ever again. Every decision starts with “how do I feel about this?”, but is quickly followed up with “how will others feel about my feeling about this?”. This isn’t plain old empathy. It’s true that human-to-human behavioral training helps people get along with each other. But when our words and actions are influenced by how we think they’ll affect someone else’s behavior, they end up being fundamentally just another form of lying. And unlike a computer recommendation algorithm, people might actually hate you for lying to them.

This has been weighing on my mind for a long time. I think it’s unbearably hard for adults to be emotionally honest with each other, even for close friends or family. But the problem isn’t with the words we say. Of course, people want others to think of them a certain way, whether it’s about your money or your job or your passions or romance or mental health. And people lie about those things all the time. That isn’t what keeps me up at night. What bothers me is that even when you’re trying your best to be absolutely honest with someone, you can’t. You say the right words, but they don’t sound right. You feel the right feelings, but your face isn’t cooperating. Your eyes get hazy from years of emotional cruft, and you’re no longer able to really see the person right in front of you. And it’s all because we spend every day training each other with almost-truths.

A flower.

  1. The Like button is actually short for the “recommend me more videos Like this one” button. ↩︎
  2. Have you watched Bojack Horseman? ↩︎

Data loss and you

My laptop’s hard drive crashed in 2012. I was on campus walking by Evans Hall, when I took my recently-purchased ThinkPad X230 out of my backpack to look up a map (didn’t have a smartphone), only to realize it wouldn’t boot. This wasn’t a disaster by any means. It set me back $200 to rush-order a new 256GB Crucial M4 SSD. But since I regularly backed up my data to an old desktop running at my parents’ house, I was able to restore almost everything once I received it1.

I never figured out why my almost-new laptop’s hard drive stopped working out of the blue. The drive still spun up, yet the system didn’t detect it. But whether it was the connector or the circuit board, that isn’t the point. Hardware fails all the time for no reason2, and you should be prepared for when it happens.

Data management has changed a lot in the last ten years, primarily driven by the growing popularity of SaaS (“cloud”) storage and greatly improved network capacity. But one thing that hasn’t changed is that most people are still unprepared for hardware failure when it comes to their personal data. Humans start manufacturing data from the moment they’re born. Kids should really be taught data husbandry, just like they’re taught about taxes and college admissions and health stuff. But anyway, here are a few things I’ve learned about managing data that I want to share:

Identify what’s important

Data management doesn’t work if you don’t know what you’re managing. In other words, what data would make you sad if you lost access to it? Every day, your computer handles massive amounts of garbage data: website assets, Netflix videos, application logs, PDFs of academic research, etc. There’s also the data that you produce, but don’t intend to keep long-term: dash cam and surveillance footage (it’s too big), your computer settings (it’s easy to re-create), or your phone’s location history (it’s too much of a hassle to extract).

For most people, important data is the data that’s irreplaceable. It’s your photos, your notes and documents, your email, your tax forms, and (if you’re a programmer) your enormous collection of personal source code.

Consider the threats

It’s impossible to predict every possible bad thing that could happen to your data. But fortunately, you don’t have to! You can safely ignore all the potential data disasters that are significantly less likely to occur than your own untimely death3. That leaves behind a few possibilities, roughly in order of decreasing likelihood:

  • Hardware failure
  • Malicious data loss (somebody deletes your shit)
  • Accidental data loss (you delete your shit)
  • Data breach (somebody leaks your shit)
  • Undetected data degradation

Hardware failures are the easiest to understand. Hard drives (external hard drives included), solid state drives, USB thumb drives, and memory cards all have an approximate “lifespan”, after which they tend to fail catastrophically4. The rule of thumb is 3 years for external hard drives, 5 years for internal hard drives, and perhaps 10 years for enterprise-grade hard drives.

Malicious data loss has become much more common these days, with the rise of a digital extortion scheme known as “ransomware”. Ransomware encrypts user files on an infected machine, usually using public-key cryptography in at least one of the steps. The encryption is designed so that the infected computer can encrypt files easily, but is unable to reverse the encryption without the attacker’s cooperation (which is usually made available in exchange for a fee). Fortunately, ransomware is easily detectable, because the infected computer prompts you for money once the data loss is complete.

On the other hand, accidental data loss can occur without anybody noticing. If you’ve ever accidentally overwritten or deleted a file, you’ve experienced accidental data loss. Because it can take months or years before accidental data loss is noticed, simple backups are sometimes ineffective against it.

Data breaches are a unique kind of data loss, because they don’t necessarily mean you’ve lost access to the data yourself. Some kinds of data (passwords, tax documents, government identification cards) lose their value when they become available to attackers. So, your data management strategy should also identify if some of your data is confidential.

Undetected data degradation (or “bit rot”) occurs when your data becomes corrupted (either by software bugs or by forces of nature) without you noticing. Modern disk controllers and file systems can provide some defense against bit rot (for example, in the case of bad sectors on a hard disk). But the possibility remains, and any good backup strategy needs a way to detect errors in the data (and also to fix them).

Things you can’t back up

Backups and redundancy are generally the solutions to data loss. But you should be aware that there are some things you simply can’t back up. For example:

  • Data you interact with, but can’t export. For example, your comments on social media would be difficult to back up.
  • Data that’s useless (or less useful) outside of the context of a SaaS application. For example, you can export your Google Docs as PDFs or Microsoft Word files, but then they’re no longer Google Docs.

Redundancy vs backup

Redundancy is buying 2 external hard drives, then saving your data to both. If either hard drive experiences a mechanical failure, you’ll still have a 2nd copy. But this isn’t a backup.

If you mistakenly overwrite or delete an important file on one hard drive, you’ll probably do the same on the other hard drive. In a sense, backups require the extra dimension of time. There needs to be either a time delay in when your data propagates to the backup copy, or better yet, your backup needs to maintain multiple versions of your data over time.

RAID and erasure encoding both offer redundancy, but do not count as a backup.
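The difference is easy to see in a toy model. The sketch below is purely illustrative (the `mirror` and `versioned` types are made up, and files are just strings in a map): a mirror faithfully copies your mistake to both drives, while a backup that keeps older snapshots lets you go back in time.

```go
package main

import "fmt"

// mirror models pure redundancy: every write lands on both copies at once.
type mirror struct {
	a, b map[string]string
}

func newMirror() *mirror {
	return &mirror{a: map[string]string{}, b: map[string]string{}}
}

func (m *mirror) write(name, content string) {
	m.a[name] = content
	m.b[name] = content
}

// versioned models a backup that keeps every snapshot it has ever taken.
type versioned struct {
	snapshots []map[string]string
}

// snapshot stores a copy of the live files as they are right now.
func (v *versioned) snapshot(live map[string]string) {
	cp := make(map[string]string, len(live))
	for name, content := range live {
		cp[name] = content
	}
	v.snapshots = append(v.snapshots, cp)
}

// restore returns a file as of snapshot n, where 0 is the oldest.
func (v *versioned) restore(n int, name string) (string, bool) {
	if n < 0 || n >= len(v.snapshots) {
		return "", false
	}
	content, ok := v.snapshots[n][name]
	return content, ok
}

func main() {
	m := newMirror()
	var backup versioned

	m.write("notes.txt", "important notes")
	backup.snapshot(m.a)

	m.write("notes.txt", "") // oops: accidental overwrite
	backup.snapshot(m.a)

	fmt.Printf("second drive: %q\n", m.b["notes.txt"]) // second drive: ""
	old, _ := backup.restore(0, "notes.txt")
	fmt.Printf("snapshot 0:   %q\n", old) // snapshot 0:   "important notes"
}
```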

Backups vs archives

Backups are easier if you have less data. You can create archives of old data (simple ZIP archives will do) and back them up separately from your “live” data. Archives make your daily backups faster and also make it easier to perform data scrubbing.

When you’re archiving data, you should pick an archive format that will still be readable in 30 to 50 years. Proprietary and non-standard archive tools might fall out of popularity and become totally unusable in just 10 or 15 years.

Data scrubbing

One way to protect against bit rot is to check your data periodically against known-good versions. For example, if you store cryptographic checksums with your files (and also digitally sign the checksums), you can verify the checksums at any time and detect bit rot. Make sure you have redundant copies of your data, so that you can restore corrupted files if you detect errors.

I generate SHA1 checksums for my archives and sign the checksums with my GPG key.
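A scrubbing pass boils down to "re-hash everything and compare". Here is a minimal sketch of that loop in Go, not my actual tooling. It uses SHA-256 instead of SHA1, skips the GPG signing step entirely, and the `checksum` and `scrub` helpers and the file name are invented for the example.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"os"
	"path/filepath"
	"sort"
)

// checksum returns the hex-encoded SHA-256 digest of data.
func checksum(data []byte) string {
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:])
}

// scrub re-reads every file named in the manifest (path -> expected digest)
// and returns the sorted paths that are unreadable or no longer match.
// The read function is a parameter so the same logic works on any storage.
func scrub(manifest map[string]string, read func(string) ([]byte, error)) []string {
	var bad []string
	for path, want := range manifest {
		data, err := read(path)
		if err != nil || checksum(data) != want {
			bad = append(bad, path)
		}
	}
	sort.Strings(bad)
	return bad
}

func main() {
	dir, err := os.MkdirTemp("", "scrub-demo")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	// Create a pretend archive and record its checksum while it is known-good.
	path := filepath.Join(dir, "photos-2012.zip")
	if err := os.WriteFile(path, []byte("pretend archive"), 0o600); err != nil {
		panic(err)
	}
	manifest := map[string]string{path: checksum([]byte("pretend archive"))}
	fmt.Println("damaged files before:", len(scrub(manifest, os.ReadFile)))

	// Simulate bit rot by changing a single character.
	if err := os.WriteFile(path, []byte("pretend archivf"), 0o600); err != nil {
		panic(err)
	}
	fmt.Println("damaged files after: ", len(scrub(manifest, os.ReadFile)))
}
```

Detection is only half the job. Once `scrub` reports a damaged file, you still need a second copy somewhere to restore it from.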

Failure domain

If your backup solution is 2 copies on the same hard drive, or 2 hard drives in the same computer, or 2 computers in the same house, then you’re consolidating your failure domain. If your computer experiences an electrical fire or your house burns down, then you’ve just lost all copies of your data.

Onsite vs offsite backups

Most people keep all their data within a 20 meter radius of their primary desktop computer. If all of your backups are onsite (e.g. in your home), then a physical disaster could eliminate all of the copies. The solution is to use offsite backups, either by using cloud storage (easy) or by stashing your backups at a friend’s house (pain in the SaaS).

Online vs offline backups

If a malicious attacker gains access to your system, they can delete your data. But they can also delete any cloud backups5 and external hard drive backups that are accessible from your computer. It’s sometimes useful to keep backups of your data that aren’t immediately deletable, either because they’re powered off (like an unplugged external hard drive) or because they’re read-only media (like data backups on Blu-ray Discs).

Encryption

You can reduce your risk of data leaks by applying encryption to your data. Good encryption schemes are automatic (you shouldn’t need to encrypt each file manually) and thoroughly audited by the infosec community. And while you’re at it, you should make use of your operating system’s full disk encryption capabilities (FileVault on macOS, BitLocker on Windows, and LUKS or whatever on Linux).

Encrypting your backups also means that you could lose access to them if you lose your encryption credentials. So, make sure you understand how to recover your encryption credentials, even if your computer is destroyed.

Online account security

If you’re considering cloud backups, you should also take steps to strengthen the security of your account:

  • Use a long password, and don’t re-use a password you’ve used on a different website.
  • Consider using a passphrase (a regular English sentence containing at least 4-5 uncommon words). Don’t share similar passphrases for multiple services (like “my facebook password”), because an attacker with access to the plaintext can easily guess the scheme.
  • Turn on two-factor authentication. The most common 2FA scheme (TOTP) requires you to type in a 6-8 digit code whenever you log in. You should prefer to use a mobile app (I recommend Authy) to generate the code, rather than to receive the code via SMS. Don’t forget to generate backup codes and store them in a physically secure top-secret location (e.g. underneath the kitchen sink).
  • If you’re asked to set security questions, don’t use real answers (they’re too easy to guess). Make up gibberish answers and write them down somewhere (preferably a password manager).
  • If your account password can be recovered via email, make sure your email account is also secure.
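For the curious, here’s roughly what a TOTP app does to produce that code. This is a minimal sketch of the standard algorithm (RFC 6238 with HMAC-SHA1), not the code of any particular app. The function name and parameters are my own, and real services hand you the secret as a base32 string that you’d have to decode first.

```python
import hashlib
import hmac
import struct
import time
from typing import Optional


def totp(secret: bytes, at: Optional[float] = None, digits: int = 6, step: int = 30) -> str:
    """Return the TOTP code for `secret` at Unix time `at` (RFC 6238, HMAC-SHA1)."""
    if at is None:
        at = time.time()
    # The "moving factor" is the number of 30-second steps since the Unix epoch.
    counter = int(at) // step
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    # Dynamic truncation: the low 4 bits of the last byte pick a 4-byte window.
    offset = mac[-1] & 0x0F
    value = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(value % 10 ** digits).zfill(digits)


if __name__ == "__main__":
    # Hypothetical secret, for illustration only.
    print(totp(b"12345678901234567890"))
```

The code depends only on the shared secret and the clock, which is why the backup codes matter: lose the secret and there’s nothing left to derive the digits from.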

Capacity vs throughput

One strong disadvantage of cloud backups is that transfers are limited to the speed of your home internet, especially for large uploads. Backups are less useful when they take days or weeks to restore, so be aware of how your backup throughput affects your data management strategy.

This problem also applies to high-capacity microSD cards and hard drives. It can take several days to fully read or write a 10TB data archival hard drive. Sometimes, smaller but faster solid state drives are well worth the investment.
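To put rough numbers on this, here’s a back-of-the-envelope calculation. The speeds are assumptions I picked for illustration (a 10 Mbit/s home uplink and a sustained 40 MB/s on a slow archival drive), not measurements.

```python
SECONDS_PER_DAY = 86_400
TB = 10 ** 12  # decimal terabyte, the unit drive vendors use


def transfer_days(size_bytes: float, bytes_per_second: float) -> float:
    """Days needed to move `size_bytes` at a sustained `bytes_per_second`."""
    return size_bytes / bytes_per_second / SECONDS_PER_DAY


# Assumed: 1 TB restored or uploaded over a 10 Mbit/s link (1.25 MB/s).
cloud_days = transfer_days(1 * TB, 10e6 / 8)

# Assumed: a 10 TB archival drive that sustains 40 MB/s.
disk_days = transfer_days(10 * TB, 40e6)

if __name__ == "__main__":
    print(f"cloud: {cloud_days:.1f} days, disk: {disk_days:.1f} days")
```

Plug in your own numbers. The point is only that the answer is measured in days, not minutes.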

File system features

Most people think of backups as “copies of their files”. But the precise definition of a “file” has evolved rapidly just as computers have. File systems have become very complex to meet the increasing demands of modern computer applications. But the truth remains that most programs (and most users) don’t care about most of those features.

For most people, “files” refers to (1) the directory-file tree and (2) the bytes contained in each file. Some people also care about file modification times. If you’re a computer programmer, you probably care about file permission bits (perhaps just the executable bit) and maybe symbolic links.

But consider this (non-exhaustive) list of filesystem features, and whether you think they need to be part of your data backups:

  • Capitalization of file and directory names
  • File owner (uid/gid) and permission bits, including SUID and sticky bits
  • File ACLs, especially in an enterprise environment
  • File access time, modification time, and creation time
  • Extended attributes (web quarantine, Finder.app tags, “hidden”, and “locked”)
  • Resource forks, on macOS computers
  • Non-regular files (sockets, pipes, character/block devices)
  • Hard links (also “aliases” or “junctions”)
  • Executable capabilities (maybe just CAP_NET_BIND_SERVICE?)

If your answer is no, no, no, no, no, what?, no, no, and no, then great! The majority of cloud storage tools will work just fine for you. But the unfortunate truth is that most computer programmers are completely unaware of many of these file system features. So, they write software that completely ignores them.

Programs and settings

Programs and settings are often left out of backup schemes. Most people don’t have a problem reconfiguring their computer once in a while, because catastrophic failures are unlikely. If you’re interested in creating backups of your programs, consider finding a package manager for your preferred operating system. Computer settings can usually be backed up with a combination of group policy magic for Windows and config files or /usr/bin/defaults for macOS.

Application-specific backup

If you’re backing up data for an application that uses a database or a complex file-system hierarchy, then you might be better served by a backup system that’s designed specifically for that application. For example, RogerHub runs on a PostgreSQL database, which comes with its own backup tools. But RogerHub uses an application-specific backup scheme that’s tailored to RogerHub specifically.

Testing

A backup isn’t a backup until you’ve tested the restoration process.
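One cheap way to test a restore is to restore into a scratch directory and compare checksums against the original. Here’s a minimal sketch of that idea. The function names are made up, it only looks at regular files and their bytes (none of the metadata from the file system features list), and it reads each file fully into memory, so it’s meant for small trees.

```python
import hashlib
from pathlib import Path


def tree_digests(root: Path) -> dict:
    """Map each regular file's path (relative to root) to its SHA-256 hex digest."""
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file() and not path.is_symlink():
            relative = path.relative_to(root).as_posix()
            digests[relative] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def restore_problems(original: Path, restored: Path) -> list:
    """List the files that are missing, unexpected, or different after a restore."""
    before, after = tree_digests(original), tree_digests(restored)
    problems = [f"missing: {p}" for p in before.keys() - after.keys()]
    problems += [f"unexpected: {p}" for p in after.keys() - before.keys()]
    problems += [f"differs: {p}" for p in before.keys() & after.keys() if before[p] != after[p]]
    return sorted(problems)
```

An empty list means the restored tree matches byte for byte. Anything else is a reason to distrust the backup before you actually need it.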

Recommendations

If you’ve just skipped to the end to read my recommendations, fantastic! You’re in great company. Here’s what I suggest for most people:

  • Use cloud services instead of files, to whatever extent you feel comfortable with. It’s most likely not worth your time to back up email or photos, since you could use Google Inbox or Google Photos instead.
  • Create backups of your files regularly, using the 3-2-1 rule: 3 copies of your data, on 2 different types of media, with at least 1 offsite backup. For example, keep your data on your computer. Then, back it up to an online cloud storage or cloud backup service. Finally, back up your data periodically to an external hard drive.
  • Don’t trust physical hardware. It doesn’t matter how much you paid for it. It doesn’t matter if it’s brand new or if you got the most advanced model. Hardware breaks all the time in the most unpredictable ways.
  • Don’t buy an external hard drive or a NAS as your primary backup destination. They’re probably no more reliable than your own computer.
  • Make sure to use full-disk encryption and encrypted backups.
  • Make sure nobody can maliciously (or accidentally) delete all of your backups, simply by compromising your primary computer.
  • Consider making archives of data that you use infrequently and no longer intend to modify.
  • Secure your online accounts (see section titled “Online account security”)
  • Pat yourself on the back and take a break once in a while. Data management is hard stuff!

If you find any mistakes on this page, let me know. I want to keep it somewhat updated.

And, here’s yet another photo:

Branches.

  1. My laptop contained the only copy of my finished yet unsubmitted class project. But technically I had a project partner. We didn’t actually work together on projects. We both finished each project independently, then just picked one version to submit. ↩︎
  2. About four and a half years later, that m4 stopped working and I ordered a MX300 to replace it. ↩︎
  3. That is, unless you’re interested in leaving behind a postmortem legacy. ↩︎
  4. There are other modes of failure other than total catastrophic failure. ↩︎
  5. Technically, most reputable cloud storage companies will keep your data for some time even after you delete it. If you really wanted to, you could explain the situation to your cloud provider, and they’ll probably be able to recover your cloud backups. ↩︎

Life lessons from artificial intelligence

If you speak to enough software engineers, you’ll realize that many of them can’t understand some everyday ideas without using computer metaphors. They say “context switching” to explain why it’s hard to work with interruptions and distractions. Empathy is essentially machine virtualization, but applied to other people’s brains. Practicing a skill is basically feedback-directed optimization. Motion sickness is just your video processor overheating, and so on.

A few years ago, I thought I was the only one whose brain used “computer” as its native language. And at the time, I considered this a major problem. I remember one summer afternoon, I was playing scrabble with some friends at my parents’ house. At that time, I had just finished an internship, where day-to-night I didn’t have much to think about other than computers. And as I stared at my scrabble tiles, I realized the only word I could think of was EEPROM1.

It was time to fix things. I started reading more. I’ve carried a Kindle in my backpack since I got my first Kindle2 in high school, but I haven’t always used it regularly. It’s loaded with a bunch of novels. I don’t like non-fiction, especially the popular non-fiction about famous politicians and the economy and how to manage your time. It seems like a waste of time to read about reality, when make-believe is so much more interesting.

I also started watching more anime. I especially like the ones where the main character has a professional skill and that skill becomes an inextricable part of their personal identity3. During my last semester in college, I thought really hard about whether I really wanted to just be a computer programmer until I die, or whether I simply had no other choice, because I wasn’t good at anything else. And so, I watched Hibike! Euphonium obsessively, searching for answers.

Devoting your life to a skill can be frustrating. It makes you wonder if you’d be a completely different person if that part of you were suddenly ripped away. And then there’s the creeping realization that your childhood passion is slowly turning into yet another boring adult job. It’s like when you’re a kid and you want to be the strongest ninja in your village, but then you grow up and start working as a mercenary. You can still do ninja stuff all day, but it’s just not fun anymore.

But I like those shows because it’s inspiring and refreshing to watch characters who really care about being really good at something, as long as that something isn’t just “make a ton of money”. I think it’s important to have passion and a competitive spirit for at least one thing. It’s no fun being just mediocre at a bunch of things. Plus, being good at something gives you a unique perspective on the world, and that perspective comes with insights worth sharing.

I thought a lot about Q-learning during the months after my car accident. I think normal people are generally unprepared to respond rationally in crisis situations. And that’s at least partially because most of us haven’t spent enough time evaluating the relative cost of all the different terrible things that might happen to us on a day to day basis. Q-learning is a technique for decision-making that relies on predicting the expected value of taking an action in a particular state. In order for Q-learning to work, you need experience with both the state transitions (what could happen if I take this action?) and the cost of each of the outcomes. If you understand the transitions, but all of your costs are just “really bad, don’t let that happen”, then in a pinch, it becomes difficult to decide which bad outcome is the least terrible.

There are little nuggets of philosophy embedded all over the fields of artificial intelligence and machine learning. I skipped a lot of class in college, but I never skipped my introductory AI and ML classes. It turns out that machine learning and human learning have a lot in common. Here are some more ideas, inspired by artificial intelligence:

I try to spend as little time as possible shopping around before buying something, and that’s partially because of what’s called the Optimizer’s Curse4. The idea goes like this: Before buying something, you usually look at all your options and pick the best one. Since people aren’t perfect, sometimes you overestimate or underestimate how good your options are. The more options you consider, the higher the probability that the perceived value of your best option will be much greater than its actual value. Then, you end up feeling disappointed, because you bought something that isn’t as good as you thought it’d be.

Now that doesn’t mean you should just buy the first thing you see, since your first option might turn out to be really shitty. But if you’re reasonably satisfied with your options, it’s probably best to stop looking and just make your choice.
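If you want to see the Optimizer’s Curse in numbers, here’s a small simulation. It’s a sketch under assumptions I made up: every option’s true value and every estimation error is drawn from a standard normal distribution, and you always buy the option that looks best.

```python
import random
import statistics


def average_disappointment(num_options: int, trials: int = 20_000, seed: int = 0) -> float:
    """Average of (perceived value - true value) for the option that looked best."""
    rng = random.Random(seed)
    gaps = []
    for _ in range(trials):
        true_values = [rng.gauss(0.0, 1.0) for _ in range(num_options)]
        # Perception is the truth plus unbiased noise.
        perceived = [value + rng.gauss(0.0, 1.0) for value in true_values]
        best = max(range(num_options), key=perceived.__getitem__)
        gaps.append(perceived[best] - true_values[best])
    return statistics.fmean(gaps)


if __name__ == "__main__":
    for n in (1, 3, 10):
        print(n, round(average_disappointment(n), 2))
```

With a single option, the errors average out to about zero. As the number of options grows, the winner is increasingly the one you overestimated the most, so the average gap between what you expected and what you got keeps growing.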

But artificial intelligence also tells us that it’s not smart to always pick the best option. Stochastic optimization methods are based on the idea that, sometimes, you should take suboptimal actions just to experience the possibilities. Humans call this “stepping out of your comfort zone”. Machines need to strike a balance between “exploration” (trying out different options to see what happens) and “exploitation” (using experience to make good decisions) in order to succeed in the long run. This balance is what I’ll call the “learning rate” (reinforcement learning texts usually call it the exploration rate), and a good learning rate decreases over time. In other words, young people are supposed to make poor decisions and try new things, but once you get old, you should settle down5.

The difference in cumulative value resulting from sub-optimal decisions is known as “regret”. In the long run, machines should learn the optimal policy for decision-making. But machines should also try to reach this optimum with as little regret as possible. This is accomplished by adjusting the learning rate.
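Here’s a toy version of that trade-off: an epsilon-greedy player on a three-option “bandit”. Everything in it is an assumption for illustration, including the payoff probabilities and the 10/t decay schedule. In textbook terms, the knob being adjusted is the exploration rate (epsilon).

```python
import random


def run_bandit(decay: bool, steps: int = 5_000, seed: int = 1) -> float:
    """Play a 3-armed bandit with epsilon-greedy and return the total regret."""
    rng = random.Random(seed)
    true_means = [0.2, 0.5, 0.8]      # hypothetical payoff probability per option
    estimates = [0.0] * len(true_means)
    pulls = [0] * len(true_means)
    regret = 0.0
    for t in range(1, steps + 1):
        # Either explore less and less over time, or keep exploring 30% of the time.
        epsilon = min(1.0, 10.0 / t) if decay else 0.3
        if rng.random() < epsilon:
            arm = rng.randrange(len(true_means))                           # explore
        else:
            arm = max(range(len(true_means)), key=estimates.__getitem__)  # exploit
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        pulls[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / pulls[arm]           # running average
        # Regret: expected value given up by not picking the best option.
        regret += max(true_means) - true_means[arm]
    return regret


if __name__ == "__main__":
    print("decaying:", run_bandit(decay=True))
    print("constant:", run_bandit(decay=False))
```

The player that never stops exploring keeps paying for it forever, while the one that explores heavily at first and then settles down should accumulate much less regret over a long run.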

So is it wrong for parents to make all of their children’s decisions? A little guidance is probably valuable, but a too conservative learning rate converges to a suboptimal long-term policy6. I suppose kids should act like kids, and if they scrape their knees and do stupid stuff and get in trouble, that’s probably okay.

Anyway, there’s one more artificial intelligence technique that I don’t understand too well, but it comes with interesting implications for humans. It’s a technique for path planning applied to finite-horizon LQR problems, which are a type of problem where the system mechanics can be described linearly and the cost function is quadratic in the state. These restrictions yield a formulation that lets us compute a policy that is independent of the state of the system. In other words, the machine plans a path by starting at the goal, then working backward to determine what leads up to that goal.

The same policy can be applied no matter your goal (“terminal condition”), because all the mechanics of the system are encoded in the policy. For example, if your goal is to build rockets at NASA, then it’s useful to consider what needs to happen one day, one month, or even one year before your dream comes true. The policy becomes less and less useful when the distance to your goal increases, but by working backward far enough, you can figure out what to do tomorrow to take the first step.

And if your plans don’t work out, well don’t worry, because the policy is independent of the state of the system. You can reevaluate your trajectory at any point to put yourself back on the right track7.
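For a concrete picture of “working backward”, here’s the one-dimensional version of finite-horizon LQR as I understand it. It’s a sketch with made-up numbers: the system is x[t+1] = a*x[t] + b*u[t], each step costs q*x² + r*u², and the final state costs q_final*x². The backward pass (the Riccati recursion) produces one feedback gain per time step without ever looking at the actual state.

```python
def lqr_gains(a, b, q, r, q_final, horizon):
    """Backward Riccati recursion for the scalar system x[t+1] = a*x[t] + b*u[t]."""
    p = q_final                     # cost-to-go weight at the final step
    gains = []
    for _ in range(horizon):        # walk from the goal back toward the present
        k = (a * b * p) / (r + b * b * p)
        p = q + a * a * p - a * b * p * k
        gains.append(k)
    gains.reverse()                 # gains[t] is the gain to use at time step t
    return gains


def simulate(x0, gains, a, b):
    """Apply the control u = -k * x at every step and return the final state."""
    x = x0
    for k in gains:
        x = a * x + b * (-k * x)
    return x


if __name__ == "__main__":
    gains = lqr_gains(a=1.0, b=1.0, q=1.0, r=1.0, q_final=10.0, horizon=20)
    print(simulate(5.0, gains, 1.0, 1.0), simulate(-40.0, gains, 1.0, 1.0))
```

The same list of gains steers the system toward zero from any starting point, so getting knocked off course doesn’t require a new plan. You just keep applying the gain for whatever time step you’re on to wherever you happen to be.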

I miss learning signal processing and computer graphics and machine learning and all of these classes with a lot of math in them. I work on infrastructure and networking at work, which is supposedly my specialization. But I also feel like I’m missing out on a lot of great stuff that I used to be interested in. The math-heavy computer science courses always felt a little more legit. I always imagined college to be a lot of handwriting and equations and stuff. Maybe I’ll pick up another side project for this stuff soon.

And here’s a photo of the hard disk from my first laptop:

A hard disk lying on some leaves.

It died less than a month after I got the laptop. After that, I started backing up my data more religiously. Plus, I replaced the spinning rust with a new Crucial M4 and that lasted for about 4.5 years until it broke too. I still kept this hard drive chassis and platter, because it looks cool.

  1. Acronyms aren’t allowed anyway. ↩︎
  2. My first Kindle was a 3rd generation Kindle Keyboard. When I broke that one, I bought another Kindle Keyboard even though a newer model had been released. I didn’t want my parents to notice I had broken my Kindle so soon after I got it, so I hid the old Kindle in a manila envelope and used its adopted brother instead. Three years later, I upgraded to the Paperwhite, and that’s still in my backpack today. ↩︎
  3. See this or this. ↩︎
  4. But also partially because I’m a lazy bastard. ↩︎
  5. And yet, I haven’t left my apartment all weekend. ↩︎
  6. PAaaS: parenting advice as a service. ↩︎
  7. On second thought, this doesn’t have much to do with artificial intelligence. ↩︎

The data model of Hubnext

I got my first computer when I was 8. It was made out of this beige-white plastic and ran a (possibly bootlegged) copy of Windows ME1. Since our house had recently gotten DSL installed, the internet could be on 24 hours a day without tying up the phone line. But I didn’t care about that. I was perfectly content browsing through each of the menus in Control Panel and rearranging the files in My Documents. As long as I was in front of a computer screen, I felt like I was in my element and everything was going to be alright.

Computers have come a long way. Today, you can rent jiggabytes of data storage for literally pennies per month (and yet iPhone users still constantly run out of space to save photos). For most people living in advanced capitalist societies, storage capacity has been permanently eliminated as a reason why you might consider deleting any data at all. For people working in tech, there’s a mindset known as “big data”, where businesses blindly hoard all of their data in the hope that some of it will become useful at some time in the future.

On the other hand, I’m a fan of “small data”. It’s the realization that, for many practical applications, the amount of useful data you have is dwarfed by the overwhelming computing and storage capacity of modern computers. It really doesn’t matter how inefficient or primitive your programs are, and that opens up a world of opportunities for most folks to do ridiculous audacious things with their data2.

When RogerHub ran on WordPress, I set up master-slave database and filesystem replication for my primary and replica web backends. WordPress needs to support all kinds of ancient shared hosting environments, so WordPress core makes very few assumptions about its operating environment. But WordPress plugins, on the other hand, typically make a lot of assumptions about what kinds of things the web server is allowed to do3. So the only way to really run WordPress in a highly-available configuration is to treat it like a black box and try your best to synchronize the database and filesystem underneath it.

RogerHub has no need for all of that complexity. RogerHub is small data. Its 38,000 comments could fit in the system memory of my first cellphone4 and the blobs could easily fit in the included external microSD card. But perhaps more important than the size of the data is how simple RogerHub’s dataset is.

Database replication comes with its own complexities, because it assumes you actually need transaction semantics5. Filesystem replication is mostly a crapshoot with no meaningful conflict resolution strategy for applications that use disk like a lock server. But RogerHub really only collects one kind of data: comments. The nice thing about my comments is that they have no relationship to each other. You can’t reply directly to other comments. Adding a new comment is as simple as inserting it in chronological order. So theoretically, all of this conflict resolution mumbo jumbo should be completely unnecessary.

I call the new version of RogerHub “hubnext” internally6. Hubnext stores all kinds of data: comments, pages, templates7, blobs8, and even internal data, like custom redirects and web certificates. Altogether, these different kinds of data are just called “Things”.

One special feature of hubnext is that you can’t modify or delete a Thing, once it has been created (i.e. an append-only data store). This property makes it really easy to synchronize multiple sets of Things on different servers, since each replica of the hubnext software just needs to figure out which of its Things the other replicas don’t have. To make synchronization easier, each Thing is given a unique identifier, so hubnext replicas can talk about their Things by just using their IDs.

Each hubnext replica keeps a list of all known Thing IDs in memory. It also keeps a rolling set hash of the IDs. It needs to be a rolling hash, so that it’s fast to compute H(n_1, n_2, …, n_k, n_{k+1}), given H(n_1, n_2, …, n_k) and n_{k+1}. And it needs to be a set hash, so that the order of the elements doesn’t matter. When a new ID is added to the list of Thing IDs, the hubnext replica computes the updated hash, but it also remembers the old hash, as well as the ID that triggered the change. By remembering the last N old hashes and the corresponding Thing IDs, hubnext builds a “trail of breadcrumbs” of the most recently added IDs. When a hubnext replica wants to sync with a peer, it sends its latest N hashes through a secure channel. The peer searches for the most recent matching hash that’s in both the requester’s hashes and the peer’s own latest N hashes. If a match is found, then the peer can use its breadcrumbs to generate a “delta” of newly added IDs and return them back to the requester. And if a match isn’t found, the default behavior is to assume the delta should include the entire set of all Thing IDs.

This algorithm runs periodically on all hubnext replicas. It’s optimized for the most common case, where all replicas have identical sets of Thing IDs, but it also works well for highly unusual cases (for example, when a new hubnext replica joins the cluster). But most of the time, this algorithm is completely unnecessary. Most writes (like new comments, new blog posts, etc) are synchronously pushed to all replicas simultaneously, so they become visible to all users globally without any delay. The synchronization algorithm is mostly for bootstrapping a new replica or catching up after some network/host downtime.
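Here’s a rough Python sketch of the breadcrumb idea. It isn’t hubnext’s actual code (hubnext is written in Go), and the specific set hash is my own stand-in: the sum of each ID’s SHA-256 prefix modulo 2^64, which is cheap to update and doesn’t care about insertion order, but isn’t meant to resist a malicious peer. The class and method names are made up too.

```python
import hashlib
from collections import deque

MASK = (1 << 64) - 1


def id_hash(thing_id: int) -> int:
    """Hash one ID to 64 bits (assumed scheme, for illustration only)."""
    digest = hashlib.sha256(str(thing_id).encode()).digest()
    return int.from_bytes(digest[:8], "big")


class Replica:
    def __init__(self, trail_length: int = 64):
        self.ids = set()
        self.set_hash = 0                          # hash of the empty set
        self.trail = deque(maxlen=trail_length)    # (hash before the add, ID added)

    def add(self, thing_id: int) -> None:
        if thing_id in self.ids:
            return
        self.trail.append((self.set_hash, thing_id))
        self.ids.add(thing_id)
        # Rolling update: addition is commutative, so order doesn't matter.
        self.set_hash = (self.set_hash + id_hash(thing_id)) & MASK

    def recent_hashes(self) -> list:
        """Current hash first, then older hashes from newest to oldest."""
        return [self.set_hash] + [old for old, _ in reversed(self.trail)]

    def delta_for(self, requester_hashes) -> set:
        """IDs the requester is probably missing, given its recent hashes."""
        known = set(requester_hashes)
        if self.set_hash in known:
            return set()                           # already in sync
        delta = set()
        for old_hash, thing_id in reversed(self.trail):
            delta.add(thing_id)
            if old_hash in known:                  # most recent common point
                return delta
        return set(self.ids)                       # no match: send everything
```

A requester would call `peer.delta_for(me.recent_hashes())` and then fetch whichever of the returned IDs it doesn’t already have.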

To make sure that every Thing has a unique ID, the cluster also runs a separate algorithm to allocate chunks of IDs to each hubnext replica. The ID allocation algorithm is an optimistic majority consensus one-phase commit with randomized exponential backoff. When a hubnext replica needs a chunk of new IDs, it proposes a desired ID range to each of its peers. If more than half of the peers accept the allocation, then hubnext adds the range to its pool of available IDs. If the peers reject the allocation, then hubnext just waits a while and tries again. Hubnext doesn’t make an attempt to release partially allocated IDs, because collisions are rare and we can afford to be wasteful. To decide whether to accept or reject an allocation, each peer only needs to keep track of one 64-bit ID, representing the largest known allocated ID. And to make the algorithm more efficient, rejections will include the largest known allocated ID as a “hint” for the requester.
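The allocation round can be sketched like this. Again, this is an in-process toy and not the real thing: the “peers” are plain objects instead of network calls, the requester doesn’t count its own vote, and the names, chunk size, and backoff constants are all assumptions.

```python
import random
import time


class Peer:
    def __init__(self):
        self.largest_allocated = 0   # the only state a peer needs to keep

    def handle_proposal(self, start: int, end: int):
        """Accept a range only if it lies entirely above every ID seen so far."""
        if start > self.largest_allocated:
            self.largest_allocated = end
            return True, None
        return False, self.largest_allocated    # rejection carries a hint


def allocate_chunk(peers, first_guess, chunk_size=1000, max_attempts=8,
                   sleep=time.sleep, rng=random.random):
    """Propose ID ranges until more than half of the peers accept one."""
    start = first_guess
    for attempt in range(max_attempts):
        end = start + chunk_size - 1
        replies = [peer.handle_proposal(start, end) for peer in peers]
        accepted = sum(1 for ok, _ in replies if ok)
        if accepted * 2 > len(peers):
            return range(start, end + 1)
        # Partially accepted ranges are never released, so skip past this one
        # and past the largest ID that any rejecting peer told us about.
        hints = [hint for ok, hint in replies if not ok]
        start = max([end] + hints) + 1
        # Randomized exponential backoff before the next proposal.
        sleep(rng() * (2 ** attempt) * 0.01)
    raise RuntimeError("could not allocate an ID chunk")
```

Since a peer only accepts ranges above everything it has seen, two overlapping proposals can never both win a majority, which is what keeps the IDs unique.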

There are some obvious problems with using an append-only set to serve website content directly. To address these issues, each Thing type contains (1) a “last modified” timestamp and (2) some unique identifier that links together multiple versions of the same thing. For blobs and pages, the identifier is the canonicalized URL. For templates, it’s the template’s name. For comments, it’s the Thing ID of the first version of the comment. When the website needs to fetch some website content, it only considers the instance of the data with the latest “last modified” timestamp among multiple Things with the same identifier.
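In code, “latest version wins” looks something like the sketch below. The field names are hypothetical, and breaking timestamp ties by Thing ID is my own assumption. The real lookup happens through database indexes rather than a loop over everything.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thing:
    thing_id: int
    key: str               # e.g. canonical URL, template name, or first comment ID
    last_modified: float   # Unix timestamp
    body: str


def current_versions(things):
    """Return the newest Thing for each key, ignoring superseded versions."""
    latest = {}
    for thing in things:
        seen = latest.get(thing.key)
        # Assumed tie-break: if timestamps are equal, the larger Thing ID wins.
        if seen is None or (thing.last_modified, thing.thing_id) > (seen.last_modified, seen.thing_id):
            latest[thing.key] = thing
    return latest
```

“Editing” a page then just means appending a new Thing with the same key and a newer timestamp. The old version stays in the store but stops being served.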

Overall, I’m really satisfied with how this data storage model turned out. It makes a lot of things easier, like website backups, importing/exporting data, and publishing new website content. I intentionally glossed over the database indexing magic that makes all of this somewhat efficient, but that’s nonetheless present. There’s also an in-memory caching layer for the most commonly-requested content (like static versions of popular web pages and assets). Plus, there’s some Google Cloud CDN magic in the mix too.

It’s somewhat unusual to store static assets (like images and javascript) in a relational database. The only reason why I can get away with it is because RogerHub is small data. The only user-produced content is plaintext comments, and I don’t upload nearly enough images to fill up even the smallest GCE instances.

Anyway, have a nice Friday. If I find another interesting topic about Hubnext, I’ll probably write another blog post like this one soon.

A bridge in Kamikochi, Japan.

  1. But not for long, because I found install disks for Windows 2000 and XP in the garage and decided to install those. ↩︎
  2. I once made a project grading system for a class I TA’ed in college. It ran on a SQLite database with a single global database lock, because that was plenty fast for everybody. ↩︎
  3. Things like writing to any location in the web root and assuming that filesystem locks are real global locks. ↩︎
  4. a Nokia 5300 with 32MB of internal flash ↩︎
  5. I’ve never actually seen any WordPress code try to use a transaction. ↩︎
  6. Does “internally” even mean anything if it’s just me? ↩︎
  7. Templates determine how different pages look and feel. ↩︎
  8. Images, stylesheets, etc. ↩︎

What’s “next” for RogerHub

Did I intentionally use 3 different smart quotes in the title? You bet I did! But did it require a few trips to fileformat.info and some Python to figure out what the proper octal escape sequences are? As a matter of fact, yes. Yes it did. And if you’re wondering, they’re \342\200\231, \342\200\234, and \342\200\2351.
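In case you’d rather not make the trip to fileformat.info, here’s a small Python sketch that goes in both directions. The octal sequences are just the UTF-8 bytes of each character, written in base 8. The function name is made up.

```python
import unicodedata


def octal_escape(text: str) -> str:
    """Write each UTF-8 byte of `text` as a backslash followed by three octal digits."""
    return "".join(f"\\{byte:03o}" for byte in text.encode("utf-8"))


# The three sequences from this post, as Python byte strings.
SEQUENCES = [b"\342\200\231", b"\342\200\234", b"\342\200\235"]

if __name__ == "__main__":
    for raw in SEQUENCES:
        char = raw.decode("utf-8")
        print(octal_escape(char), f"U+{ord(char):04X}", unicodedata.name(char))
```

Those decode to the right single quotation mark (U+2019), the left double quotation mark (U+201C), and the right double quotation mark (U+201D), which are the three curly quotes in the title.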

The last time I rewrote RogerHub.com was in November of 2010, more than 6 years ago. Before that, I was using this PHP/MySQL blogging software that I wrote myself. RogerHub ran on cheap shared hosting that cost $44 USD per year. I moved the site to WordPress because I was tired of writing basic features (RSS feeds, caching, comments, etc.) myself. The whole migration process took about a week. That includes translating my blog theme to WordPress, exporting all my articles2, and setting up WordPress via 2000s-era web control panels and FTP.

Maybe it’s that time again? The time when I’m unhappy with my website and need to do something drastic to change things up.

To be fair, my “personal blog” doesn’t really feel like a blog anymore. Since RogerHub now gets anywhere between 2^17 and 2^21 visitors per month, it demands a lot more of my attention than a personal blog really should. During final exam season, I log onto my website every night to collect my reward: a day’s worth of final exam questions and outdated memes3. Meanwhile, I wrote 3 blog posts last year and just 1 the year before that.

I want to take back my blog. And I want to strategically reduce the amount of time I spend managing the comments section without eliminating it altogether. Lately I’ve been too scared to make changes to my blog, because of how it might break other parts of the site. On top of that, I have to build everything within the framework of WordPress, an enormous piece of software written by strangers in a language that gives me no pleasure to use. I miss when it didn’t matter if I broke everything for a few hours, because I was editing my site directly in production over FTP. And every time WordPress announces a new vulnerability in some JSON API or media attachments (all features that I don’t use), I miss running a website where I owned all of the code.

So on nights and weekends over the last 5 months, I’ve been working on a complete rewrite of RogerHub from the ground up. And you’re looking at it right now.

Why does it look exactly the same as before? Well, I lied. I didn’t rewrite the frontend or any of the website’s pages. But all the stuff under the hood that’s responsible for delivering this website to your eyeballs has been replaced with entirely new code4.

The rewrite replaces WordPress, NGINX, HHVM, Puppet, MySQL, and all the miscellaneous Python and Bash scripts that I used to maintain the website. RogerHub is now just a single Go program, running on 3 GCE instances, each with a PostgreSQL database, fronted by Google Cloud Load Balancer.

Although this website looks the same, I’ve made a ton of improvements behind the scenes that’ll make it easier for me to add features with confidence and reduce the amount of toil I perform to maintain the site. I’ll probably write more about the specifics of what’s new, but one of the most important things is that I can now easily run a local version of RogerHub in my apartment to test out new changes before pushing them live5. I’ve also greatly improved my rollout and rollback processes for new website code and configuration.

Does this mean I’ll start writing blogs again? Sure, probably.

I’m not done with the changes. I’ve only just finished the features that I thought were mandatory before I could migrate the live site over to the new system. I performed the migration last night and I’ve been working on post-migration fixes and cleanup all day today. It’s getting late, so I should just finish this post and go to sleep. But I’ll leave you with this nice photo. I used to end these posts with funny comics and reddit screencaps.

Tree branches and flowers in the fog.

It’s a little wider than usual, because I’m adding new features, and this is the first one.

  1. Two TODOs for me: memorize those escape codes and add support for automatic smart quotes in post titles ↩︎
  2. I used Google Sheets to template a long list of SQL queries, based on a phpMyAdmin dump that I copied and pasted into a spreadsheet. Then, I copied those SQL queries back into phpMyAdmin to import everything into WordPress. ↩︎
  3. By my count, I’ve answered more than 5,000 questions so far. The same $44 annual website fee is enough to run 2017’s RogerHub.com for about 2 weeks. ↩︎
  4. And that’s a big deal, I swear! ↩︎
  5. Gee, it’s 2017. Who would have thought that I still tested new code in production? ↩︎

Child prodigy

I watched a YouTube video this morning about a 13 year old boy who taught himself to make iPhone apps and got famous for it. He took an internship at Facebook and then started working there full-time. There were TV stations and news websites that interviewed him and wrote about how he’s helping his family financially and how any teenager can start making tons of money if they just learn to code. And the story was nice and inspiring and stuff, except there are tons of kids that do the same thing and nobody writes articles about any of them. He’s probably 18 or 19 now1 and still working at Facebook as a product manager. How’s he feeling now? On the other hand, I’m a college senior, dreading the day when I have to start working like a grown-up and wondering if I’ll miss college and confused why people can’t just stay in college forever. He never went to college. He had probably gotten accepted at lots of different schools (did he even get a chance to apply?), but he decided college wasn’t worth the opportunity to work at Facebook and pull his family out of their crappy financial situation. Cheers to him.

I felt exactly the same way in high school. But I didn’t have a compelling reason to start working or the balls to deviate from the Good Kid Story™. I started making websites when I was 10, and by the time I finished high school, I could churn out CRUD web applications like any other rank-and-file software developer. Part of me honestly thought that I could skip a few semesters of class once I got to Berkeley, because I already knew about for-loops and I could write Hello World in a handful of languages. I thought college was going to be the place where people learn about the less-useful theoretical parts of programming. They’d teach me what a tree was, even though I never had any reason to use anything but PHP’s ubiquitous ordered hash map. I thought it wouldn’t be anything that I wouldn’t have learned anyways, if I just kept writing more and more code. And I was partially right, but also very wrong.

Getting a proper CS education is really important, and I wouldn’t recommend that anybody drop out or skip college, just so they can start working, especially if there isn’t a strong financial reason to do so. However, there are two hard truths that people don’t like admitting about CS education: 1) most of the stuff taught to undergrads is also available on the Internet, and 2) most people who get a CS degree are still cruddy programmers. So, school isn’t irreplaceable and it’s not like attending school will magically transform you into a mature grown-up programmer. But that’s really not why getting a formal CS education is important.

After 7 semesters, it’s still hard to say exactly why people place a lot of value on getting a formal education in computer science. Most people need to be taught programming, because they have no experience and are in no shape to do anything productive with a computer. But for all the programming prodigies of the world, there needs to be another reason. I can say that I’m a much better programmer than I was four years ago. It always seems like the code I wrote the previous year is a pile of garbage2.

School forced me to learn things that I never would have learned on my own (because they were irrelevant to my own projects) nor would I have learned while working full-time (because they’d be irrelevant to the work I’d be doing). In high school, I had no idea people could write programs that did more than loading and saving data to a database. The classes I took actually expanded the range of what programs I thought were possible to write3.

When I taught myself things as a kid, I would enter a tight loop of learn-do-learn-do. Most of the code I wrote was an attempt to get the Thing working as easily as possible, which ended up leading to a lot of frustration and wasted time. It’s hard to piece together a system before you understand the fundamental concepts. And that sounds really obvious, but a lot of programming tutorials seem to take that approach. They’ll tell you how to do the Thing, but they don’t bother giving you any intuition about the method itself. On the other hand, college classes have the freedom to explain the Thing in the abstract. Then once you start doing it yourself, you’ll know exactly what to look for4.

It’s really unfair to make a teenager make their own decisions about work and college, because you really shouldn’t be punished for making stupid life choices as a kid. Teaching myself programming as a kid was useful, but frankly I was a terrible teacher. But I’ve gotten better at that as well. This is my 5th semester as a teaching assistant, and I’ve picked up all kinds of awesome skills, from public speaking to technical writing, not to mention actual pedagogy as well. I’ve spent literally a thousand hours working on my tooling, because college convinced me that it really does matter5.

They say that it takes 10 years to really master a skill. Well, this is going to be my 12th year as a computer programmer, and I still don’t feel like I’ve mastered anything. I guess everybody learns in a different way, but it really sucks that society has convinced teenagers that college is optional/outdated. It’s easy to lure teenagers away from education with money and praise, especially because it’s really hard to see the point of a formal education when your entire programming career is creating applications that are essentially pretty interfaces to a database6. It doesn’t help that college-educated programmers are sometimes embarrassed to admit that school doesn’t work for everyone.

I wonder if that iPhone kid is disappointed with the reality of working full-time in software development. The free food and absurd office perks lose their novelty quickly.

  1. I have no idea actually. ↩︎
  2. Some people say that’s a good thing? I’ve realized that code is the enemy. The more code you write, the more bugs you’ve introduced. It’s incredibly hard to write code that you won’t just want to throw out next year. Code is the source of complexity and security problems, so the goal of software engineers is to produce less code, not more. When you have a codebase with a lot of parts, it’s easy to break things if you’re not careful. Bad code is unintuitive. Good code should be resistant to bugs, even when bad programmers need to modify it. ↩︎
  3. Little kids always tell you that programmers need to be good at math, which actually doesn’t make that much sense when I think about it. You need some linear algebra and calculus for computer graphics and machine learning. Maybe you’ll need to know modular arithmetic and number systems. But math really isn’t very important. ↩︎
  4. A huge number of software bugs are caused by the programmer misunderstanding the fundamentals of the thing they’re interacting with. ↩︎
  5. My favorite programming tools in high school were Adobe Dreamweaver and Notepad. I started using Ubuntu full-time in 11th grade, but didn’t make any actual efforts to improve my tools until college. ↩︎
  6. Not to underestimate the usefulness of simple CRUD apps. ↩︎

Email surveillance

There’s a new article in the SF Chronicle that says the University of California, Office of the President (UCOP) has been monitoring emails going in and out of the UC system by using computer hardware. I wanted to give my personal opinion, as a computer programmer and somebody who has experience managing mail exchangers1. The quotes in the SF Cron article are very generous with technical details about the email surveillance system. Most of the time, articles about mass surveillance are dumbed down, but this one gives us at least a little something to chew on.

Email was not originally designed to be a secure protocol. Over the three (four?) decades that email systems have been used, computer people have created several extensions to the original SMTP2 and RFC 822 protocols to provide enough security to make email “good enough” for modern use. Most email today is exchanged under the protection of STARTTLS, which is an extension for SMTP that upgrades a cleartext connection to an encrypted connection, if both parties support it. The goal of STARTTLS is to provide resistance against passive monitoring. It doesn’t provide any guarantees about the authenticity of the other party, because usually the certificates aren’t validated, so STARTTLS is still vulnerable to MITM attacks3. There are other email-security extensions. But they’re either designed for ensuring authenticity rather than privacy (like SPF, DKIM, and DMARC) or they’re not widely used (like GPG).
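To make the "if both parties support it" part concrete, here is a minimal sketch of the opportunistic decision a sending mail server makes. The server reply, the hostname, and the function names are all made up for illustration. It only models the choice, not a real SMTP session: the server lists its extensions in its EHLO reply, and the sender encrypts only if STARTTLS is on that list. An active attacker who deletes that one line gets a cleartext connection, which is why this only protects against passive monitoring.

```python
def advertised_extensions(ehlo_reply: str) -> set[str]:
    """Return the extension keywords listed in a multi-line EHLO reply."""
    extensions = set()
    lines = ehlo_reply.strip().splitlines()
    # The first line is the server greeting. Every later line looks like
    # "250-KEYWORD [params]", or "250 KEYWORD [params]" for the last one.
    for line in lines[1:]:
        if not line.startswith("250"):
            continue
        rest = line[4:].strip()
        if rest:
            extensions.add(rest.split()[0].upper())
    return extensions


def choose_transport(ehlo_reply: str, client_supports_tls: bool = True) -> str:
    """Opportunistic policy: encrypt only if both sides support STARTTLS."""
    if client_supports_tls and "STARTTLS" in advertised_extensions(ehlo_reply):
        return "encrypted"
    return "cleartext"


# Hypothetical EHLO reply from a made-up server.
HONEST_REPLY = """250-mx.example.org greets you
250-SIZE 35882577
250-STARTTLS
250 8BITMIME"""

# The same reply after an active attacker has removed the STARTTLS line.
STRIPPED_REPLY = "\n".join(
    line for line in HONEST_REPLY.splitlines() if "STARTTLS" not in line
)

print(choose_transport(HONEST_REPLY))    # encrypted
print(choose_transport(STRIPPED_REPLY))  # cleartext
```

A sender that refused to fall back to cleartext would close this hole, but it would also fail to deliver mail to servers that don’t support STARTTLS at all.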

The only protection we have against passive snooping of emails is STARTTLS. According to the SF Cron article, the “intrusive device” installed at UC campuses is intended to capture and analyze traffic, rather than intercepting and modifying it. So, I took a look at some of the emails I’ve received at my personal berkeley.edu address over the last 3.5 years of living in Berkeley. I looked specifically at the advertising emails I get from Amazon.com, because I’ve been receiving them consistently for many years, and they always come from the same place (Amazon SES). All of my most recent emails from Amazon follow this path, according to the email headers:

  • Amazon SES
  • UC Berkeley Mail Server “ees-ppmaster-prod-01”
  • 3 local mail filters, called “pps.reinject”, “pps.reinject”, and “pps.filterd”
  • UC Berkeley Mail Server “ees-sentrion-ucb3”
  • Google Apps Mail Server

Before April 2015, another UC Berkeley Mail Server was part of this path, in between the “sentrion” server and the Google Apps server. Before December 2014, the path looked completely different. There was only a single server between SES and Google, which was labeled “cm06fe.ist.berkeley.edu”.

According to the email headers, each step along the path is encrypted using STARTTLS, except for some of the local mail filters. Those 3 local mail filters are programs that run on the UC Berkeley Mail Server which might do things like scanning for viruses or filtering spam. They don’t exactly need encryption, because they don’t communicate over the network. I also noticed that before May 2015, there was only 1 local mail filter (the “pps.filterd” one) instead of 3.
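If you want to trace the path of your own mail, the hops are recorded in the Received headers, one per server, with the newest at the top. Below is a rough sketch of how that could be automated in Python. The sample message is entirely invented (the hostnames, IDs, and dates are placeholders, not my real headers), and the regular expression is a simplification that assumes each header has the usual "from ... by ... with ..." shape. It treats a hop as encrypted when the "with" clause is ESMTPS or a similar keyword, which is the registered way for a server to say that STARTTLS was used. Some servers describe TLS in a comment instead, and this sketch would miss those.

```python
import re
from email import message_from_string

# Hypothetical message: every hostname and hop here is invented.
RAW_MESSAGE = """\
Received: from relay2.example.edu (relay2.example.edu [192.0.2.20])
\tby inbound.example.com with ESMTPS id abc123
\t(version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256);
\tMon, 01 Feb 2016 10:00:03 -0800 (PST)
Received: from localhost (localhost [127.0.0.1])
\tby relay2.example.edu with ESMTP id def456;
\tMon, 01 Feb 2016 10:00:02 -0800 (PST)
Received: from sender.example.net (sender.example.net [198.51.100.7])
\tby relay1.example.edu with ESMTPS id ghi789;
\tMon, 01 Feb 2016 10:00:01 -0800 (PST)
From: ads@example.net
To: student@example.edu
Subject: Hypothetical advertisement

Body text.
"""

# "with" keywords whose trailing S means the hop used STARTTLS.
TLS_PROTOCOLS = {"ESMTPS", "ESMTPSA", "LMTPS", "LMTPSA"}

HOP_PATTERN = re.compile(
    r"from\s+(\S+).*?\bby\s+(\S+).*?\bwith\s+(\S+)", re.IGNORECASE
)


def trace_hops(raw_message):
    """Return (sender, receiver, protocol, used_tls) tuples, oldest hop first."""
    message = message_from_string(raw_message)
    hops = []
    # Each server prepends its header, so reverse to get chronological order.
    for value in reversed(message.get_all("Received", [])):
        flat = " ".join(value.split())  # undo header folding
        match = HOP_PATTERN.search(flat)
        if match is None:
            continue  # skip headers that do not fit the simple shape
        sender, receiver, protocol = match.groups()
        protocol = protocol.rstrip(";").upper()
        hops.append((sender, receiver, protocol, protocol in TLS_PROTOCOLS))
    return hops


for sender, receiver, protocol, used_tls in trace_hops(RAW_MESSAGE):
    label = "TLS" if used_tls else "no TLS"
    print(f"{sender} -> {receiver} ({protocol}, {label})")
```

In the made-up sample, the middle hop comes from localhost and shows up without TLS, which is the same pattern as the local mail filters described above.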

The SF Cron article mentions that email surveillance started after attacks on UCLA Medical Center, which occurred in July 2015. Unfortunately, nothing significant seems to have changed in the email headers between June and October of 2015. But the use of STARTTLS, even within UC Berkeley’s own networks, casts doubt on the idea that UCOP surveillance was implemented as passive network monitoring.

If the surveillance was implemented at the network level, it would have to proxy the SMTP connections between all of the “ppmaster” and “sentrion” servers, as well as spoof the source IP or routing tables or reverse DNS lookup tables of the entirety of Berkeley’s local email network. It’d be an unnecessarily sophisticated method, if they just wanted to hide the presence of surveillance hardware.

On the other hand, if surveillance was implemented with the cooperation of campus IT staff, it would be pretty simple to implement for all emails campus-wide. There are already plenty of unlabeled local mail filters in place. These could easily be configured to forward an unencrypted copy of all emails to a 3rd party vendor’s system, for monitoring and analysis. Additionally, “sentrion”, which probably refers to SendMail’s Sentrion product, looks like it was expressly designed for the purpose of recording and analyzing large amounts of email.

There are a couple of problems if email monitoring really were implemented on the mail servers themselves with the cooperation of campus IT staff. If this is really the case, then it would require another system to monitor web traffic, which doesn’t seem to be explained in the article. Or perhaps, the claim that web traffic was being monitored is incorrect4.

I’ve always accepted that work email should be considered the property of your employer. Your personal stuff should stay on your personal cell phone and email accounts. However, students are not employees of the University5. I don’t know much about law, but I feel like FERPA was passed to address these kinds of privacy questions regarding students and academic institutions. Implementing mass email surveillance without consulting faculty and students, regardless of its legality, seems underhanded and embarrassing for what claims to be the number one public university in the world.

  1. I’m currently a student and (technically?) an employee of UC Berkeley. But these opinions are my own. ↩︎
  2. The Simple Mail Transfer Protocol, which is used to deliver all publicly-routed email. ↩︎
  3. Man-in-the-middle attacks ↩︎
  4. Most web traffic (including RogerHub) goes through HTTPS today anyway. Monitoring web traffic without a MITM proxy would be ineffective. ↩︎
  5. Unless you happen to be both. ↩︎

Website updates

Last December was the biggest month for RogerHub ever. We served over 4 million hits, which consumed over 3 terabytes of bandwidth. By request, we released the 6th calculator mode, “lowest test dropped”, to the public. But during the same month, we experienced the biggest outage that has ever happened on RogerHub, which affected over 60,000 visitors, and the number of total spam comments has nearly doubled. I keep using “we”, even though this is a one-man operation, because these seasonal surges of traffic feel a lot bigger than just me. Toward the end of the month, my hosting provider Linode was targeted by several large DDoS attacks across all their US datacenters. RogerHub is run in 2 Linode locations: Dallas, TX and Fremont, CA. However, only one location is active at any time. The purpose of the inactive location is to take over the website when the primary location goes offline. There are a lot of reasons why a Linode datacenter could fail, including physical issues with Linode machines, power outages, and network connectivity issues. During the recent DDoS attacks, Linode came very close to being offline in both Dallas and Fremont, which would have caused issues for this site. There’s another wave of traffic in January, for people who have finals after Winter Break, and it’s important that RogerHub doesn’t have an outage then.
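The active/inactive arrangement boils down to a very small decision rule. This is only a sketch of that rule, not the actual failover mechanism behind RogerHub: the function name, the location labels, and the health dictionary are placeholders, and how the health of each location gets measured is left out entirely.

```python
def pick_active(locations, healthy, current):
    """Keep the current location while it is healthy; otherwise fail over.

    locations: location names in order of preference (hypothetical labels).
    healthy:   dict mapping location name to True/False.
    current:   the location that is serving traffic right now.
    """
    if healthy.get(current, False):
        return current
    for name in locations:
        if healthy.get(name, False):
            return name
    return None  # every location is down, so the site is offline


LOCATIONS = ["dallas", "fremont"]

print(pick_active(LOCATIONS, {"dallas": True, "fremont": True}, "dallas"))    # dallas
print(pick_active(LOCATIONS, {"dallas": False, "fremont": True}, "dallas"))   # fremont
print(pick_active(LOCATIONS, {"dallas": False, "fremont": False}, "dallas"))  # None
```

The last case is the one the DDoS attacks came close to triggering: a standby location doesn’t help when both locations are with the same provider and the provider is the target.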

I’ve been working on new stuff for RogerHub. I’ve decreased the payload size of the most popular pages by paginating comments. It took a while before I found a solution that both provided a pleasant user experience and allowed the comment text to be easily indexed. I’ve made the site a bit wider, and I’ve reduced the amount of space around the leaderboard ad on desktop browsers. I’ve improved the appearance of buttons on the site, and I’ve given the front page a new look. Finally, I’ve migrated RogerHub from Linode to Google Compute Engine and enabled HTTPS for the entire site.

RogerHub is using GCE’s global HTTP load balancer to terminate HTTPS connections at endpoints that are very close geographically to visitors. Google is able to provide this with their BGP anycast content distribution network. With HTTPS also comes support for SPDY and HTTP/2 on RogerHub, which remove some of the performance quirks associated with plain HTTP. I’ve also converted all my ad units to load asynchronously. My use of HTTPS and GCE’s global HTTP load balancer also makes it trickier to block RogerHub on academic WiFi networks, especially on non-school-owned equipment, where TLS interception is out of the question.

You might think it’s silly to run third party ads under HTTPS, since advertising destroys any client-side security you might claim to offer and many ad networks still don’t fully support HTTPS. I’ve always had mixed feelings about the advertising on RogerHub. Advertising covers my server costs, and I wouldn’t be able to run this site without advertising revenue. But poorly-designed advertising can ruin the user’s experience, especially on mobile devices. I’m only interested in the most unobtrusive online advertising for my website, and I try very hard to make sure that expanding ads, auto-playing video ads, and noise-making ads never get served from RogerHub. During the last few months, I’ve removed the main leaderboard ad for mobile users and I’ve removed ads from the home page as well.

In other news, a lot of RogerHub’s sites have been shut down, including the Wiki and a bunch of miscellaneous things you’ve probably never looked at. My coding blog has been turned into static HTML, but is still available1. This is my only blog left (also my first blog), so I might use it again some time soon.

  1. There are currently some mixed-content warnings on it, but I’ll fix them soon. ↩︎

Grown ups

I haven’t posted anything to my Tumblr blog in 649 days, but in that time I’ve gained maybe 50 new followers, and they’re all strangers. I don’t think any of them are bots either. They found a link on my homepage and maybe they decided I would some day post something again. Sometimes, I click on their profile picture and check out their Tumblr blogs too. I open up web inspector and grab the URL of their avatar thumbnail, and then I change the _128 suffix to _512, because I knew that Tumblr offered avatar thumbnails with sizes in powers of 2, between 32 and 512. And then I remembered that a few years ago I built a tool to uncover Tumblr avatars and put it on RogerHub, and suddenly it feels kind of creepy checking out 512px thumbnails of strangers’ avatars, because most of them probably don’t know avatar thumbnails go up to that size.

It’s summer now, and it has been half a year since I wrote anything here on RogerHub, so I suppose I owe you an update about what’s new with me1. I feel more clumsy with words than I felt in high school, which was when I wrote new posts on this blog every week or so. It’s a side effect of sitting in a chair at work every day with my earbuds in my ears and having very little conversation with other actual people. Even when I talk during the workday, the talking is usually about computer stuff, which doesn’t help with normal talking that much. There was a time last summer break when I felt like I had gotten really terrible at Scrabble, because all of the words that I thought of were computer words or acronyms that didn’t count as legal words. I might have told you about that already, sorry.

Working an internship has its pros and cons. The company really spoils its interns, and when I get home, I don’t have any homework. This opens up my schedule to cook more2 and also to go hang out with my friends. I don’t have to spend time in the nasty parts of Berkeley. I can read my Kindle a little bit more, and I can focus on my health.

On the other hand, I miss school and TA’ing for my class. I miss when all the projects were easy and understandable and written terribly. I miss using my own laptop and spending time on my personal data backup system and my text editor configuration3. I miss coming home to roommates that I actually talk to, and I miss living in walking distance of a lot of people.

Hm, so far, it seems like I’m just whining about missing a whole bunch of things. I suppose there are other cons to working an internship too.

I have to share my room with another person, but it’s not terrible. My roommate is cool, and I already sort of knew him from Berkeley. The internet speed sucks, and the connection is kind of unreliable. I don’t have space to set up all my tech, and I don’t feel that comfortable ordering stuff online here. The internship comes with its own kind of stress, because I want to do a good job and feel competent, but it’s not easy. I doze off sometimes, because I’m used to working for myself, not somebody else. It was always stuff for RogerHub or building out cool infrastructural stuff I wanted to have. Even when I was working on things for students and grading, it felt like working for myself. I have trouble seeing the big picture of what I’m contributing to.

I guess all these cons aren’t hard to fix. I can make new friends, and I can try to relax more at work. There isn’t much I can do about the internet speed, but I’ll just have to learn to live with that. Maybe I just need an actual vacation.

One thing I can’t stop thinking about is the possibility that right now, I’m just hanging on until the semester starts again. If that’s true, then in 1 year’s time, I’ll be in this same position again (minus some of the intern perks), but without that reassurance that in less than two months, everything will be back to the way it was. The start of the semester means people will come back together in Berkeley again. It’s not like doing fun student things was so much better than doing internship things. Objectively, there is a lot of crap that students have to do that isn’t fun at all. I have to sit through humanities classes that put me to sleep, and I have to do CS projects, even if I don’t feel like they’re interesting or educational. Once you’re an adult, you get to cut a lot of the bullshit that kids have to deal with, because you always get a choice, and nobody can make you sit through something so boring that it puts you to sleep4. Also, it isn’t like I see my friends every day, or even every week, when I’m at school. There’s some people I haven’t seen all semester long, so why is not-seeing-them at school better than not-seeing-them in this corporate-provided San Francisco apartment?

I’m kind of disappointed at the percentage of adults that seem to be excited for the next day, every night, compared to the percentage of my school friends who are. Why is it so hard to keep friends and not be sad when you’re an adult? I wish I had my calendar and to-do list back by my side, and I wish I actually had stuff to put on them. Sorry to end with something sad, but being an adult sounds like it totally sucks.

  1. It’s not like you can find out via Facebook or anything. ↩︎
  2. I have been making breakfast almost every morning for the last five weeks, and on the weekends, I make all three meals for myself. There’s a Safeway across the street. ↩︎
  3. I can do this at work actually. ↩︎
  4. Ok, I guess your boss can threaten to fire you, but you have a choice to get a new job. ↩︎

Notes and reminders

This is Notes.app, which I use to save rich text and organize ideas. I like it because it’s not a website, it’s a native OS X app. And because it opens in a small window that fits on the side of the screen, I feel creative and comfortable writing notes here.

Notes.app on my desktop.

But it doesn’t sync with my Android phone. It only syncs with an iCloud account, and I don’t use iCloud for anything except iTunes purchases and this. It’s also a little buggy with too much rich text.

I use vim for all my text editing1, and I wanted to use vim for notes too. But it didn’t work out. Rich text lets me put in checkmarks like ✔︎, and I can start a bulleted list with wiki-style syntax. There’s a font color palette, and you can paste images and headings into it directly from Safari. I have a ton of places to write stuff, but this one is my favorite.

I also tried Google Keep, but the mobile app is so clunky and there’s no native OS X app. The web interface is awkward too. I don’t like sticky notes. They feel like a lazy way to make reminders.

There’s also a Tasks system that is integrated with Gmail and Google Calendar. I’ve been using that one for a long time, but there is no mobile app. I purchased a third-party mobile app for Tasks2, and I’ve been using that for a long time too. But there’s no native OS X app. The next best thing is a full page web interface for Tasks, which isn’t too bad.

I guess I’ll never be able to see my notes on my phone, or edit my todo list with a native OS X app. It’s the end of the semester anyway, so my todo list is getting shorter and shorter.

  1. I wrote an essay in vim. It has spell check and text wrapping. Pretty good! ↩︎
  2. It’s called Tasks (surprise). ↩︎