The serving path of Hubnext
I’ve always liked things straight and direct to the point. But I feel a bit silly writing that here, since most of my blog posts from high school used a bunch of long words for no reason. My writing did eventually become more concise, but it happened around the time I also sort of stopped blogging. I stopped writing altogether for a few years, but then in college, I started doing an entirely different kind of writing. I wrote homework assignments and class announcements for the computer science classes as a teaching assistant. Technical writing is different from blogging, but writing for a student audience also has its own unique challenges. The gist of it is that students don’t like reading, which is naturally at odds with the long detailed project specs that some of our projects require. It’s the instructor’s responsibility to make sure the important details stand out from the boring reference material.
In college, I also started eating lunch and dinner on the go regularly, which is something I had never really done before. It was a combination of factors that made walk-eat meals so enticing. The campus was big, so getting from place to place usually involved a bit of walking. Street level restaurants sold fast food in convenient form factors. Erratic scheduling meant it was unusual to be with another person for more than a couple hours at a time, so sit-down meals didn’t always make sense. I got really good at the different styles of walk-eating, from gnawing on burritos without spilling to balancing take-out boxes between my fingers. Eating on my feet felt like the most distilled honest form of feeding. It didn’t make sense to add so many extra steps to accomplish essentially the same task.
For a long time, the serving path of this website was too complicated. The serving path is all the things directly involved in delivering web pages to visitors, which can be contrasted with auxiliary systems that might be important, but aren’t immediately required for the website to work. I made my first website more than 12 years ago. Lots of things have changed since then, both with me and with the ecosystem of web development in general. So when I set out to reimplement RogerHub, I wanted to eliminate or replace as many parts of the serving path as I could.
Let’s start with the basics. Most of the underlying infrastructure for RogerHub hasn’t changed since I moved the website from Linode to Google Cloud Platform in January 2016. I use Google’s Cloud DNS and HTTP(S) Load Balancing to initially capture inbound traffic. Google’s layer 7 load balancing provides global anycast, geographic load balancing, TLS termination, health checking, load scaling, and HTTP caching, among other things. It’s the best offering available at its low price point, so there’s little reason to consider alternatives1. However, HTTP(S) Load Balancing did have a 2 hour partial outage last October. I don’t get much traffic in October, but the incident made me start thinking of ways I could mitigate a similar outage in the future.
Behind the load balancer, RogerHub is served by a number of functionally identical backend servers (currently 2). These servers are fully capable of serving user traffic directly, but they currently only accept traffic from Google’s designated load balancer IP ranges. During peak seasons, I usually scale this number up to 3, but the extra server costs me about $50 a month, so I usually prefer to run just 2. These extra servers exist entirely for redundancy, not capacity. A single server could serve a hundred times peak load with no problem2.
Each server is a plain old Ubuntu host running an instance of PostgreSQL and an instance of the Hubnext application. That’s it. There’s no reverse proxy, no memcached, no mail server, and not even a provisioner (Hubnext knows how to install and configure its own dependencies). Hubnext itself can run in multiple modes, most of which are for maintenance and debugging. But my web servers start Hubnext in “application” mode. When Hubnext runs in application mode, it starts an ensemble of concurrent tasks, only one of which is the web server. The others are things like an unique ID allocator, peer to peer data syncing, system health checkers, maintenance daemons, and half a dozen kinds of in-memory caches. Hubnext can renew its own TLS certificate, create backups of its data, and even page me on my iPhone when things go wrong. Before Hubnext, these tasks were handled by haphazard cron jobs and independent services, which were set up and configured by a Puppet provisioner. Keeping all these tasks in a single application has made developing new features a lot simpler.
So why does Hubnext still need PostgreSQL? It’s true that Hubnext could have simply kept track of its own data, along with maintaining the various indexes that make common operations fast. But it’s an awful lot of work and unneeded complexity to implement something that a database already does for free. Of all the components of a traditional website’s architecture, I chose to keep the database, because I think PostgreSQL pulls its weight more than any of the other systems that Hubnext does supplant. That being said, Hubnext intentionally doesn’t use the transactions or replication that PostgreSQL provides (e.g. the parts of a database most sensitive to failure). Instead, Hubnext’s data model is designed to work without multi-statement transactions and Hubnext performs its own application level data replication, which is probably easier to configure and troubleshoot than database replication3.
When a request arrives at a Hubnext server, it gets wrapped in a tracker (for logging and metrics) and then enters the web request router. Instead of matching by request path, the Hubnext request router gives each registered web handler a chance to accept responsibility for a request before it falls through to the next handler. This allows for more complex and free-form routing logic. The web router starts with basic things like an OPTIONS handler and DoS rejection. The vast majority of requests are handled by the Hubnext blob cache and page cache. These handlers keep bounded in-memory caches of popular blobs (images, JavaScript, and CSS) and pre-rendered versions of popular pages. But even on a cache hit, the server still runs some code to process things like cache headers, ETags, and compression.
Blobs whose request path starts with /_version_
get long cache expiration times, which instructs Google Cloud CDN to cache them. Overall, about 40% of the requests to RogerHub are served from the CDN without ever reaching Hubnext. You can distinguish these cache hits by their Age header. ETags are generated from the SHA256 hash of the blob’s path and payload. Compression is applied to blobs based on a whitelist of compressible MIME types. The blob cache holds both the original and the gzipped versions of the payload, along with pre-computed ETags and other metadata, like the blob’s modification date.
All of the remaining requests4 eventually make their way to PostgreSQL for content. As I said in a previous blog post, Hubnext stores all of its content in PostgreSQL, including binary blobs like images and PDFs. This isn’t a problem, because Hubnext communicates with its database over a local unix socket and most blobs are cached in memory anyway. This does prevent Hubnext from using sendfile to deliver blobs like filesystem-based web servers do, but there aren’t many large files hosted on RogerHub anyway.
At this point, nearly all requests have already been served by one of the previously mentioned handlers. But the majority of Hubnext’s web server code is dedicated to serving the remaining fraction. This includes archive pages, search results, new comment submissions, the RSS feed, and legacy redirect handlers. If all else fails, Hubnext gives up and returns a 404. The response is sent back to the load balancer over HTTP/1.1 with TLS5 and then forwarded on to the user. Thus completes the request.
Is the serving path shorter than you imagined? I think so6. More code sometimes just means additional maintenance burden and a greater attack surface. But altogether, Hubnext actually consists of far less code than the code it replaces. Furthermore, it’s nearly all code that I wrote and it’s all written in a single language as part of a single system. Say what you want about Go, but I think it’s the best language for implementing traditional CRUD services, especially those with a web component7. Hubnext is a distillation of what it means to deliver RogerHub reliably and efficiently to its audience, without the frills and patches of a traditional website. Anyway, I hope this post was a good distraction. Until next time!
- Plus, I happen to work for Google (but I didn’t when I first signed up with GCP), so I hear a lot about these services at company meetings. ↩︎
- If required, Hubnext could theoretically run on 10-20 instances with no problem. But the peer to peer syncing protocol is meant to be a fully connected graph, so at some point, it might run into issues with too much network traffic. ↩︎
- Hubnext also doesn’t use PostgreSQL’s backup tools, because Hubnext can create application-specific backups that are more convenient and understandable than traditional database backups. ↩︎
- Some pages simply aren’t cached, like search result pages and 404s. ↩︎
- Google HTTP(S) Load Balancing uses HTTP/1.1 exclusively when forwarding requests to backend servers, although it prefers HTTP/2 on the front end. ↩︎
- But for several years, RogerHub just ran as PHP code on cheap shared web hosts, which makes even this setup seem overly complicated. But it’s actually quite a bit of work to run and maintain a modern PHP environment. You’ll probably need a dozen C extensions and a reverse proxy and perhaps a suite of cron jobs to back up all your files and databases. For a long time, I followed a bunch of announcement mailing lists to get news about security fixes in PHP and WordPress and MediaWiki and NGINX and MySQL and all the miscellaneous packages on ubuntu-security-announce. This new serving path means I really only need to keep an eye out on security announcements for Go (which tend to be rare and fairly mild) or OpenSSH or maybe PostgreSQL. ↩︎
- After all, that covers a lot of what Google does. ↩︎
2 CommentsAdd one
Hi Roger,
My semester ends in two weeks. I have an 83 now on X2. I am taking a test that counts as 85% of my overall grade. What do I need on the test to have at least a 90% as my overall grade
Roger: How many tests have you taken so far? And what did you score on them? Is each of your tests worth an equal fraction of your “tests” category?
HELP! I HAVE A WEEK UNTIL SCHOOL ENDS! I HAVE A 70.26% IN SPANISH CLASS!! I NEED AN 80% TO PASS! THE ONLY THING LEFT IS A CW ASSIGNMENT. ITS WORTH 70 POINTS AND CW IS 40% WHAT DO I NEED?!?!? If i get 70/70 will i get to the 80%????????????????
Roger: How many classwork points do you have so far? And what is your classwork average? Also, do you have any empty categories in your grade? (For example, a “participation” category that has no score yet.)