The data model of Hubnext

I got my first computer when I was 8. It was made out of this beige-white plastic and ran a (possibly bootlegged) copy of Windows ME¹. Since our house had recently gotten DSL installed, the internet could be on 24 hours a day without tying up the phone line. But I didn’t care about that. I was perfectly content browsing through each of the menus in Control Panel and rearranging the files in My Documents. As long as I was in front of a computer screen, I felt like I was in my element and everything was going to be alright.

Computers have come a long way. Today, you can rent jiggabytes of data storage for literally pennies per month (and yet iPhone users still constantly run out of space to save photos). For most people living in advanced capitalist societies, storage capacity has been permanently eliminated as a reason why you might consider deleting any data at all. For people working in tech, there’s a mindset known as “big data”, where businesses blindly hoard all of their data in the hope that some of it will become useful at some time in the future.

On the other hand, I’m a fan of “small data”. It’s the realization that, for many practical applications, the amount of useful you have is dwarfed by the overwhelming computing and storage capacity of modern computers. It really doesn’t matter how inefficient or primitive your programs are, and that opens up a world of opportunities for most folks to do ridiculous audacious things with their data².

When RogerHub ran on WordPress, I set up master-slave database and filesystem replication for my primary and replica web backends. WordPress needs to support all kinds of ancient shared hosting environments, so WordPress core makes very few assumptions about its operating environment. But WordPress plugins, on the other hand, typically make a lot of assumptions about what kinds of things the web server is allowed to do³. So the only way to really run WordPress in a highly-available configuration is to treat it like a black box and try your best to synchronize the database and filesystem underneath it.

RogerHub has no need for all of that complexity. RogerHub is small data. Its 38,000 comments could fit in the system memory of my first cellphone⁴ and the blobs could easily fit in the included external microSD card. But perhaps more important than the size of the data is how simple RogerHub’s dataset is.

Database replication comes with its own complexities, because it assumes you actually need transaction semantics⁵. Filesystem replication is mostly a crapshoot with no meaningful conflict resolution strategy for applications that use disk like a lock server. But RogerHub really only collects one kind of data: comments. The nice thing about my comments is that they have no relationship to each other. You can’t reply directly to other comments. Adding a new comment is as simple as inserting it in chronological order. So theoretically, all of this conflict resolution mumbo jumbo should be completely unnecessary.

I call the new version of RogerHub “hubnext” internally⁶. Hubnext stores all kinds of data: comments, pages, templates⁷, blobs⁸, and even internal data, like custom redirects and web certificates. Altogether, these different kinds of data are just called “Things”.

One special feature of hubnext is that you can’t modify or delete a Thing, once it has been created (e.g. an append-only data store). This property makes it really easy to synchronize multiple sets of Things on different servers, since each replica of the hubnext software just needs to figure out which of its Things the other replicas don’t have. To make synchronization easier, each Thing is given a unique identifier, so hubnext replicas can talk about their Things by just using their IDs.

Each hubnext replica keeps a list of all known Thing IDs in memory. It also keeps a rolling set hash of the IDs. It needs to be a rolling hash, so that it’s fast to compute H(n₁, n₂, …, n_k, n_k+1), given H(n₁, n₂, …, n_k) and n_k+1. And it needs to be a set hash, so that the order of the elements doesn’t matter. When a new ID set added to the list of Thing IDs, the hubnext replica computes the updated hash, but it also remembers the old hash, as well as the ID that triggered the change. By remembering the last N old hashes and the corresponding Thing IDs, hubnext builds a “trail of breadcrumbs” of the most recently added IDs. When a hubnext replica wants to sync with a peer, it sends its latest N hashes through a secure channel. The peer searches for the most recent matching hash that’s in both the requester’s hashes and the peer’s own latest N hashes. If a match is found, then the peer can use its breadcrumbs to generate a “delta” of newly added IDs and return them back to the requester. And if a match isn’t found, the default behavior is to assume the delta should include the entire set of all Thing IDs.

This algorithm runs periodically on all hubnext replicas. It’s optimized for the most common case, where all replicas have identical sets of Thing IDs, but it also works well for highly unusual cases (for example, when a new hubnext replica joins the cluster). But most of the time, this algorithm is completely unnecessary. Most writes (like new comments, new blog posts, etc) are synchronously pushed to all replicas simultaneously, so they become visible to all users globally without any delay. The synchronization algorithm is mostly for bootstrapping a new replica or catching up after some network/host downtime.

To make sure that every Thing has a unique ID, the cluster also runs a separate algorithm to allocate chunks of IDs to each hubnext replica. The ID allocation algorithm is an optimistic majority consensus one-phase commit with randomized exponential backoff. When a hubnext replica needs a chunk of new IDs, it proposes a desired ID range to each of its peers. If more than half of the peers accept the allocation, then hubnext adds the range to its pool of available IDs. If the peers reject the allocation, then hubnext just waits a while and tries again. Hubnext doesn’t make an attempt to release partially allocated IDs, because collisions are rare and we can afford to be wasteful. To decide whether to accept or reject an allocation, each peer only needs to keep track of one 64-bit ID, representing the largest known allocated ID. And to make the algorithm more efficient, rejections will include the largest known allocated ID as a “hint” for the requester.

There are some obvious problems with using an append-only set to serve website content directly. To address these issue, each Thing type contains (1) a “last modified” timestamp and (2) some unique identifier that links together multiple versions of the same thing. For blobs and pages, the identifier is the canonicalized URL. For templates, it’s the template’s name. For comments, it’s the Thing ID of the first version of the comment. When the website needs to fetch some website content, it only considers the instance of the data with the latest “last modified” timestamp among multiple Things with the same identifier.

Overall, I’m really satisfied with how this data storage model turned out. It makes a lot of things easier, like website backups, importing/exporting data, and publishing new website content. I intentionally glossed over the database indexing magic that makes all of this somewhat efficient, but that’s nonetheless present. There’s also an in-memory caching layer for the most commonly-requested content (like static versions of popular web pages and assets). Plus, there’s some Google Cloud CDN magic in the mix too.

It’s somewhat unusual to store static assets (like images and javascript) in a relational database. The only reason why I can get away with it is because RogerHub is small data. The only user-produced content is plaintext comments, and I don’t upload nearly enough images to fill up even the smallest GCE instances.

Anyway, have a nice Friday. If I find another interesting topic about Hubnext, I’ll probably write another blog post like this one soon.

A bridge in Kamikochi, Japan.

But not for long, because I found install disks for Windows 2000 and XP in the garage and decided to install those. ↩︎
I once made a project grading system for a class I TA’ed in college. It ran on a SQLite database with a single global database lock, because that was plenty fast for everybody. ↩︎
Things like writing to any location in the web root and assuming that filesystem locks are real global locks. ↩︎
a Nokia 5300 with 32MB of internal flash ↩︎
I’ve never actually seen any WordPress code try to use a transaction. ↩︎
Does “internally” even mean anything if it’s just me? ↩︎
Templates determine how different pages look and feel. ↩︎
Images, stylesheets, etc. ↩︎

Post a Comment