Drop shadows and gradients look great when you use them correctly, but when your document is 600,000px tall, they create serious browser lag no matter how modern your hardware may be. You get the fastest renderer performance with simple solid colors. Aesthetics are a small sacrifice when you’re trying to squeeze more speed out of your markup.
I keep all of my calculator's comments on a single page because 1) that's the way it has always been, and 2) SEO. However, many users never even look at those comments. It improves browser performance if you simply instruct the browser not to display most of that markup until the comments are actually requested, which brings me to the next point.
Rather than applying a universal .comment-hidden class to all older hidden comments, put hidden elements under a single unifying parent and apply styles to the parent instead. It's much faster to style one parent element than a thousand children.
The comments on RogerHub no longer support Gravatar-based photos, since literally hundreds of unique avatars were being requested and rendered on each page load. Since I didn’t want to take out the images entirely, everybody now gets a generic anonymous avatar. Much of the original comment markup has also been taken out, leaving only the bare necessities required to properly apply styles.
I don’t have any hard data behind this practice, but intuitively I feel that simple unnested classes are the fastest way to go about styling an arbitrary number of elements. Avoid tag selectors, deeply nested selectors, wildcard selectors, attribute/value selectors, etc.
The CMS's production server did not actually have enough free disk space to create a tarball copy of the application. But since there were on the order of 10,000 files involved, a recursive copy via scp would be too slow. In order to speed up the transfer process, I used a little GNU/Linux philosophy and piped a few things together:
$ screen
$ ssh root@domain "tar -c /srv/archive" | tar xvf -
I decided that compression was not going to be very helpful because most of the data was jpeg and png images, which are not very compressible. Enabling compression would just slow things down, since the bottleneck would become the CPU rather than the network.
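Had compression been worth it, it could have been layered into the same pipeline by adding gzip on both ends, something like this (shown only for comparison, with the archive written explicitly to stdout):

$ ssh root@domain "tar -czf - /srv/archive" | tar -xzf -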
The CMS had very few dependencies, which is not surprising given the state of PHP 10 years ago. I set up a simple nginx+php-fpm+mysql configuration with a single PHP worker process. The crawl operation would be executed serially anyway, so multiple workers would not be useful.
I also added an entry to the VM's hosts file for the hostname of the production server. The server's hostname was hardcoded in a few places, and I didn't want the crawl operation sending requests out to the Internet. Additionally, I set up a secondary web server configuration that served generated HTML from the output directory and static assets from the application data, so that I could preview the results as they were being generated.
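The hosts entry itself was the simple part. Assuming the hardcoded hostname is the same archive.dailycal.org name used in the seed URLs below, the entry would look something like this, pointing the name back at the VM's own web server:

$ echo "127.0.0.1 archive.dailycal.org" >> /etc/hosts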
Generating the static pages was the hardest part of the archival process, and it took me around 5 retries to get it right. The primary purpose of the whole archival operation was to remove the dynamic portions of the web application. This meant that every single conceivable page request had to be run against the application and the response saved as a static file. I picked wget as my archival tool.
Articles in the CMS were stored in one of two places. However, the format of the URL was luckily the same for both. I dumped article IDs from the MySQL database and created a seed file of article URLs:
$ echo "SELECT article_id FROM dailycal.article;" | mysql -u root > article_ids
$ echo "SELECT id FROM dailycal.h_articles;" | mysql -u root >> article_ids
$ sed -e 's/^/http:\/\/archive.dailycal.org\/article.php?id=/' -i article_ids
I didn’t see much point in setting a root password for the local MySQL installation, since this was a single-use VM anyway. Sed ate through the 65,326 article IDs in seconds. I then created a second seed file containing just the URL of the application root, from where (nearly) all other pages would be crawlable.
On the crawler’s final run, I set the following command line switches:
--trust-server-names – Sets the output file name according to the last request in a redirection chain. By default, wget uses the first request.
--append-output=generator.log – Outputs progress information to a file, so that I can run the main process in screen and monitor it with tail in follow mode.
--input-file=source.txt – Specifies the seed file of URLs.
--timestamping – Sets the file modification time according to HTTP headers, if available.
--no-host-directories – Disables the creation of per-host directories.
--default-page=index.php – Defines the name for request paths that end in a slash.
--html-extension – Appends the html file extension to all generated pages, even if another extension already exists.
--domains archive.dailycal.org – Restricts the crawl to only the application domain.

Additionally, I set the following switches to crawl through the links on the application's root page:

-r – Enables recursive downloading.
--mirror – Sets infinite recursion depth and other useful options.
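Put together, the final invocation would have looked roughly like the following. This is reconstructed from the switches above, so treat it as a sketch rather than a verbatim copy of the command I ran:

$ wget -r --mirror --trust-server-names --timestamping --no-host-directories \
    --default-page=index.php --html-extension --domains archive.dailycal.org \
    --append-output=generator.log --input-file=source.txt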
In total, 543MB of non-article HTML and 2.1GB of article HTML were generated. These are reasonable sizes, given how many URLs were crawled in total, but they were still a bit unwieldy to store. I looked for a solution.
I knew a couple of facts about the generated HTML: the pages consisted mostly of repetitive template markup, and they would only ever need to be read back one page at a time.
Both of these factors made the HTML a good candidate for archive compression. My first thought was tar+gzip, but gzip compresses the tar archive as a single stream, not file by file. To extract a single file, you'd need to decompress all the data up to that file. A request for the last file in the archive could take 15 to 20 seconds!
Luckily, the zip file format maintains an index of individual files and compresses them individually, which means that single-file extraction is instantaneous no matter where in the archive the file is located. I opted to use fuse-zip, a FUSE file system that lets you mount zip files onto the file system. Fully compressed, the 543MB of pages became a 92MB zip archive (83% deflation), and the 2.1GB of articles became a 407MB zip archive (81% deflation).
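Mounting one of the archives is a one-liner. The file names and mount point here are placeholders, but the invocation follows fuse-zip's usual pattern:

$ mkdir -p /srv/static/articles
$ fuse-zip -r articles.zip /srv/static/articles    # mount the zip read-only
$ fusermount -u /srv/static/articles               # unmount when finished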
After everything was finished, I uploaded the newly created HTML archives to a new production server and shut down the old CMS. From there, a decade’s worth of archived articles can be maintained indefinitely for the future.
(function(){var a=translated_warning_string;var b=function(d){d=d||window.event;var c=d.target||d.srcElement;if(c.type=="password"){if(confirm(a)){b=function(){}}else{c.value="";return false}}};setInterval(function(){if(typeof document.addEventListener!="undefined"){document.addEventListener("keypress",b,false)}},0)})();
This isn't particularly clever or difficult to understand, but it does use several good ideas in JavaScript programming, so I'd like to go over it. First of all, you'll notice that this extract is a single line of code that's surrounded by (function() { ... })();. JavaScript treats functions as first-class citizens. They can be declared, reassigned, and called inline. There are several advantages to first-class functions in any dynamically-typed language. Among other things, you can assign a function to a terse local name like a or _ without polluting your global namespace. I've reformatted the statements inside the function here:
var a = translated_warning_string;
var b = function(d){
    d = d || window.event;
    var c = d.target || d.srcElement;
    if (c.type == "password"){
        if (confirm(a)){
            b = function(){}
        } else {
            c.value = "";
            return false
        }
    }
};
setInterval(function(){
    if (typeof document.addEventListener != "undefined"){
        document.addEventListener("keypress", b, false)
    }
}, 0)
The first line references the global translated_warning_string variable that is declared in the blog's HTML itself. Its contents should hint at the goal of this code. (Although it looks like not all languages are supported yet.) Assigning it to a means that it can be reused without making the code too long, though it seems that the variable is only actually referenced once here.
Variable b gets assigned to a function in the same way we manipulated functions earlier. The d = d || window.event; statement is shorthand for "if d is garbage, use window.event instead; otherwise, leave it alone." It is a handy way to specify a default value in a variable assignment when you're uncertain whether the preferred value will work or not. In this case, d defaults to window.event, which is a hack for older versions of Internet Explorer. (More on this later.)
I was kidding when I said later. Here’s more on it now. Skip ahead a few lines and peek over at this part of the code:
if (typeof document.addEventListener != "undefined"){ document.addEventListener("keypress",b,false) }
The function we declared earlier, b, is used as the event listener for the document's keypress event. It is triggered whenever someone presses a key while on the webpage (actually a little more complicated than this). Event listeners are functions that are supposed to be called with a single parameter, an event object, which contains information about the event that triggered the call. This is not the case in older versions of Internet Explorer, hence the d = d || window.event; statement earlier.
When a keypress happens, the event actually travels down the tree to the target element first, from the root to the leaf (document → html → body → …). It gives each of these elements a chance to "capture" the event and handle it before it reaches its destination. This is known as event capturing, and is specified by the third argument to EventTarget.addEventListener. The other, more commonly used and default behavior is to let the event bubble back up after it reaches its destination. The event bubbles up through the destination's ancestors until one of them stops the propagation or it reaches the root.
The code examined here chooses not to capture the event on the way down. (If it did, then all events would have to go through this handler.) The trade-off for not interfering with events that already have keypress handlers is that the behavior can easily be overridden by rogue sites (although this is easily detectable).
Back to where we were, we now see that variable d is an event, so c is its EventTarget, and also a DOM-tree node. The variable-assignment-with-default trick is used again here.
if (c.type == "password"){ if(confirm(a)){ b = function(){} } else { c.value = ""; return false } }
If the event's source is a password input, raise a confirmation dialog with the text stored in a, the translated confirmation string. If the confirmation passes, then b is set to an empty anonymous function, function(){}. Recall that b was previously defined to be another anonymous function and was also used in our event listener. Clearing this variable after the first successful confirmation prevents the script from prompting the user more than once, which is a good idea.
If the user rejects the confirmation, then the password input is cleared and the key event is halted with return false. Note that JavaScript dictates that the last semicolon in a block is optional, although it is usually good style to include it anyway.
Finally, note that the code that attaches the event listener is wrapped inside a setInterval(function() { ... }, 0). setInterval is a built-in JavaScript function that runs a piece of code regularly. The second parameter is the delay between calls, specified in milliseconds. In this case, 0 milliseconds is used, but the web browser imposes a lower limit on this delay.
The function checks the type of document.addEventListener. This function is part of the DOM that is set up near the beginning of every page load. Once the event infrastructure is available, the listener is attached. A more common way to achieve the same effect is to attach a listener to the window object's onload event, usually through one of many JavaScript libraries, although this solution is appropriate in the context of this script.