Archiving and compressing a dynamic web application

From 1999 to mid-2011, the Daily Cal used an in-house CMS to run the website. It contains around 65,000 individual articles and thousands of images. But ever since we moved to WordPress, the old system has been collecting dust in the corner. It was about time that all of the old CMS content was archived as static HTML so that it could be served indefinitely in the future as server software evolves. To accomplish this, I set up a Linux virtual machine with my trusty vagrant up utility on my spare home server.

Retrieving the application data

The CMS’s production server did not actually have enough free disk space to create a tarball copy of the application. But since there were on the order of 10,000 files involved, a recursive copy via scp would be too slow. In order to speed up the transfer process, I used a little GNU/Linux philosophy and piped a few things together:

$ screen
$ ssh root@domain "tar -cf - /srv/archive" | tar xvf -

I decided that compression was not going to be very helpful because most of the data was JPEG and PNG images, which are not very compressible. Enabling compression would just slow things down, since the bottleneck would become the CPU rather than the network.
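For reference, compression could have been layered onto the same pipeline, either at the ssh layer or by gzipping the tar stream itself. Both variants below are hypothetical alternatives rather than what I actually ran:

$ ssh -C root@domain "tar -cf - /srv/archive" | tar xvf -    # compress the ssh channel
$ ssh root@domain "tar -czf - /srv/archive" | tar xzvf -     # or gzip the tar stream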

Preparing the application

The CMS had very few dependencies, which is not surprising given the state of PHP 10 years ago. I set up a simple nginx+php-fpm+MySQL configuration with a single PHP worker thread. The crawl operation would be executed serially anyway, so multiple workers would not be useful.
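Capping php-fpm at a single worker is just a pool setting. On a Debian-style PHP 5 layout (the pool file path and service name below are assumptions), something like this would do it:

$ sudo sed -i -e 's/^pm = .*/pm = static/' \
              -e 's/^pm\.max_children = .*/pm.max_children = 1/' \
              /etc/php5/fpm/pool.d/www.conf
$ sudo service php5-fpm restart   # reload the single-worker pool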

I also added an entry to the VM’s hosts file for the hostname of the production server. The server’s hostname was hardcoded in a few places, and I didn’t want the crawl operation sending requests out to the Internet. Additionally, I set up a secondary web server configuration that served generated HTML from the output directory and static assets from the application data, so that I could preview the results as they were being generated.
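The hosts entry itself was a one-liner along these lines; the hostname matches the archive URLs used elsewhere in this post:

$ echo "127.0.0.1  archive.dailycal.org" | sudo tee -a /etc/hosts   # resolve the production hostname to the local VM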

Crawling the application

Generating the static pages was the hardest part of the archival process, and it took me around 5 retries to get it right. The primary purpose of the whole archival operation was to remove the dynamic portions of the web application. This meant that every conceivable page request had to be run against the application and its output saved as static HTML. I picked wget as my archival tool.

Articles in the CMS were stored in one of two places, but luckily the URL format was the same for both. I dumped article IDs from the MySQL database and created a seed file of article URLs:

$ echo "SELECT article_id FROM dailycal.article;" | mysql -N -u root > article_ids   # -N omits the column header row
$ echo "SELECT id FROM dailycal.h_articles;" | mysql -N -u root >> article_ids
$ sed -e 's/^/http:\/\/archive.dailycal.org\/article.php?id=/' -i article_ids

I didn’t see much point in setting a root password for the local MySQL installation, since this was a single-use VM anyway. Sed ate through the 65,326 article IDs in seconds. I then created a second seed file containing just the URL of the application root, from where (nearly) all other pages would be crawlable.
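The second seed file could have been nothing more than the root URL. The filename here matches the --input-file switch listed below, but which seed file went with which run is an assumption:

$ echo "http://archive.dailycal.org/" > source.txt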

On the crawler’s final run, I set the following command line switches:

  • --trust-server-names – Sets the output file name according to the last request in a redirection chain. By default, wget uses the first request.
  • --append-output=generator.log – Outputs progress information to a file, so that I can run the main process in screen and monitor it with tail in follow mode.
  • --input-file=source.txt – Specifies the seed file of URLs.
  • --timestamping – Sets the file modification time according to HTTP headers, if available.
  • --no-host-directories – Disables the creation of per-host directories.
  • --default-page=index.php – Defines the name for request paths that end in a slash.
  • --html-extension – Appends the .html file extension to all generated pages, even if the URL already ends in another extension (like .php).
  • --domains archive.dailycal.org – Restricts the crawl to only the application domain.

Additionally, I set the following switches to crawl through the links on the application’s root page. A sketch of the combined invocation follows the list.

  • -r – Enables recursive downloading.
  • --mirror – Sets infinite recursion depth and other useful options.
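
Put together, a single run with all of the switches above might have looked roughly like this; whether the article pass and the recursive pass actually shared one command line is an assumption:

$ wget --mirror --trust-server-names --append-output=generator.log \
       --input-file=source.txt --timestamping --no-host-directories \
       --default-page=index.php --html-extension \
       --domains archive.dailycal.org

(Since --mirror already implies -r and --timestamping, listing them separately is harmless.)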

In total, 543MB of non-article HTML and 2.1GB of article HTML were generated. These are reasonable sizes, given how many URLs were crawled in total, but they were still a bit unwieldy to store. I looked for a solution.

Serving from a compressed archive

I knew a couple of facts about the generated HTML:

  1. There was a lot of redundancy. The articles shared much of their header and footer markup.
  2. Virtually all of the data consisted of printable characters and whitespace, which means considerably less than 8 bits of unique information per byte.

Both of these factors made the HTML a good candidate for archive compression. My first thought was tar+gzip, but gzip compresses the tar stream as a whole rather than file by file. To extract a single file, you’d have to decompress all the data before it in the archive. A request for the last file in the archive could take 15 to 20 seconds!

Luckily, the zip file format maintains an index of individual files and compresses them individually, which means that single file extraction is instantaneous no matter where in the archive it is located. I opted to use fuse-zip, an extension for FUSE that lets you mount zip files onto the file system. Fully compressed, the 543MB of pages became a 92MB zip archive (83% deflation), and the 2.1GB of articles became a 407MB zip archive (81% deflation).
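In practice, the whole arrangement boils down to two steps: build the zip archives from the generated pages, then mount them somewhere the web server can read. A minimal sketch with hypothetical file and mount-point names:

$ zip -r -q articles.zip article-html/                      # build the archive
$ mkdir -p /srv/articles
$ fuse-zip -r -o allow_other articles.zip /srv/articles     # mount read-only; allow_other lets the web server user read it

The allow_other option needs user_allow_other enabled in /etc/fuse.conf unless the mount is performed as root.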

After everything was finished, I uploaded the newly created HTML archives to a new production server and shut down the old CMS. From there, a decade’s worth of archived articles can be maintained indefinitely for the future.

Setting up and testing two bridged wifi routers

The walls and microwaves of my house have always conspired to cripple the wifi signal in some rooms, especially the ones downstairs and the backyard. I recently got another wifi router to expand the range. They are daisy-chained from the modem with ethernet cables. My servers and printers are connected to the router in the middle, so it takes responsibility for forwarding ports for virtual services and static IP/MAC bindings. The router at the end of the chain is just there for range. I’m just going to quickly document how I set this up and tested it.

I set up the routers through their web interfaces over ethernet on my laptop. Here are some things to double check before you hook up the devices:

  1. The secondary router is set to receive its WAN configuration from DHCP. I tried a static configuration, but it refused to connect for reasons unknown.
  2. If you need to migrate settings (especially between routers of different models or brands), write down all of the configuration settings beforehand: forwarded services, IP/MAC bindings, DHCP and subnet ranges, QoS, static routing, and anything else you’re using.

After the devices are set up and hooked up in their proper positions, perform a quick AP scan with your wireless card:

$ sudo iwlist wlan0 scan
wlan0   Scan completed:
        Cell 01 - Address:  XX:XX:XX....
                  Channel:  ...

There should be two (or more) access point results that correspond to each of your routers. Configure your local wireless card to connect to each router in turn by specifying its MAC address (BSSID) in your OS’s configuration. Run diagnostics and make sure the connection is operational:

$ ip addr
... (an assigned address on the correct subnet) ..
$ curl ifconfig.me
... (your public IP) ..
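
If your machine runs NetworkManager, one way to pin the wireless card to a particular router’s radio is to pass the BSSID explicitly. The SSID, password, and MAC address below are placeholders:

$ nmcli device wifi connect "HomeNetwork" password "hunter2" bssid XX:XX:XX:XX:XX:XX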

That’s it. Now breathe easier knowing you can watch Netflix in the yard without suffering degraded streams. Hoorah.

Email Basics: Introduction to SMTP

Note: This is part 1 of a two-part segment on Email Basics. You can also read part 2 of this segment, which deals with the internal format of emails.

With all of the government spying scares circulating recently, many people have started taking another look at the companies and people responsible for their email. Like most technology today, email is easy to use but difficult to understand, and despite email’s widespread adoption, most people have only a vague notion of massive computers routing messages over some obscure protocol. It seems to be getting harder and harder to trust a third party with something as personal and sensitive as your email account, but a lot of this mistrust comes from a fundamental lack of understanding of how email works.

Email runs parallel to and independently of the World Wide Web, or what we’ve come to think of as the Internet. In reality, websites are only one of many different application-level protocols that make up the whole of Internet traffic. Email itself is made up of two separate protocols that work together but were designed to operate independently: SMTP, which describes how email is passed along to its destination, and RFC822, which describes the format and capabilities of email messages. Both are Internet Standards, published by the IETF (Internet Engineering Task Force), and anybody who wants to send or receive email follows them.

The SMTP protocol is defined in a document known as RFC821 (which is right before RFC822). I left RFC822 for part 2 of this blog series and will focus on describing SMTP here.

SMTP is responsible for ensuring delivery of a package. It couldn’t care less what that package contains or how it is structured, so long as it gets delivered. SMTP runs on top of TCP, which is also used for things like websites. TCP is able to distinguish between somebody requesting a website and somebody sending an email by assigning each of its applications one or more port numbers. SMTP has been assigned port 25.

The nifty thing about TCP is that it provides a lot of things for free and takes care of all the messy business. Some of these things include:

  • Sending messages of any size
  • Guaranteeing that everything you send will be delivered
  • Providing 2-way reliable communication without stuff getting messed up in the middle

For this reason, a lot of Internet stuff is built on top of TCP to take advantage of all the existing infrastructure that supports it. So what goes through TCP? Well, just about anything. You can send anonymous encouragement like so:

   You: $ nc g.co 80
   You: Hey you! You're the best!
Server: ...

You can receive messages back in the same way. In this example, there is a distinction between the client (You) and the server (the computer that receives your messages). However, TCP is completely symmetric once the initial connection is established: both you and the server can send data, receive data, and close the connection.

The SMTP Conversation

Package delivery through SMTP works like a conversation. This conversation takes place between a client, who has a package to deliver, and a server, who is receiving the package. Once the package is received by the server, the client can forget about it and the server assumes full responsibility for the package and its delivery. In this way, servers sometimes become clients themselves if they need to forward packages on to other servers.

Like any good conversation, SMTP begins with an introduction by both parties:

Server: 220 localhost Greetings
   You: HELO roger
Server: 250 localhost

In case you’ve got a bad memory, the server gives you its name twice: once when you open the connection, and again after you introduce yourself with a HELO. The names used here are special names known as hostnames. In real-world SMTP, these hostnames are usually fully-qualified domain names (FQDNs) and look like mail.rogerhub.com.

You’ve both said HELO. Now we get down to business. The usual thing to do at this point is to tell the server about the package you’re delivering. Of course, if you’re shy, you can just walk away.

   You: QUIT
Server: 221 Service closing transmission channel

To declare that you’ve got a package to deliver, the MAIL FROM command is used. The command is followed by a colon and then the sender’s email address.

   You: MAIL FROM:Finn the Human <finn@treehouse>
Server: 250 OK

Next, you declare the recipients of the package with RCPT TO. If there is more than one, which happens often, you declare each recipient one at a time.

   You: RCPT TO:Jake the Dog <jake@treehouse>
Server: 250 OK
   You: RCPT TO:BMO <bmo@treehouse>
Server: 250 OK

So far so good. We are not far from the end. Once the formalities are exchanged, you can begin delivering the package with the DATA command. After you’ve delivered the whole package, you tell the server that you’re done by sending a single period on its own line.

   You: DATA
Server: 354 Start mail input; end with <CRLF>.<CRLF>
   You: Jake,

        Left the house for a few hours. Will be back
        soon.

        Not joking,
        Finn
        .
Server: 250 OK

That’s it! At this point, you can say goodbye with QUIT or send another email. (These command responses are straight from a demo SMTP server implementation on my GitHub.)

But this can’t be all there is! Is there more to SMTP? You bet there is!

In fact, SMTP and RFC822 have both seen many revisions over the years. One of the major changes is in the initial greeting. Instead of HELO, an alternate greeting EHLO, or extended HELO, was proposed. When you greet an EHLO-aware server with EHLO, it will send you an extended SMTP (or ESMTP) response like so:

Server: 220 localhost Greetings
   You: EHLO roger
Server: 250-localhost
Server: 250-PIPELINING
Server: 250 SIZE 10000000

The server sends a list of SMTP extensions that it supports. These include things like STARTTLS, which provides a way to upgrade a plaintext connection (like all the transmissions seen here) to an anonymous, encrypted TLS session, which is more resilient to snooping. Both the client and the server have to support an extension before you can use it. Extensions open up the path to some extra-cool stuff like authentication and UTF-8 support.
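
If you want to poke at STARTTLS by hand, openssl’s s_client can perform the plaintext-to-TLS upgrade for you; the hostname below is a placeholder. Once the handshake completes, you can continue the same SMTP conversation over the encrypted channel.

$ openssl s_client -starttls smtp -connect mx.example.com:25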

You can try out this whole process by sending an email to a friend, just like a real mail server would. The first step is to find the mail server to contact, which takes a DNS lookup. The name you query is the domain of the email address (e.g. gmail.com), and the query type is MX, which stands for Mail Exchanger.

$ nslookup -q=mx gmail.com
...
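
If you have dig installed, the equivalent query is:

$ dig +short mx gmail.com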

After that, you just need to open a TCP connection to one of the listed Mail Exchangers. You can do this with either netcat or telnet, whose availability depends on your operating system.

$ nc example.com 25       # netcat
$ telnet example.com 25   # telnet

Fire away! (But don’t be surprised if your phony emails are discarded as spam... Haha)
