Code.RogerHub » servers
The programming blog at RogerHub

MongoDB power usage on laptop (Fri, 27 Mar 2015)
https://rogerhub.com/~r/code.rogerhub/infrastructure/558/mongodb-power-usage-on-laptop/

I’m setting up a Chef-based Ubuntu VM on my laptop, so I can get sandboxing and all the usual Ubuntu tools. Most things work on OS X too, but it’s a second-class citizen compared to poster child Ubuntu. I have Mongo and a few services set up there, for school stuff, and I noticed that Mongo regularly uses a significant amount of CPU time, even when it’s not handling any requests.

I attached strace to it with strace -f -p 1056, and here’s what I saw:

[pid 1118] select(12, [11], NULL, NULL, {0, 10000} <unfinished ...>
[pid 1056] select(11, [9 10], NULL, NULL, {0, 10000} <unfinished ...>
[pid 1118] <... select resumed> ) = 0 (Timeout)
[pid 1056] <... select resumed> ) = 0 (Timeout)
[pid 1118] select(12, [11], NULL, NULL, {0, 10000} <unfinished ...>
[pid 1056] select(11, [9 10], NULL, NULL, {0, 10000} <unfinished ...>
[pid 1112] <... nanosleep resumed> 0x7fd3895319a0) = 0
[pid 1112] nanosleep({0, 34000000}, <unfinished ...>
[pid 1118] <... select resumed> ) = 0 (Timeout)
[pid 1056] <... select resumed> ) = 0 (Timeout)
[pid 1118] select(12, [11], NULL, NULL, {0, 10000} <unfinished ...>
[pid 1056] select(11, [9 10], NULL, NULL, {0, 10000} <unfinished ...>
[pid 1118] <... select resumed> ) = 0 (Timeout)
[pid 1056] <... select resumed> ) = 0 (Timeout)
[pid 1118] select(12, [11], NULL, NULL, {0, 10000} <unfinished ...>
[pid 1056] select(11, [9 10], NULL, NULL, {0, 10000} <unfinished ...>
[pid 1118] <... select resumed> ) = 0 (Timeout)
[pid 1056] <... select resumed> ) = 0 (Timeout)
[pid 1118] select(12, [11], NULL, NULL, {0, 10000} <unfinished ...>
[pid 1056] select(11, [9 10], NULL, NULL, {0, 10000} <unfinished ...>
[pid 1112] <... nanosleep resumed> 0x7fd3895319a0) = 0

I looked online for an explanation, and I found this bug about MongoDB power usage. Apparently, it uses select() as a way to do timekeeping for all the ways Mongo uses wall time. Somebody proposed that alternative methods be used instead, which don’t require this kind of tight looping, but it looks like the Mongo team is busy with other stuff right now and won’t consider the patch.

So I added a clause in my Chef config to disable the MongoDB service from starting automatically on boot. I needed to specify the Upstart service provider explicitly in the configuration, since the mongodb package installs both a SysV init script and an Upstart job.
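
A minimal sketch of what such a clause can look like in a Chef recipe (the service name matches the Ubuntu mongodb package; the rest is an assumption about your cookbook, not my exact configuration):

# Stop mongod and keep it from starting at boot.
# Naming the Upstart provider explicitly keeps Chef from picking the SysV script.
service "mongodb" do
  provider Chef::Provider::Service::Upstart
  action [:stop, :disable]
end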

Taking advantage of cloud VM-driven development (Thu, 10 Jul 2014)
https://rogerhub.com/~r/code.rogerhub/infrastructure/538/taking-advantage-of-cloud-vm-driven-development/

Most people write about cloud computing as it relates to their service infrastructure. It’s exciting to hear about how Netflix and Dropbox et al. use AWS to support their operations, but all of those large-scale ideas don’t really mean much for the average developer. Most people don’t have the budget or the need for enormous computing power or highly-available SaaS data storage like S3, but it turns out that cloud-based VM’s can be highly useful for the average developer in a different way.

Sometimes, you see something like this one-liner installation script for Heroku Toolbelt, and you just get nervous:

wget -qO- https://toolbelt.heroku.com/install-ubuntu.sh | sh

Not only are they asking you to run a shell script downloaded over the Internet, but the script also asks for root privileges to install packages and stuff. Or, maybe you’re reading a blog post about some HA database cluster software and you want to try it out yourself, but running 3 virtual machines on your puny laptop is out of the question.

To get around this issue, I’ve been using DigitalOcean machines for when I want to test something out but don’t want to go to the trouble of maintaining a development server or virtual machines. Virtualized cloud servers are great for this because:

  • They’re DIRT CHEAP. DO’s smallest machine costs $0.007 an hour. Even if you use it for 2 hours, it rounds down to 1 cent.
  • The internet connection is usually a lot better than whatever you’re using. Plus, most cloud providers have local mirrors for package manager stuff, which makes installing packages super fast.
  • Burstable CPU means that you can get an unfair amount of processing power for a short time at the beginning, which comes in handy for initially installing and downloading all the stuff you’ll want to have on your machine.

I use the tugboat client (a CLI ruby app) to interface with the DigitalOcean API. To try out MariaDB Galera clustering, I just opened up three terminals and had three SSH sessions going on. For source builds that have a dozen or more miscellaneous dependencies, I usually just prepare a simple build script that I can upload and run on a Cloud VM whenever I need it. When I’m done with a machine, I’ll shut it down until I need a machine again a few days later.
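
A rough sketch of that tugboat workflow (the droplet name is made up, and the exact flags for sizing and images vary between tugboat versions, so check its help output):

$ gem install tugboat
$ tugboat authorize        # paste in your DigitalOcean API credentials
$ tugboat create scratch   # boot a new droplet named "scratch"
$ tugboat ssh scratch      # open a shell on it
$ tugboat halt scratch     # shut it down when you're done for now
$ tugboat destroy scratch  # tear it down completely when you're finished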

Running development virtual machines in the cloud might not drastically change your workflow, but it opens up a lot of opportunities for experimentation and gives you access to massive computing resources whenever you want them. So load up a couple dollars onto a fresh DigitalOcean account and boot up some VMs!

Archiving and compressing a dynamic web application (Sun, 01 Sep 2013)
https://rogerhub.com/~r/code.rogerhub/programming/508/archiving-and-compressing-a-dynamic-web-application/

From 1999 to mid-2011, the Daily Cal used an in-house CMS to run the website. It contains around 65,000 individual articles and thousands of images. But ever since we moved to WordPress, the old system has been collecting dust in the corner. It was about time that all of the old CMS content was archived as static HTML so that it could be served indefinitely in the future as server software evolves. To accomplish this, I set up a linux virtual machine with my trusty vagrant up utility on my spare home server.

Retrieving the application data

The CMS’s production server did not actually have enough free disk space to create a tarball copy of the application. But since there were on the order of 10,000 files involved, a recursive copy via scp would be too slow. In order to speed up the transfer process, I used a little gnu/linux philosophy and piped a few things together:

$ screen
$ ssh root@domain "tar -c /srv/archive" | tar xvf -

I decided that compression was not going to be very helpful because most of the data was jpeg and png images, which are not very compressible. Enabling compression would just slow things down, since the bottleneck would become the CPU rather than the network.

Preparing the application

The CMS had very few dependencies, which is not surprising given the state of PHP 10 years ago. I set up a simple nginx+php-fpm+mysql configuration with a single PHP worker thread. The crawl operation would be executed in serial anyway, so multiple workers would not be useful.

I also added an entry to the VM’s hosts file for the hostname of the production server. The server’s hostname was hardcoded in a few places, and I didn’t want the crawl operation sending requests out to the Internet. Additionally, I set up a secondary web server configuration that served generated HTML from the output directory and static assets from the application data, so that I could preview the results as they were being generated.

Crawling the application

Generating the static pages was the hardest part of the archival process, and it took me around 5 retries to get it right. The primary purpose of the whole archival operation was to remove the dynamic portions of the web application. This meant that every single conceivable page request had to be run against the application and its output saved as static HTML. I picked wget as my archival tool.

Articles in the CMS were stored in one of two places. However, the format of the URL was luckily the same for both. I dumped article IDs from the MySQL database and created a seed file of article URLs:

$ echo "SELECT article_id FROM dailycal.article;" | mysql -u root > article_ids
$ echo "SELECT id FROM dailycal.h_articles;" | mysql -u root  >> article_ids
$ sed -e 's/^/http:\/\/archive.dailycal.org\/article.php?id=/' -i article_ids

I didn’t see much point in setting a root password for the local MySQL installation, since this was a single-use VM anyway. Sed ate through the 65,326 article IDs in seconds. I then created a second seed file containing just the URL of the application root, from where (nearly) all other pages would be crawlable.

On the crawler’s final run, I set the following command line switches:

  • --trust-server-names – Sets the output file name according to the last request in a redirection chain. By default, wget uses the first request.
  • --append-output=generator.log – Outputs progress information to a file, so that I can run the main process in screen and monitor it with tail in follow mode.
  • --input-file=source.txt – Specifies the seed file of URLs.
  • --timestamping – Sets the file modification time according to HTTP headers, if available.
  • --no-host-directories – Disables the creation of per-host directories.
  • --default-page=index.php – Defines the name for request paths that end in a slash.
  • --html-extension – Appends the html file extension to all generated pages, even if another extension already exists.
  • --domains archive.dailycal.org – Restricts the crawl to only the application domain.

Additionally, I set the following switches to crawl through the links on the application’s root page. (The assembled command is sketched after this list.)

  • -r – Enables recursive downloading.
  • --mirror – Sets infinite recursion depth and other useful options.
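
Assembled into a single invocation, the final crawl looked roughly like this (a reconstruction from the switches above, not a copy of the exact command I ran):

$ wget --trust-server-names --append-output=generator.log \
       --input-file=source.txt --timestamping --no-host-directories \
       --default-page=index.php --html-extension \
       --domains archive.dailycal.org -r --mirror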

In total, 543MB of non-article HTML and 2.1GB of article HTML were generated. These are reasonable sizes, given how many URLs were crawled in total, but they were still a bit unwieldy to store. I looked for a solution.

Serving from a compressed archive

I knew a couple of data facts about the generated HTML:

  1. There was tons of redundancy. The articles shared much of their header and footer markup.
  2. Virtually all of the data consisted of printable characters and whitespace, which means considerably less unique information than 8 bits per byte.

Both of these factors made the HTML a good candidate for archive compression. My first thought was tar+gzip, but gzip compresses the archive as a single stream, not file by file. To extract a single file, you’d need to decompress all of the data that comes before it. A request for the last file in the archive could take 15 to 20 seconds!

Luckily, the zip file format maintains an index of individual files and compresses them individually, which means that single file extraction is instantaneous no matter where in the archive it is located. I opted to use fuse-zip, an extension for FUSE that lets you mount zip files onto the file system. Fully compressed, the 543MB of pages became a 92MB zip archive (83% deflation), and the 2.1GB of articles became a 407MB zip archive (81% deflation).
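
Mounting and unmounting the archive is a one-liner each (the file and mount point names here are placeholders):

$ sudo apt-get install fuse-zip
$ mkdir -p /srv/archive-html
$ fuse-zip articles.zip /srv/archive-html   # mount the compressed articles
$ fusermount -u /srv/archive-html           # unmount when finished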

After everything was finished, I uploaded the newly created HTML archives to a new production server and shut down the old CMS. From there, a decade’s worth of archived articles can be maintained indefinitely for the future.

Self-contained build environments with Vagrant (Tue, 20 Aug 2013)
https://rogerhub.com/~r/code.rogerhub/infrastructure/504/self-contained-build-environments-with-vagrant/

Vagrant is a nifty piece of Ruby software that lets you set up Virtual Machines with an unparalleled amount of automation. It interfaces with a VM provider like VirtualBox, and helps you set up and tear down VM’s as you need them. I like it better than Juju because there isn’t as much hand-holding involved, and I like it better than vanilla Puppet because I don’t regularly deploy a thousand VM’s at a time. At the Daily Cal, I’ve used Vagrant to help developers set up their own build environments for our software where they can write code and test features in isolation. I also use it as a general-purpose VM manager on my home file server, so I can build and test server software in a sandbox.

You can run Vagrant on your laptop, but I think that it’s the wrong piece of hardware for the job. Long-running VM batch jobs and build environment VM’s should be run by headless servers, where you don’t have to worry about excessive heat, power consumption, huge amounts of I/O, and keeping your laptop on so it doesn’t suspend. My server at home is set up with:

  • 1000GB of disk space backed by RAID1
  • Loud 3000RPM fans I bought off a Chinese guy four years ago
  • Repurposed consumer-grade CPU and memory (4GB) from an old desktop PC

You don’t need great hardware to run a couple of Linux VM’s. Since my server is basically hidden in the corner, the noisy fans are not a problem and actually do a great job of keeping everything cool under load. RAID mirroring (I’m hoping) will provide high availability, and since the server’s data is easily replaceable, I don’t need to worry about backups. Setting up your own server is usually cheaper than persistent storage on public clouds like AWS, but your mileage may vary.

Vagrant configuration is a single Ruby file named Vagrantfile in the working directory of your vagrant process. My basic Vagrantfile just sets up a virtual machine with Vagrant’s preconfigured Ubuntu 12.04LTS image. They offer other preconfigured images, but this is what I’m most familiar with.

# Vagrantfile

Vagrant.configure("2") do |config|
  # Every Vagrant virtual environment requires a box to build off of.
  config.vm.box = "precise32"

  # The url from where the 'config.vm.box' box will be fetched if it
  # doesn't already exist on the user's system.
  config.vm.box_url = "http://files.vagrantup.com/precise32.box"

  config.vm.network :forwarded_port, guest: 8080, host: 8080

  # Enable public network access from the VM. This is required so that
  # the machine can access the Internet and download required packages.
  config.vm.network :public_network

end

For long-running batch jobs, I like keeping a CPU Execution Cap on my VM’s so that they don’t overwork the system. The cap keeps the temperature down and prevents the VM from interfering with other server processes. You can add an execution cap (for VirtualBox only) by appending the following before the end of your primary configuration block:

# Adding a CPU Execution Cap

Vagrant.configure("2") do |config|
  ...

  config.vm.provider "virtualbox" do |v|
    v.customize ["modifyvm", :id, "--cpuexecutioncap", "40"]
  end

end

After setting up Vagrant’s configuration, create a new directory containing only the Vagrantfile and run vagrant up to set up the VM. Other useful commands include:

  • vagrant ssh — Opens a shell session to the VM
  • vagrant halt — Halts the VM gracefully (Vagrant will connect via SSH)
  • vagrant status — Checks the current status of the VM
  • vagrant destroy — Destroys the VM

Finally, to set up the build environment automatically every time you create a new Vagrant VM, you can write provisioners. Vagrant supports complex provisioning frameworks like Puppet and Chef, but you can also write a provisioner that’s just a shell script. To do so, add the following inside your Vagrantfile:

Vagrant.configure("2") do |config|
  ...

  config.vm.provision :shell, :path => "bootstrap.sh"
end

Then just stick your provisioner next to your Vagrantfile, and it will execute every time you start your VM. You can write commands to fetch package lists and upgrade system software, or to install build dependencies and check out source code. By default, Vagrant’s current working directory is also mounted on the VM guest as a folder named vagrant in the file system root. You can refer to other provisioner dependencies this way.
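
A provisioner for a simple build environment might be as short as this sketch (the package list is just an example):

#!/bin/sh
# bootstrap.sh -- runs inside the guest with root privileges during provisioning
apt-get update
apt-get install -y build-essential git vim
# The host's working directory is shared at /vagrant, so files kept next to
# the Vagrantfile (build scripts, source checkouts) are already visible here.
ls /vagrant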

Vagrant uses a single public/private keypair for all of its default images. The private key can usually be found in your home directory as ~/.vagrant.d/insecure_private_key. You can add it to your ssh-agent and open your own SSH connections to your VM without Vagrant’s help.
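
For example (port 2222 is the forwarded SSH port that VirtualBox machines usually get, but vagrant ssh-config will tell you for sure):

$ ssh-add ~/.vagrant.d/insecure_private_key
$ ssh -p 2222 vagrant@127.0.0.1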

Even if you accidentally mess up your Vagrant configuration, you can use VirtualBox’s built-in command-line tools to fix boot configuration issues or ssh daemon issues.

$ VBoxManage list vms
...
$ VBoxManage controlvm <name|uuid> pause|reset|poweroff|etc
...
$ VBoxHeadless -startvm <name|uuid> --vnc
...
(connect via VNC)

The great thing about Vagrant’s VM-provider abstraction layer is that you can grab the VM images from VirtualBox and boot them on another server with VirtualBox installed, without involving Vagrant at all. Vagrant is an excellent support tool for programmers (and combined with SSH tunneling, it is great for web developers as well). If you don’t already have support from some sort of VM infrastructure, you should look into the possibilities with Vagrant.

Setting up and testing two bridged wifi routers (Sun, 11 Aug 2013)
https://rogerhub.com/~r/code.rogerhub/infrastructure/481/setting-up-and-testing-two-bridged-wifi-routers/

The walls and microwaves of my house have always conspired to cripple the wifi signal in some rooms, especially the ones downstairs and the backyard. I recently got another wifi router to expand the range. They are daisy-chained from the modem with ethernet cables. My servers and printers are connected to the router in the middle, so it takes responsibility for forwarding ports for virtual services and static IP/MAC bindings. The router at the end of the chain is just there for range. I’m just going to quickly document how I set this up and tested it.

I set up the routers through their web interfaces over ethernet on my laptop. Here are some things to double check before you hook up the devices:

  1. The secondary router is set to receive its WAN configuration from DHCP. I tried a static configuration, but it refused to connect for reasons unknown.
  2. If you need to migrate settings (especially between routers of different models/brands), take down all the configuration settings beforehand, including forwarded services, IP/MAC bindings, DHCP and subnet ranges, QoS, and static routing, if you’re using them.

After the devices are set up and hooked up in their proper positions, perform a quick AP scan with your wireless card:

$ sudo iwlist wlan0 scan
wlan0   Scan completed:
        Cell 01 - Address:  XX:XX:XX....
                  Channel:  ...

There should be 2 (or more) access point results that correspond to each of your routers. Configure your local wireless card to connect to each router in turn by specifying its MAC address in your OS’s configuration. Run diagnostics and make sure the connection is operational:

$ ip addr
... (an assigned address on the correct subnet) ..
$ curl ifconfig.me
... (your public IP) ..
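
If your system uses wpa_supplicant, pinning a specific router by its MAC address looks something like this (the SSID, passphrase, and BSSID are placeholders):

# /etc/wpa_supplicant/wpa_supplicant.conf (excerpt)
network={
    ssid="myhomenetwork"
    psk="correct horse battery staple"
    bssid=XX:XX:XX:XX:XX:XX   # lock this network block to one specific AP
}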

That’s it. Now breathe easier knowing you can watch Netflix in the yard without suffering degraded streams. Hoorah.

Signing your own wildcard SSL/HTTPS certificates (Tue, 06 Aug 2013)
https://rogerhub.com/~r/code.rogerhub/infrastructure/474/signing-your-own-wildcard-sslhttps-certificates/

There has been a lot of concern about online privacy in the past few weeks, and lots of people are looking for ways to better protect themselves on the Internet. One thing you can do is to create your own HTTPS/SSL Certificate Authority. I have a bunch of websites on RogerHub that I want to protect, but I am the only person who needs HTTPS access, since I manage all of my own websites. So, I’ve been using a self-signed wildcard certificate that’s built into my web browser to access my websites securely. You can do this too with a few simple steps:

First, you will need to generate a cryptographic private key for your certificate authority (CA):

$ openssl genrsa -out rootCA.key 2048

Certificate authorities in HTTPS have a private key and a matching public CA certificate. You should store this private key in encrypted storage, because you will need it again if you ever want to generate more certificates. Next, you will need to create a public CA certificate and provide some information about your new CA. If you are the sole user of your new CA, then this information can be set to whatever you want:

$ openssl req -x509 -new -nodes -key rootCA.key -days 9999 -out rootCA.pem

Let me explain a bit about these two commands. The openssl command line utility takes a subcommand as its first argument. The subcommands genrsa and req are used to create RSA keys and to manipulate certificate signing requests, respectively. Normally when you go out and purchase an expensive SSL certificate, you will generate a key yourself and create a Certificate Signing Request (CSR), which is given to your certificate authority to approve. The CA then sends back a public SSL certificate that is presented to website clients. Generating the key yourself means that the private key never passes through the hands of your CA, so only you have the ability to authenticate using your SSL certificate.

The x509 switch in the argument above signifies that you are trying to create a new Certificate Authority, not a Certificate Signing Request. The nodes switch actually means no DES, which means that the resulting private key will not be encrypted with a passphrase. In this case, encrypting the key is not strictly necessary if you are the only party involved.

You have just created a new Certificate Authority key and certificate pair. At this point, you can import the rootCA.pem certificate into your web browser and instruct it to trust it for identifying websites. From then on, your browser will accept any website certificate that is signed by your new CA. Now you’re ready to make some website certificates. Create a private key and Certificate Signing Request for your new website as follows:

$ openssl genrsa -out SITE.key 2048

....

$ openssl req -new -key SITE.key -out SITE.csr

....

> Common Name ... : (put the name of your domain here)

Remember to put the domain name of your website as the common name when it asks you for it. Once you’ve completed the certificate signing request, you need to use your new Certificate Authority to issue a certificate for this new key:

$ openssl x509 -req -days 9999 -in SITE.csr -CA rootCA.pem -CAkey rootCA.key -CAcreateserial -out SITE.crt

You’ve created a functional SSL certificate and keypair for your website! You can configure your web server to use your new site key and certificate, and it will begin serving resources over HTTPS. However, this certificate will only work for the domain that you specified before in your CSR. If you want to create a single certificate that will work for multiple domains (called a wildcard certificate or a multi-domain certificate), you will need some more steps. Create a file named SITE.cnf, and put the following inside:

[req_distinguished_name]
countryName = Country Name (2 letter code)
stateOrProvinceName = State or Province Name (full name)
localityName = Locality Name (eg, city)
organizationalUnitName	= Organizational Unit Name (eg, section)
commonName = Common Name (eg, YOUR name)
commonName_max	= 64
emailAddress = Email Address
emailAddress_max = 40

[req]
distinguished_name = req_distinguished_name
req_extensions = v3_req

[v3_req] 
keyUsage = keyEncipherment, dataEncipherment
extendedKeyUsage = serverAuth
subjectAltName = @alt_names

[alt_names]
DNS.1 = rogerhub.com
DNS.2 = *.rogerhub.com

Under the last block, you can insert as many domains with wildcards as you want. To do this, use the following command to generate your CSR:

$ openssl req -new -key SITE.key -out SITE.csr -config SITE.cnf

Now, run the following to generate your wildcard HTTPS certificate, instead of the last command above:

$ openssl x509 -req -days 9999 -in SITE.csr -CA rootCA.pem -CAkey rootCA.key -CAcreateserial -extensions v3_req -out SITE.crt -extfile SITE.cnf

The v3_req block above is an X.509 extension section that allows a certificate to work for more than one website. One of the flags in the certificate creation command is CAcreateserial. It will create a new file named rootCA.srl whose contents are updated every time you sign a certificate. You can use CAcreateserial the first time you sign a website certificate, but thereafter, you will need to provide that serial file when you sign more certificates. Do this by replacing -CAcreateserial with -CAserial rootCA.srl in the final command. A lot of the concepts here only hint at the greater complexity of HTTPS, openssl, and cryptography in general. You can learn more by reading the relevant RFCs and the openssl man pages.
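
To actually serve a site with the new certificate, an nginx configuration would contain something along these lines (the file paths are placeholders):

# /etc/nginx/sites-enabled/rogerhub (excerpt)
server {
    listen 443 ssl;
    server_name rogerhub.com *.rogerhub.com;
    ssl_certificate     /etc/ssl/rogerhub/SITE.crt;
    ssl_certificate_key /etc/ssl/rogerhub/SITE.key;
    root /srv/www/rogerhub;
}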

Backing up my data as a linux user (Tue, 16 Jul 2013)
https://rogerhub.com/~r/code.rogerhub/infrastructure/415/backing-up-my-data-as-a-linux-user/

It’s a good habit to routinely back up your important data, and over the past few years, dozens of cloud storage/backup solutions have sprung up, many of which offer a good deal of free space. Before you even start looking for a backup solution, you need to sit down and think about what kind of data you’re looking to back up. Specifically, how much data you have, how often it changes, and how desperately you need to keep it safe.

I have a few kinds of data that I actively back up and check for integrity. (Don’t forget to verify your backups, or there isn’t any point in backing them up at all.) Here are all the kinds of data that might be on your computer:

  • Program code – irreplaceable, small size (~100MB), frequently updated
  • Documents, and personal configuration files – irreplaceable, small size (~100MB), regularly updated
  • Personal photos – mostly irreplaceable, large size (more than 10GB), append only
  • Server configuration and data – mostly irreplaceable, medium size (~1GB), regularly updated
  • Collected Media – replaceable if needed, medium size (~1GB), append only
  • System files – easily replaceable, medium size (~1GB), sometimes updated

Several backup solutions try to back up everything. This is not a good idea. First, there are a lot of files on your computer that are easily replaceable (system files) and others that you’d rather not keep in your backup archives (program files). Second, those solutions have no way of giving extra redundancy to the things that matter most, and less redundancy to things that matter less.

In addition to these files, here are some types of data that you might not usually think about backing up:

  • Email
  • RSS and Calendar data
  • Blog content
  • Social networking content

My backup solution is a mix consisting of free online version control sites, Google, Dropbox, and a personal file server. My code, documents (essays, forms, receipts), and configuration (bash, vim, keys, personal CA, wiki, profile pictures, etc..) are the most important part of my backup. I sync these with Insync to my Google Drive, where I’ve rented 100GB of cloud storage. My Google Drive is regularly backed up to my personal file server, with about 2 weeks of retention.
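
A minimal sketch of what a nightly backup job with about two weeks of retention can look like (the paths are placeholders, not my actual layout):

#!/bin/sh
# Snapshot the synced Google Drive folder into a dated directory.
SRC="$HOME/google-drive/"
DEST="/srv/backups/google-drive"
mkdir -p "$DEST"
rsync -a --delete "$SRC" "$DEST/$(date +%F)/"
# Drop dated snapshots older than 14 days.
find "$DEST" -mindepth 1 -maxdepth 1 -type d -mtime +14 -exec rm -rf {} +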

Disks and old computers are cheap. Get a high-availability file server set up in your home, and you can happily offload intensive tasks to it like virtual machines, backup services, and archival storage. Mine is configured with:

  • Two 1TB hard drives configured in RAID-1 mirroring
  • Excessive amounts of ram and processing power, for a headless server
  • Ubuntu Server installed with SMART monitoring, nagios, nginx (for some web services), a torrent client
  • Wake-on-lan capabilities

Backing up your Google Drive might sound funny to you, but it is a good precaution in case anything ever happens to your Google Account. Additionally, most of my program code is in either a public GitHub repository or a private BitBucket repository. Version control and social coding features like issues/pull requests give you additional benefits beyond simply backing up your code, and you should definitely be using some kind of VCS for any code you write.

For many of the projects that I am actively developing, I only use VCS and my file server. Git object data should not be backed up to cloud storage services like Google Drive because it changes too often. My vim configuration is also stored on GitHub, to take advantage of git submodules for my vim plugins.

My personal photos are stored in Google+ Photos, formerly known as Picasa. They give you 15GB of shared storage for free, and if that’s not enough, additional space is cheap as dirt. My photos don’t have another level of redundancy like my code and configuration files do. They are less important to me, and Google can be trusted to sustain itself longer than any backup solution you create yourself.

I host a single VPS with Linode (that’s an affiliate link) that contains a good amount of irreplaceable data from my blogs and other services I host on it. Linode itself offers cheap and easy full-disk backups ($5/mo.) that I signed up for. Those backups aren’t intended for hardware failures so much as human error, because Linode already maintains high-availability redundant disk storage for all of its VPS nodes. Additionally, I backup the important parts of the server to my personal file server (/etc, /home, /srv, /var/log), for an extra level of redundancy.

Any pictures I collect from online news aggregators are dumped in my Google Drive and share the same extra redundancy as my documents and personal configuration files. Larger media like videos are stored on one of my USB 3.0 flash drives, since they are regularly created and deleted.

I don’t back up system files, since Xubuntu is free and programs are only 1 package-manager command away. I don’t maintain extra redundancy for email for the same reason I don’t for photos.

A final thing to consider is the confidentiality of your backups. Whenever you upload data to a free public cloud storage service, you should treat the data as if it were being anonymously released to the public. In other words, personal data, cryptographic keys, and passwords should never be uploaded unencrypted to a public backup service. Things like PGP can help in this regard.

Online advertising and why adblock is not the solution (Thu, 11 Jul 2013)
https://rogerhub.com/~r/code.rogerhub/argument/331/online-advertising-and-why-adblock-is-not-the-solution/

There has been some pretty shocking stuff happening in the realm of online advertising recently. By recently, I mean anywhere from just this week to a year and a half ago. Let’s revisit some of these events:

A brief recap of this year in advertising

We saw the introduction of the Do Not Track HTTP Header, which at the time of writing hasn’t yet moved from its initial draft state. The draft proposes a new header, sent with every request, that would enable a user to opt-out of all Internet tracking. If adopted, it would greatly hinder the effectiveness of targeted ads. Not long after the draft was announced, Microsoft announced that Internet Explorer 10 would send the DNT header by default as part of their privacy-oriented image. This led to a two-week ad-tracking arms race between IE 10 and various websites and website software, most notably Apache’s web server, which chose to specifically ignore the header when it came from IE 10 clients. The draft was soon revised to make this behavior nonconforming.

In an effort to encourage responsible advertising, Adblock Plus 2.0 introduced the Acceptable Ads system that allowed certain non-obtrusive advertising through its filter. The list started with a motion to whitelist ads on reddit. A minority of users are concerned about commercial influence within the project.

Then, Mozilla announced that Firefox 22+ would block third-party cookies by default, another strike against advertisers. More specifically, the ban applies to domains that haven’t already established cookies (instead of a ban blocking all third-party cookies). The cookie ban would cripple many ad-targeting techniques that rely on third-party tracking of user behavior.

Finally, it was revealed just last week that Google allegedly sponsors Adblock Plus’s acceptable ads program in order to get some of their own text ads whitelisted, raising concern about market monopolization.

The controversy

Advertising has always been a controversial part of the Internet. Modern internet architects are mostly science and engineering geeks and harbor an innate predisposition against commercial uses for the world wide web, in very much the same way the astronomers of the 1960’s were disgusted by the thought of a militarized space program.

The reverse is also true: corporate executives and run-of-the-mill internet users today have no idea how much insane power their server administrators hold. Essentially, a lot of miscommunication happens between those who have the power to make decisions and those who have the power to carry them out.

A quick note about adblock and ads

I will use the term adblock in the rest of this post to refer to any one of the several advertisement-blocking browser plugins available. I will use ads and advertising to refer to graphical and text ads (banner ads, skyscrapers, rectangles) as well as video ads, and by extension, certain kinds of advertising on physical media.

Adblock illustrates the tragedy of the commons. Most people are aware that if everybody used adblock, parts of the internet would die off from lack of funding. Nonetheless, many people block ads anyway. The option is tempting, and a single extra user with adblock will likely not make a difference.

I’d like a chance to convince you not to use adblock.

I won’t make any pleas about content publishers deserving money or anything, because those are honestly not good arguments. I want to tell you about a few reasons why ads are not all bad and why you might not want to block ads.

1. Ads pay for the Internet

Everybody loves free online websites, but if Google didn’t make tens of billions a year in online advertising revenue, it would have no way to run its services, pay its employees, and support all of its public projects. Sure, you might agree to a $10/month fee to help out some of your favorite sites, but there are a few reasons why this wouldn’t work in the long run:

Zero-cost websites provide a ton of social, business, and infrastructural tools that a majority of the population in developed nations have learned to use. Free internet services like those provided by Google are responsible for an enormous part of our country’s productivity and GDP (as well as those of other countries, both developed and undeveloped), and it is because they are free.

The free-to-use nature of these services encourages adoption, which is far more important than immediate profit because suddenly, your average workforce employee knows how to use corporate-grade calendar, email, and spreadsheet software and depends on them as basic communication and organizational tools. These tools not only augment their users’ human capital, but they also created the demand for internet access that fueled the construction of the global internet infrastructure we have today.

Simply put, advertisements are where the internet draws the money to sustain itself and perpetuate this whole cycle of economic development. And because this free-with-ads online publishing model has proven itself to work so well, we now have free photo-sharing sites, news sites that publish exclusively online, and free blogging platforms. Every time you block ads, these are the guys you’re hurting.

You aren’t, of course, obligated to look at ads in order to support these websites, but it’s a small price to pay for all the free stuff you get in return. If you value a free service, you should not block their ads. If you want to support a content publisher, whether it’s a YouTube vlogger or the admins of your regular news aggregator, you should not block ads.

2. Servers aren’t cheap

One thing people usually cite when asked about why they support adblock is the greediness of publishers and advertisers. Most people don’t understand why websites, which are virtual and intangible, cost so much to run in comparison to tangible goods. A large part of this misconception is due to the relative inexpensiveness of commodity consumer hardware.

The hardware that powers your $600 Costco desktop computer is not the same stuff that runs in most datacenters. You might have an exorbitant 16 gigabytes of RAM and a terabyte of disk space, but these devices are consumer-grade devices and would, in most cases, be inappropriate for a server (not counting those who believe in cheap commodity hardware servers). Server hardware often contains extra features to improve their reliability and longevity that aren’t present in consumer hardware. These include error correcting codes that can automatically fix memory corruption, redundant file systems that require several independent hard disks, and specialized CPUs with support for multi-CPU configurations and 24/7 continuous usage.

A server with nominally the same capacity as your home desktop might cost $600 a month to operate, including the massive network infrastructure, housing, HVAC, redundant power, and maintenance staff. Your household computer, on the other hand, doesn’t have to run continuously day and night, and it isn’t important if a few bits of memory get flipped accidentally or if it breaks down after two years.

Some websites that provide online services are big enough that they have to rent out enterprise-class equipment, but not large or important enough that they can start billing their customers. These guys who get stuck in the middle often have no other choice but to sell ad space to pay the bills. Websites aren’t exactly cash cows either. In fact, most big websites spend their first couple years in the red. (Tumblr reportedly only made $14 million in revenue in 2012 while spending $25 million in operating costs.)

3. Some advertisements are actually useful

If you can honestly say that online advertising has never ever contributed any value to your life, then:

  1. You probably don’t buy things on the Internet, and
  2. You’re probably not worth very much to advertisers anyway.

That’s totally fine, but you should understand that a majority of internet users do buy things regularly, whether it’s online at Amazon.com or in person at Safeway. Advertisers run ads for them, not for you.

Ads themselves do a few different things in a few different ways. Some ads are political in nature, but the vast majority are commercial advertisements designed to get you to buy or donate. Ads cost a lot of money, so they need to earn money in return. That’s just how it works. There are a few ways that advertisers go about this:

  • Some ads try to promote brand awareness. Companies sponsor these ads to convince you to choose them over their competition—imagine an advertisement for Coke.
  • Some ads announce reasons why you should buy something immediately (seasonal sales, special offers, advantages, testimony, etc.).

These are generic ads that usually don’t provide much value to the consumer (that’s you), but make up the majority of ads shown anyway. This is usually because they apply equally well to everybody. (Generic ads also encompass the pop-up and pop-under ads that are just plain annoying.) You may have also seen:

  • Ads for online education websites or study tools placed on Sparknotes, or
  • Ads for public cloud solutions or VPS services placed on a blog for system administrators, or
  • Ads for sword replicas or a fantasy RPG placed on a fan wiki for your favorite fantasy novel series.

These are known as contextual ads and are shown because they’re related to the content on whatever page you’re visiting. (The examples were taken from my open browser tabs..) They are slightly more valuable than generic ads, because they fit in with the content on the page. And finally, you may have before encountered this special kind of ad:

  • An advertisement for motherboards and microcontrollers you might buy, because you looked at a few on Amazon last week.

Ads like this one are targeted ads. They range from a subtle “hey these gadgets could be cool to have” to a more pronounced “you put these earbuds in your shopping cart, do you still want them??”, depending on how they were targeted. Targeted ads provide excellent value, because most of the time, they’re shown after you have already demonstrated a commercial interest in some sort of product or service, like a music album or a haircut. Once you’re interested, most companies believe you only need a little nudging before you’ll buy something.

(It’s true that targeted ads might lead you to spend more money, but if you’re having problems being frugal, blame your lack of self-control, not the advertisers.)

The ad industry today is pushing every which way they can to adopt ad targeting in all channels, and for a good reason too: the more value an ad has to you, the more value it has to the advertiser. However, a lot of people are afraid of targeted ads because suddenly, advertisers are tracking everything you buy, everything you search online, and every website you visit in an attempt to figure out what you’re interested in buying.

The truth is that most everything you do online is already being tracked. However, the data recorded is usually scattered in a bunch of different places and is never analyzed for trends. Websites maintain logs of their visitors. Department stores keep databases of purchase history (especially if you’ve got their loyalty rewards card). The means to analyze your habits already exist and have existed for a long time. It’s ultimately up to you to use incognito browsing if you don’t want certain searches or websites to be used for targeting.

The fewer generic ads and the more targeted ads we have, the fewer gimmicks are needed to get your attention and the better our advertisement experience gets (we’ve already come a long way from pop-ups). In an ideal world, we wouldn’t need ads at all. However, reality doesn’t work like that, and most people agree that pure HTML Amazon ads are much better than flashing pop-up ads with auto-playing audio.

Still not convinced?

Like I said before, it’s up to you whether you want to use ad blocking software or not, but hopefully you now at least see why advertising is not entirely bad and why it is important to have. If you’re a computer security enthusiast like me, you’ll ditch ad blockers for the simple reason that they hook onto every web page you visit. If not, just stop complaining about advertisements, because things would be a lot worse without them.

Email Basics: The Internet Message Format (Wed, 26 Jun 2013)
https://rogerhub.com/~r/code.rogerhub/infrastructure/300/email-basics-the-internet-message-format/

Note: This is part 2 of a two-part segment on Email Basics. You can also read part 1 of this segment, which deals with the Simple Mail Transfer Protocol.

In part 1 of this two-part segment, I mentioned that the goal of SMTP was to ensure the responsible delivery of a digital package, but also that SMTP does not care what kind of data the package contains. In fact, a good implementation of SMTP should accept any sequence of bits as a valid package so long as it ends in <CRLF>.<CRLF>, which is 0D 0A 2E 0D 0A in hexadecimal. In practice, this package of data usually follows the RFC822 standard or its successors, which define the format of “Internet Text Messages”, more commonly thought of as Email.

RFC822 and SMTP were designed to be very robust protocols. They were designed to:

  • keep memory usage contained in the case of Email servers that may run for thousands of days
  • be human-readable for debugging purposes
  • support backward-compatibility as new features are added

These properties will become clear as we talk about RFC822. Before I start, you should know that you can view the full, original RFC822 version of any email you have ever received. These source messages are available in Gmail with the “Show Original” dropdown menu option and in Yahoo Mail with the “Show Headers” option. If you open up one of your emails, you will notice a few things:

  • The actual email is at the bottom. The top has a bunch of Key: Value pairs, some of which you recognize and all of which are in English.
  • At the bottom, your email is repeated twice. One version looks like a normal text email, but the other one has a bunch of HTML-like tags embedded in it.
  • (If there are any attachments) There is a huge chunk of gibberish at the bottom.

(If you’ve ever worked with HTTP headers, then Email headers should look very familiar to you. They work mostly the same way, with some small differences.)

Email headers are bits of information that describe an email. They appear at the top of an email in Key: Value pairs. They can span more than 1 line by prefixing the continuing lines with one or more whitespace characters. Email headers work on the Collapsing White Space (CWS) idea, like HTML does. Any number of spaces, tabs, CRLF’s, or sequences of whitespace are treated as a single space character. Email headers include the familiar From:, To:, Subject:, and Date: headers, but they also include these important pieces of information: (try to find these in your own emails!)

  • Message-ID – Uniquely identifies the email so that we can have things like threads using emails. They look like email addresses, but have a more restrictive syntax.
  • References or In-Reply-To – List of Message-IDs that this email is related to. Used for making threads.
  • Received – This header is added on to the beginning of the email headers whenever the email changes hands (by means of SMTP or other means). It contains information about the receiving server and the sending server, as well as the time and protocol variables. The newest ones are always at the top, not the bottom.
  • DKIM-Signature – A PKI signature by the Email’s sender’s mail server that guarantees the email has not been tampered with since it left its origin. More on this later.
  • X-* – Any header that begins with X- is a non-standard header. However, non-standard headers may serve some very important functions. Sometimes, standards begin as X- non-standard headers and slowly make their way into global adoption.

After the email headers is a completely empty line. This marks the border between the headers and the email body. Most emails today are transferred using the multipart/alternative content-type. The body of an email is sent in 2 forms: one is a plain text version that can be read on command-line email clients like mutt, and the other is a HTML version that provides greater functionality in graphical email clients like a web application or desktop email application. It is up to the email program to decide which one to show.

Attachments are typically encoded in Base64 before they can be included in an email. Base64 uses 64 different characters to represent data: lowercase and uppercase letters, numbers, and the + and / symbols. Attachments just follow as another block in the multipart encoding of the email body. (Because Base64 turns every 3 bytes of data into 4 characters, attachments end up about a third larger than the original files when they are sent with an email.)

Here’s an example RFC822 email that demonstrates everything up to now:

Received: by castle from treehouse for
    <lemongrab@castle>; Tue, 25 Jun ...
Message-ID: <haha+man@treehouse>
Date: Tue, 25 Jun 2013 11:29:00 +0000
Subject: Hey!
Return-Path: Neptr <neptr@mybasement>
From: Finn the Human <fth@treehouse>
To: Lemongrab <lemongrab@castle>
Content-Type: multipart/alternative; boundary=front

--front
Content-Type: text/plain; charset=UTF-8

mhm.. mhmm...
You smell like dog buns.

--front
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

<p>mhm.. mhmm...<br />
You smell like dog buns.</p>

--front--

When your mail server receives an email, it has no way of knowing if the sender is really who he or she claims to be. All the server knows is who is currently delivering the email. To help discourage people from forging these email headers, two infrastructural standards were developed: SPF and DKIM.

Both of these standards depend on special pieces of information made available in a domain’s DNS records. SPF (Sender Policy Framework) lets a domain define which IP addresses are permitted to deliver email on its behalf. (This works because, in reality, a TCP connection is usually established directly between the origin and the destination servers without any intermediate servers.) Email from a server that isn’t in a domain’s SPF record is considered suspicious. Gmail displays the result of a successful SPF verification by saying an email was Mailed by: ... a domain.

To see a server’s SPF record, perform the following:

$ nslookup -q=txt tumblr.com
> ...

DKIM is public-key cryptography applied to email. A server releases one or more public keys through its DNS records. Each public key is available through a TXT DNS lookup at key._domainkey.example.com.

$ nslookup -q=txt 20120113._domainkey.gmail.com
> 20120113._domainkey.gmail.com	text = "k=rsa\; p=MIIB
> ...

The origin server declares which headers are “signed headers” and then applies its signature to both the signed headers and the email body. Gmail displays the result of a successful DKIM verification by saying an email was Signed by: ... a domain.

It is important to note that any Internet-connected computer can act as an origin server for email. However, only servers designated by a DNS MX record can be configured to act as the receiving server. When you are designing your own email-enabled web applications, it is important to keep these authentication mechanisms in mind to ensure that your email comes only from trustworthy SMTP origin servers and that you take proper actions to prevent your emails from being classified as spam.

Email Basics: Introduction to SMTP (Tue, 25 Jun 2013)
https://rogerhub.com/~r/code.rogerhub/infrastructure/262/email-basics-introduction-to-smtp/

Note: This is part 1 of a two-part segment on Email Basics. You can also read part 2 of this segment, which deals with the internal format of emails.

With all of the government spying scares circulating recently, many people have started taking another look at the companies and people responsible for their email. Like most technology today, email is easy to use, but difficult to understand, and despite email’s widespread adoption, most people only have the vague notion of massive computers routing messages over some obscure protocol. It seems to be getting harder and harder to trust a third party with something as personal and sensitive as your email account, but a lot of this mistrust comes from a fundamental lack of understanding of how email works.

Email runs parallel and independently of the World Wide Web, or what we’ve come to think of as the Internet. In reality, websites only make up one of many different application-level protocols that make up the whole of all Internet traffic. Email is actually made up of 2 separate protocols that work together, but were designed to be able to operate independently. These are: SMTP, which describes how email is passed along to its destination, and RFC822, which describes the abilities and format of email. These two are Internet Standards, published by the IETF (Internet Engineering Task Force), and anybody who wants to be able to send or receive email follows them.

The SMTP protocol is defined in a document known as RFC821 (which is right before RFC822). I left RFC822 for part 2 of this blog series and will focus on describing SMTP here.

SMTP is responsible for ensuring delivery of a package. It couldn’t care less about what that package contained or how it was structured, so long as it was delivered. SMTP runs on the TCP protocol, which is also used for things like websites. The TCP protocol is able to distinguish between somebody requesting a website and something sending an email by assigning to each of its applications one or more port numbers. SMTP has been assigned port 25.

The nifty thing about TCP is that it provides a lot of things for free and takes care of all the messy business. Some of these things include:

  • Sending messages of any size
  • Guaranteeing that everything you send will be delivered
  • Providing 2-way reliable communication without stuff getting messed up in the middle

For this reason, a lot of Internet stuff is built on top of TCP to take advantage of all the existing infrastructure that supports it. So what goes through TCP? Well, just about anything. You can send anonymous encouragement like so:

   You: $ nc g.co 80
   You: Hey you! You're the best!
Server: ...

You can receive messages back in the same way. In this example, there is a distinction between the client (You) and the server (computer that receives your messages). However, TCP is completely symmetric once the initial connection is established: both you and the server can send data, receive data, and close the connection.

The SMTP Conversation

Package delivery through SMTP works like a conversation. This conversation takes place between a client, who has a package to deliver, and a server, who is receiving the package. Once the package is received by the server, the client can forget about it and the server assumes full responsibility for the package and its delivery. In this way, servers sometimes become clients themselves if they need to forward packages on to other servers.

Like any good conversation, SMTP begins with an introduction by both parties:

Server: 220 localhost Greetings
   You: HELO roger
Server: 250 localhost

In case you’ve got bad memory, the server gives you its name twice. Once when you open the connection, and again after you introduce yourself with a HELO. The names used here are special names known as hostnames. In real-world SMTP, these hostnames are usually fully-qualified domain names (FQDN’s) and look like mail.rogerhub.com.

You’ve both said helo. Now we get down to business. The usual thing to do at this point is to tell the server about the package you’re delivering. Of course, if you’re shy you can just walk away.

   You: QUIT
Server: 221 Service closing transmission channel

To declare that you’ve got a package to deliver, the MAIL FROM command is used. The command is followed with a single colon and then the email address of the sender.

   You: MAIL FROM:Finn the Human <finn@treehouse>
Server: 250 OK

Next, you declare the recipients of the package with RCPT TO. If there are more than one, which happens often, then you declare each recipient one at a time.

   You: RCPT TO:Jake the Dog <jake@treehouse>
Server: 250 OK
   You: RCPT TO:BMO <bmo@treehouse>
Server: 250 OK

So far so good. We are not far from the end. Once the formalities are exchanged, you can begin delivering the package with the DATA command. After you’ve delivered the whole package, you tell the server that you’re done by sending a single period on its own line.

   You: DATA
Server: 354 Start mail input; end with <CRLF>.<CRLF>
   You: Jake,

        Left the house for a few hours. Will be back
        soon.

        Not joking,
        Finn
        .
Server: 250 OK

That’s it! At this point, you can say goodbye with QUIT or send another email. (These command responses are straight from a demo SMTP server implementation on my GitHub.)

But this can’t be all there is! Is there more to SMTP? You bet there is!

In fact, SMTP and RFC822 have both seen many revisions to the specification over the years. One of the major changes to the specification is in the initial greeting. Instead of HELO, an alternate greeting EHLO, or extended helo, was proposed. When you greet an EHLO-aware server with EHLO, it will send you an extended SMTP (or ESMTP) response like so:

Server: 220 localhost Greetings
   You: EHLO roger
Server: 250-localhost
Server: 250-PIPELINING
Server: 250 SIZE 10000000

The server sends a list of SMTP extensions that it supports. These include things like STARTTLS, which provides a way to upgrade a plaintext connection (like all the transmissions seen here) to an anonymous, encrypted TLS session, which is more resilient to snooping. Both the client and the server have to support an extension before you can use it. Extensions open up the path to some extra-cool stuff like authentication and UTF-8 support.
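
You can watch the STARTTLS upgrade happen yourself with openssl’s s_client, which speaks just enough SMTP to issue the command before handing the encrypted session over to you (substitute one of the Mail Exchangers you find with nslookup below):

$ openssl s_client -starttls smtp -crlf -connect mx.example.com:25

After the TLS handshake finishes, you can type EHLO, MAIL FROM, and the rest of the conversation over the now-encrypted channel.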

You can try out this whole process by sending an email to a friend, just like a real mail server would do. The first step is to find the IP address of the server to contact! The hostname of your request is the hostname of the email address (e.g. gmail.com), and the query type is mx, which stands for Mail Exchanger.

$ nslookup -q=mx gmail.com
...

After that, you just need to open a TCP connection to one of the listed Mail Exchangers. You can do this with either netcat or telnet, depending on which is available on your operating system.

$ nc example.com 25       # netcat
$ telnet example.com 25   # telnet

Fire away! (But don’t be surprised if your phony emails are discarded as spam.. Haha)
