Code.RogerHub » Infrastructure
The programming blog at RogerHub

Writing essays in VIM and LaTeX
Fri, 27 Mar 2015

I’m only taking one humanities course this semester, so I don’t have that many essays to write. But for the few assignments in that course, I’m planning to use VIM and LaTeX to write my essays. I have Pages.app installed on my laptop, and I already use Google Docs for stuff, but I’m still more comfortable writing in VIM, even when I’m not writing code.

Smart quotes

LaTeX handles smart quotes like you’d expect it to: write quotes with the backtick and apostrophe convention in the text of your document, and they get turned into proper left and right quotes in the final PDF.
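
The convention looks like this:

% Backticks open a quotation; apostrophes close it.
``double-quoted text'' and `single-quoted text'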

Spell checking

I just use VIM’s built-in spell-checking feature. Make sure you use it with MacVim or GVim, or else the highlighting looks terrible. There’s also a built-in way to bring up the list of correction suggestions; the relevant commands are below.
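
For reference:

:setlocal spell spelllang=en_us
" Then, in normal mode: ]s jumps to the next misspelled word,
" z= brings up the suggestions list for the word under the cursor,
" and zg marks that word as correctly spelled.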

Rich formatting

You can do bold and italics and lists in LaTeX. I like it even better than the WYSIWYG editors, since you can do line manipulation and fancy macro stuff in VIM too.
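
The markup is simple enough:

\textbf{This is bold}, \textit{this is italic}, and lists are plain:
\begin{itemize}
  \item The first point
  \item The second point
\end{itemize}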

Instant feedback

You can combine inotify-tools with your LaTeX compiler to get basically instant feedback: every time you save the file, it re-compiles your LaTeX, and Preview.app reloads the PDF. I’m just sharing my home folder through VirtualBox shared folders (configured in Vagrant), so I can work with the files directly on the VM.
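
A minimal watch loop looks something like this (the file name is a placeholder):

# Recompile on every save (requires inotify-tools)
while inotifywait -e close_write essay.tex; do
    pdflatex -interaction=nonstopmode essay.tex
done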

MongoDB power usage on laptop
Fri, 27 Mar 2015

I’m setting up a Chef-based Ubuntu VM on my laptop, so I can get sandboxing and all the usual Ubuntu tools. Most things work on OS X too, but it’s a second-class citizen compared to poster child Ubuntu. I have Mongo and a few other services set up there for school work, and I noticed that Mongo regularly uses a significant amount of CPU time, even when it isn’t handling any requests.

I attached strace to it with strace -f -p 1056, and here’s what I saw:

[pid 1118] select(12, [11], NULL, NULL, {0, 10000} <unfinished ...>
[pid 1056] select(11, [9 10], NULL, NULL, {0, 10000} <unfinished ...>
[pid 1118] <... select resumed> ) = 0 (Timeout)
[pid 1056] <... select resumed> ) = 0 (Timeout)
[pid 1118] select(12, [11], NULL, NULL, {0, 10000} <unfinished ...>
[pid 1056] select(11, [9 10], NULL, NULL, {0, 10000} <unfinished ...>
[pid 1112] <... nanosleep resumed> 0x7fd3895319a0) = 0
[pid 1112] nanosleep({0, 34000000}, <unfinished ...>
[pid 1118] <... select resumed> ) = 0 (Timeout)
[pid 1056] <... select resumed> ) = 0 (Timeout)
[pid 1118] select(12, [11], NULL, NULL, {0, 10000} <unfinished ...>
[pid 1056] select(11, [9 10], NULL, NULL, {0, 10000} <unfinished ...>
[pid 1118] <... select resumed> ) = 0 (Timeout)
[pid 1056] <... select resumed> ) = 0 (Timeout)
[pid 1118] select(12, [11], NULL, NULL, {0, 10000} <unfinished ...>
[pid 1056] select(11, [9 10], NULL, NULL, {0, 10000} <unfinished ...>
[pid 1118] <... select resumed> ) = 0 (Timeout)
[pid 1056] <... select resumed> ) = 0 (Timeout)
[pid 1118] select(12, [11], NULL, NULL, {0, 10000} <unfinished ...>
[pid 1056] select(11, [9 10], NULL, NULL, {0, 10000} <unfinished ...>
[pid 1112] <... nanosleep resumed> 0x7fd3895319a0) = 0

I looked online for an explanation and found a bug report about MongoDB power usage. Apparently, Mongo uses select() with short timeouts as its timekeeping mechanism for everything it does with wall-clock time. Somebody proposed alternative methods that don’t require this kind of tight looping, but it looks like the Mongo team is busy with other things right now and won’t consider the patch.

So I added a clause to my Chef config that disables the MongoDB service from starting automatically on boot. I needed to specify the Upstart service provider explicitly, since the mongodb package installs both a SysV init script and an Upstart job.
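
The resource looks roughly like this (surrounding recipe omitted):

# Stop mongod and keep it from starting at boot. The provider has to be
# explicit because the package ships both SysV and Upstart entry points.
service "mongodb" do
  provider Chef::Provider::Service::Upstart
  action [:stop, :disable]
end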

Taking advantage of cloud VM-driven development
Thu, 10 Jul 2014

Most people write about cloud computing as it relates to their service infrastructure. It’s exciting to hear how Netflix, Dropbox, et al. use AWS to support their operations, but all of those large-scale ideas don’t really mean much for the average developer. Most people don’t have the budget or the need for enormous computing power or highly-available object storage like S3, but it turns out that cloud-based VMs can be highly useful to the average developer in a different way.

Sometimes, you see something like this one-liner installation script for Heroku Toolbelt, and you just get nervous:

wget -qO- https://toolbelt.heroku.com/install-ubuntu.sh | sh

Not only are they asking you to run a shell script downloaded over the Internet, but the script also asks for root privileges to install packages and such. Or maybe you’re reading a blog post about some HA database cluster software and you want to try it out yourself, but running 3 virtual machines on your puny laptop is out of the question.

To get around this issue, I’ve been using DigitalOcean machines for when I want to test something out but don’t want to go to the trouble of maintaining a development server or virtual machines. Virtualized cloud servers are great for this because:

  • They’re DIRT CHEAP. DO’s smallest machine costs $0.007 an hour. Even if you use it for 2 hours, it rounds down to 1 cent.
  • The internet connection is usually a lot better than whatever you’re using. Plus, most cloud providers have local mirrors for package manager stuff, which makes installing packages super fast.
  • Burstable CPU means that you can get an unfair amount of processing power for a short time at the beginning, which comes in handy for initially installing and downloading all the stuff you’ll want to have on your machine.

I use the tugboat client (a command-line Ruby app) to interface with the DigitalOcean API. To try out MariaDB Galera clustering, I just opened up three terminals and had three SSH sessions going. For source builds that have a dozen or more miscellaneous dependencies, I usually prepare a simple build script that I can upload and run on a cloud VM whenever I need it. When I’m done with a machine, I shut it down until I need one again a few days later.
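
The day-to-day tugboat workflow is only a handful of commands (check tugboat help for the exact flags in your version):

$ gem install tugboat
$ tugboat authorize         # one-time setup with your DigitalOcean credentials
$ tugboat create scratch    # boot a new droplet named "scratch"
$ tugboat ssh scratch
$ tugboat halt scratch      # shut it down when you're done for the day
$ tugboat destroy scratch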

Running development virtual machines in the cloud might not drastically change your workflow, but it opens up a lot of opportunities for experimentation and gives you massive computing resources when you want them. So load a couple dollars onto a fresh DigitalOcean account and boot up some VMs!

Getting a python-like shell for Perl
Wed, 26 Mar 2014

When learning a new language, I find it really helpful to be able to test out new syntax and language constructs without going through the trouble of creating a file and all the boilerplate that comes with it. Perl doesn’t have this capability built in, as far as I know, but there’s a great CPAN module that claims to do the same thing. It’s a short little module (about 100 SLOC), and it gives you a familiar prompt:

Perl 5.14.2 (Tue Feb  4 23:09:53 UTC 2014) [linux panlong 2.6.42-37-generic #58-ubuntu smp thu jan 24 15:28:10 utc 2013 x86_64 x86_64 x86_64 gnulinux ]
Type "help;", "copyright;", or "license;" for more information.
>>> 

It’s not very obvious how to set the thing up, though. I installed the module and its dependencies via cpanm, and then created this little snippet in a file called perlthon.pl:

#!/usr/bin/perl
use 5.010;
use strict;
use warnings;

use Perl::Shell;
Perl::Shell::shell();

Then, the shell started right up.
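
For completeness, installing the module itself is a one-liner, assuming you already have cpanminus:

$ cpanm Perl::Shell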

Self-contained build environments with Vagrant
Tue, 20 Aug 2013

Vagrant is a nifty piece of Ruby software that lets you set up virtual machines with an unparalleled amount of automation. It interfaces with a VM provider like VirtualBox, and helps you set up and tear down VMs as you need them. I like it better than Juju because there isn’t as much hand-holding involved, and better than vanilla Puppet because I don’t regularly deploy a thousand VMs at a time. At the Daily Cal, I’ve used Vagrant to help developers set up their own build environments for our software, where they can write code and test features in isolation. I also use it as a general-purpose VM manager on my home file server, so I can build and test server software in a sandbox.

You can run Vagrant on your laptop, but I think that’s the wrong piece of hardware for the job. Long-running VM batch jobs and build-environment VMs should run on headless servers, where you don’t have to worry about excessive heat, power consumption, heavy I/O, or keeping your laptop awake. My server at home is set up with:

  • 1000GB of disk space backed by RAID1
  • Loud 3000RPM fans I bought off a Chinese guy four years ago
  • Repurposed consumer-grade CPU and memory (4GB) from an old desktop PC

You don’t need great hardware to run a couple of Linux VMs. Since my server is basically hidden in a corner, the noisy fans are not a problem, and they actually do a great job of keeping everything cool under load. RAID mirroring will (I hope) provide high availability, and since the server’s data is easily replaceable, I don’t need to worry about backups. Setting up your own server is usually cheaper than persistent storage on public clouds like AWS, but your mileage may vary.

Vagrant configuration is a single Ruby file named Vagrantfile in the working directory of your vagrant process. My basic Vagrantfile just sets up a virtual machine with Vagrant’s preconfigured Ubuntu 12.04 LTS image. They offer other preconfigured images, but this is the one I’m most familiar with.

# Vagrantfile

Vagrant.configure("2") do |config|
  # Every Vagrant virtual environment requires a box to build off of.
  config.vm.box = "precise32"

  # The url from where the 'config.vm.box' box will be fetched if it
  # doesn't already exist on the user's system.
  config.vm.box_url = "http://files.vagrantup.com/precise32.box"

  config.vm.network :forwarded_port, guest: 8080, host: 8080

  # Enable public network access from the VM. This is required so that
  # the machine can access the Internet and download required packages.
  config.vm.network :public_network

end

For long-running batch jobs, I like keeping a CPU Execution Cap on my VM’s so that they don’t overwork the system. The cap keeps the temperature down and prevents the VM from interfering with other server processes. You can add an execution cap (for VirtualBox only) by appending the following before the end of your primary configuration block:

# Adding a CPU Execution Cap

Vagrant.configure("2") do |config|
  ...

  config.vm.provider "virtualbox" do |v|
    v.customize ["modifyvm", :id, "--cpuexecutioncap", "40"]
  end

end

After setting up Vagrant’s configuration, create a new directory containing only the Vagrantfile and run vagrant up to set up the VM. Other useful commands include:

  • vagrant ssh — Opens a shell session to the VM
  • vagrant halt — Halts the VM gracefully (Vagrant will connect via SSH)
  • vagrant status — Checks the current status of the VM
  • vagrant destroy — Destroys the VM

Finally, to set up the build environment automatically every time you create a new Vagrant VM, you can write provisioners. Vagrant supports complex provisioning frameworks like Puppet and Chef, but you can also write a provisioner that’s just a shell script. To do so, add the following inside your Vagrantfile:

Vagrant.configure("2") do |config|
  ...

  config.vm.provision :shell, :path => "bootstrap.sh"
end

Then just stick your provisioner next to your Vagrantfile, and it will execute every time you provision the VM. You can write commands to fetch package lists and upgrade system software, or to install build dependencies and check out source code. By default, Vagrant’s working directory is also mounted inside the guest as /vagrant in the file system root, so your provisioner can refer to its other dependencies that way.
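
A shell provisioner can be as simple as this (the package list and file names are just examples):

# bootstrap.sh -- runs as root inside the guest
apt-get update
apt-get install -y build-essential git
# Files that sit next to the Vagrantfile are reachable through the shared folder:
cp /vagrant/app.conf /etc/app.conf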

Vagrant uses a single public/private keypair for all of its default images. The private key usually lives in your home directory as ~/.vagrant.d/insecure_private_key. You can add it to your ssh-agent and open your own SSH connections to your VM without Vagrant’s help.
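
By default, the guest’s SSH port is forwarded to port 2222 on the host, so this is all it takes:

$ ssh-add ~/.vagrant.d/insecure_private_key
$ ssh -p 2222 vagrant@localhost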

Even if you accidentally mess up your Vagrant configuration, you can use VirtualBox’s built-in command-line tools to fix boot configuration issues or ssh daemon issues.

$ VBoxManage list vms
...
$ VBoxManage controlvm <name|uuid> pause|reset|poweroff|etc
...
$ VBoxHeadless -startvm <name|uuid> --vnc
...
(connect via VNC)

The great thing about Vagrant’s VM-provider abstraction layer is that you can grab the VM images from VirtualBox and boot them on another server with VirtualBox installed, without Vagrant at all. Vagrant is an excellent support tool for programmers (and combined with SSH tunneling, it is great for web developers as well). If you don’t already have support from some sort of VM infrastructure, you should look into what Vagrant can do for you.

elementary OS, a distribution like no other
Sun, 18 Aug 2013

There are a surprising number of people who hate elementary OS. They say that elementary is technically just a distribution of Linux, not an OS. They say that it is too similar to OS X. They say that the developers are in over their heads. All of these things may be true, but I do not care. I am sick of ascetic desktop environments without animations. I am tired of not having a compositor. I don’t need a dozen GUI applications to hold my hand, but I hate having to fix things that don’t work out of the box. You are not Richard Stallman. Face it: modern people, whether or not they are also computer programmers, do not live in the terminal, and most of us aren’t using laptops with hardware built in 2004. And finally, I am sick of blue and black window decorations that look like they were designed by a 12-year-old.

Elementary OS Luna is the first distribution of Linux I’ve used where I don’t feel like changing a thing. The desktop environment defaults are excellent, and all of Ubuntu’s hardware support, community PPAs, and familiar package manager are available. At the same time, there is a lot of graphical polish, window animation, and attention to detail, quite similar to OS X. There is a minimal amount of hand-holding, and it’s quite easy to get used to the desktop because of the intuitive keyboard shortcuts and great application integration.

A screenshot alone can’t show you this, but elementary OS is more than just a desktop environment. The distribution comes packaged with a bunch of custom-built applications, like its own file manager and terminal app. Other apps, like an IRC client, a social media client, and system search, are available in community PPAs. I do most of my work through the web browser, ssh, or vim. Important data on my laptop is limited to personal files on my Google Drive, and a directory of projects in active development that’s regularly backed up to my personal file server. Programs can be painlessly installed from the package manager, and configuration files are symlinked from my Google Drive. I’m not very attached to the current state of my laptop at any given moment, because all the data on it is replaceable, with the exception of OS tweaks. I don’t like having to install a dozen customization packages to get my workflow how I like it, so the out-of-box experience is very important to me.

I would say that if you’re a regular Linux user, you should at least give elementary OS Luna a try, even if it’s on a friend’s machine or in a VM. You may be surprised. I was.

Setting up and testing two bridged wifi routers
Sun, 11 Aug 2013

The walls and microwaves of my house have always conspired to cripple the wifi signal in some rooms, especially the ones downstairs and the backyard. I recently got another wifi router to extend the range. The routers are daisy-chained from the modem with ethernet cables. My servers and printers are connected to the router in the middle, so it takes responsibility for forwarding ports to internal services and for static IP/MAC bindings. The router at the end of the chain is just there for range. Here’s a quick record of how I set this up and tested it.

I set up the routers through their web interfaces over ethernet on my laptop. Here are some things to double check before you hook up the devices:

  1. The secondary router is set to receive its WAN configuration from DHCP. I tried a static configuration, but it refused to connect for reasons unknown.
  2. If you need to migrate settings (especially between routers of different models or brands), write down all the configuration beforehand: forwarded services, IP/MAC bindings, DHCP and subnet ranges, QoS, static routing, and anything else you’re using.

After the devices are set up and hooked up in their proper positions, perform a quick AP scan with your wireless card:

$ sudo iwlist wlan0 scan
wlan0   Scan completed:
        Cell 01 - Address:  XX:XX:XX....
                  Channel:  ...

There should be two (or more) access point results, one for each of your routers. Configure your local wireless card to connect to each router in turn by specifying its MAC address in your OS’s network configuration. Then run diagnostics and make sure the connection is operational:

$ ip addr
... (an assigned address on the correct subnet) ..
$ curl ifconfig.me
... (your public IP) ..
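
As an aside, if your system runs wpa_supplicant directly, you can pin a connection to a particular AP by its BSSID with a block like this (the SSID and passphrase are placeholders):

# /etc/wpa_supplicant/wpa_supplicant.conf
network={
    ssid="home-wifi"
    bssid=XX:XX:XX:XX:XX:XX    # replace with the target router's MAC
    psk="passphrase"
}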

That’s it. Now breathe easier knowing you can watch Netflix in the yard without suffering degraded streams. Hoorah.

Signing your own wildcard SSL/HTTPS certificates
Tue, 06 Aug 2013

There has been a lot of concern about online privacy in the past few weeks, and lots of people are looking for ways to better protect themselves on the Internet. One thing you can do is create your own HTTPS/SSL Certificate Authority. I have a bunch of websites on RogerHub that I want to protect, but I am the only person who needs HTTPS access, since I manage all of my own websites. So, I’ve been using a self-signed wildcard certificate, trusted by my own web browser, to access my websites securely. You can do this too with a few simple steps:

First, you will need to generate a cryptographic private key for your certificate authority (CA):

$ openssl genrsa -out rootCA.key 2048

Certificate authorities in HTTPS have a private key and a matching public CA certificate. You should store this private key in encrypted storage, because you will need it again if you ever want to generate more certificates. Next, you will need to create a public CA certificate and provide some information about your new CA. If you are the sole user of your new CA, then this information can be set to whatever you want:

$ openssl req -x509 -new -nodes -key rootCA.key -days 9999 -out rootCA.pem

Let me explain a bit about these two commands. The openssl command-line utility takes a subcommand as its first argument. The subcommands genrsa and req are used to create RSA keys and to manipulate certificate signing requests, respectively. Normally, when you go out and purchase an expensive SSL certificate, you generate a key yourself and create a Certificate Signing Request (CSR), which is given to your certificate authority to approve. The CA then sends back a public SSL certificate that is presented to website clients. Generating the key yourself means that the private key never passes through the hands of your CA, so only you have the ability to authenticate using your SSL certificate.

The x509 switch in the command above signifies that you want a self-signed certificate, not a Certificate Signing Request. The nodes switch actually means no DES, which means that the resulting private key will not be encrypted with a passphrase. In this case, encrypting the key is not necessary, since you are the only party involved.

You have just created a new Certificate Authority key and certificate pair. At this point, you can import the rootCA.pem certificate into your web browser and instruct the browser to trust it when identifying websites. From then on, your browser will accept any website certificate that is signed by your new CA. Now you’re ready to make some website certificates. Create a private key and Certificate Signing Request for your new website as follows:

$ openssl genrsa -out SITE.key 2048

....

$ openssl req -new -key SITE.key -out SITE.csr

....

> Common Name ... : (put the name of your domain here)

Remember to put the domain name of your website as the common name when it asks you for it. Once you’ve completed the certificate signing request, you need to use your new Certificate Authority to issue a certificate for this new key:

$ openssl x509 -req -days 9999 -in SITE.csr -CA rootCA.pem -CAkey rootCA.key -CAcreateserial -out SITE.crt

You’ve created a functional SSL certificate and keypair for your website! You can configure your web server to use the new site key and certificate, and it will begin serving resources over HTTPS. However, this certificate will only work for the single domain you specified in your CSR. If you want one certificate that works for multiple domains (called a wildcard certificate or a multi-domain certificate), you will need a few more steps. Create a file named SITE.cnf, and put the following inside:

[req_distinguished_name]
countryName = Country Name (2 letter code)
stateOrProvinceName = State or Province Name (full name)
localityName = Locality Name (eg, city)
organizationalUnitName	= Organizational Unit Name (eg, section)
commonName = Common Name (eg, YOUR name)
commonName_max	= 64
emailAddress = Email Address
emailAddress_max = 40

[req]
distinguished_name = req_distinguished_name
req_extensions = v3_req

[v3_req] 
keyUsage = keyEncipherment, dataEncipherment
extendedKeyUsage = serverAuth
subjectAltName = @alt_names

[alt_names]
DNS.1 = rogerhub.com
DNS.2 = *.rogerhub.com

Under the last block, you can list as many domains, with or without wildcards, as you want. Then use the following command to generate your CSR from this config:

$ openssl req -new -key SITE.key -out SITE.csr -config SITE.cnf

Now, run the following to generate your wildcard HTTPS certificate, instead of the last command above:

$ openssl x509 -req -days 9999 -in SITE.csr -CA rootCA.pem -CAkey rootCA.key -CAcreateserial -extensions v3_req -out SITE.crt -extfile SITE.cnf

The v3_req block above enables the X.509 extensions, most importantly subjectAltName, that allow one certificate to work for more than one website. One of the flags in the certificate creation command is CAcreateserial. It creates a new file named rootCA.srl whose contents are updated every time you sign a certificate. You can use CAcreateserial the first time you sign a website certificate, but thereafter, you will need to provide that serial file when you sign more certificates. Do this by replacing -CAcreateserial with -CAserial rootCA.srl in the final command. A lot of the concepts here only hint at the greater complexity of HTTPS, openssl, and cryptography in general. You can learn more by reading the relevant RFCs and the openssl man pages.
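
To put the results to work, an nginx configuration using the new key and certificate looks something like this (the paths are placeholders):

server {
    listen 443 ssl;
    server_name rogerhub.com *.rogerhub.com;
    ssl_certificate     /etc/ssl/private/SITE.crt;
    ssl_certificate_key /etc/ssl/private/SITE.key;
    ...
}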

Backing up my data as a Linux user
Tue, 16 Jul 2013

It’s a good habit to routinely back up your important data, and over the past few years, dozens of cloud storage and backup solutions have sprung up, many of which offer a good deal of free space. Before you even start looking for a backup solution, you need to sit down and think about what kind of data you’re looking to back up. Specifically: how much data do you have, how often does it change, and how desperately do you need to keep it safe?

I have a few kinds of data that I actively back up and check for integrity. (Don’t forget to verify your backups, or there isn’t any point in making them at all.) Here are all the kinds of data that might be on your computer:

  • Program code – irreplaceable, small size (~100MB), frequently updated
  • Documents, and personal configuration files – irreplaceable, small size (~100MB), regularly updated
  • Personal photos – mostly irreplaceable, large size (more than 10GB), append only
  • Server configuration and data – mostly irreplaceable, medium size (~1GB), regularly updated
  • Collected Media – replaceable if needed, medium size (~1GB), append only
  • System files – easily replaceable, medium size (~1GB), sometimes updated

Several backup solutions try to back up everything. This is not a good idea. First, there are a lot of files on your computer that are easily replaceable (system files) and others that you’d rather not keep in your backup archives (program files). Second, those solutions have no way of giving extra redundancy to the things that matter most, and less redundancy to the things that matter less.

In addition to these files, here are some types of data that you might not usually think about backing up:

  • Email
  • RSS and Calendar data
  • Blog content
  • Social networking content

My backup solution is a mix of free online version control sites, Google, Dropbox, and a personal file server. My code, documents (essays, forms, receipts), and configuration (bash, vim, keys, personal CA, wiki, profile pictures, etc.) are the most important part of my backup. I sync these with Insync to my Google Drive, where I’ve rented 100GB of cloud storage. My Google Drive is regularly backed up to my personal file server, with about two weeks of retention.
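
That sort of retention doesn’t require anything fancier than dated rsync snapshots plus a cleanup pass; here’s a sketch (all paths are placeholders):

# Run daily from cron on the file server
SRC=/home/roger/GoogleDrive
DEST=/srv/backup/gdrive/$(date +%F)
rsync -a "$SRC/" "$DEST/"
# Expire snapshots older than two weeks
find /srv/backup/gdrive -mindepth 1 -maxdepth 1 -type d -mtime +14 -exec rm -rf {} +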

Disks and old computers are cheap. Get a high-availability file server set up in your home, and you can happily offload intensive tasks to it like virtual machines, backup services, and archival storage. Mine is configured with:

  • Two 1TB hard drives configured in RAID-1 mirroring
  • Excessive amounts of RAM and processing power, for a headless server
  • Ubuntu Server with SMART monitoring, nagios, nginx (for some web services), and a torrent client
  • Wake-on-LAN capability

Backing up your Google Drive might sound funny to you, but it is a good precaution in case anything ever happens to your Google account. Additionally, most of my program code is in either a public GitHub repository or a private BitBucket repository. Version control and social coding features like issues and pull requests give you benefits beyond simply backing up your code, and you should definitely be using some kind of VCS for any code you write.

For many of the projects I’m actively developing, I only use VCS and my file server. Git object data should not be backed up to cloud storage services like Google Drive, because it changes too often. My vim configuration is also stored on GitHub, to take advantage of git submodules for my vim plugins.

My personal photos are stored in Google+ Photos, formerly known as Picasa. It gives you 15GB of shared storage for free, and if that’s not enough, additional space is cheap as dirt. My photos don’t have another level of redundancy like my code and configuration files do. They are less important to me, and Google can be trusted to sustain itself longer than any backup solution I could build myself.

I host a single VPS with Linode that contains a good amount of irreplaceable data from my blogs and the other services I host on it. Linode offers cheap and easy full-disk backups ($5/mo.), which I signed up for. Those backups guard against human error more than hardware failure, because Linode already maintains highly-available redundant disk storage for all of its VPS nodes. Additionally, I back up the important parts of the server (/etc, /home, /srv, /var/log) to my personal file server, for an extra level of redundancy.

Any pictures I collect from online news aggregators are dumped into my Google Drive and share the same extra redundancy as my documents and personal configuration files. Larger media, like videos, are stored on one of my USB 3.0 flash drives, since they are regularly created and deleted.

I don’t back up system files, since Xubuntu is free and programs are only one package-manager command away. I don’t maintain extra redundancy for email, for the same reason I don’t for photos.

A final thing to consider is the confidentiality of your backups. Whenever you upload data to a free public cloud storage service, you should treat it as if it were being anonymously released to the public. In other words, personal data, cryptographic keys, and passwords should never be uploaded unencrypted to a public backup service. Tools like PGP can help in this regard.
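
For example, gpg can symmetrically encrypt an archive with a passphrase before it goes anywhere public:

$ gpg --symmetric --cipher-algo AES256 documents.tar.gz
$ # produces documents.tar.gz.gpg; recover it with:
$ gpg --output documents.tar.gz --decrypt documents.tar.gz.gpg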

Email Basics: The Internet Message Format
Wed, 26 Jun 2013

Note: This is part 2 of a two-part segment on Email Basics. You can also read part 1 of this segment, which deals with the Simple Mail Transfer Protocol.

In part 1 of this two-part segment, I mentioned that the goal of SMTP is to ensure the responsible delivery of a digital package, and that SMTP does not care what kind of data the package contains. In fact, a good implementation of SMTP should accept any sequence of bits as a valid package, so long as it ends in <CRLF>.<CRLF>, which is 0D 0A 2E 0D 0A in hexadecimal. In practice, this package of data usually follows the RFC822 standard or its successors, which define the format of “Internet Text Messages”, more commonly thought of as email.

RFC822 and SMTP were designed to be very robust protocols. They were designed to:

  • keep memory usage contained in the case of Email servers that may run for thousands of days
  • be human-readable for debugging purposes
  • support backward-compatibility as new features are added

These properties will become clear as we talk about RFC822. Before I start, you should know that you can view the full, original RFC822 version of any email you have ever received. These source messages are available in Gmail with the “Show Original” dropdown menu option and in Yahoo Mail with the “Show Headers” option. If you open up one of your emails, you will notice a few things:

  • The actual email is at the bottom. The top has a bunch of Key: Value pairs, some of which you recognize and all of which are in English.
  • At the bottom, your email body appears twice. One version looks like a normal text email, but the other has a bunch of HTML tags embedded in it.
  • (If there are any attachments) There is a huge chunk of gibberish at the bottom.

(If you’ve ever worked with HTTP headers, then Email headers should look very familiar to you. They work mostly the same way, with some small differences.)

Email headers are bits of information that describe an email. They appear at the top of an email in Key: Value pairs. A header can span more than one line by prefixing the continuation lines with one or more whitespace characters. Header values use folding white space, much like HTML’s collapsing whitespace: any run of spaces, tabs, or CRLFs is treated as a single space character. Email headers include the familiar From:, To:, Subject:, and Date: headers, but they also include these important pieces of information (try to find these in your own emails!):

  • Message-ID – Uniquely identifies the email so that we can have things like threads using emails. They look like email addresses, but have a more restrictive syntax.
  • References or In-Reply-To – List of Message-IDs that this email is related to. Used for making threads.
  • Received – This header is added on to the beginning of the email headers whenever the email changes hands (by means of SMTP or other means). It contains information about the receiving server and the sending server, as well as the time and protocol variables. The newest ones are always at the top, not the bottom.
  • DKIM-Signature – A public-key signature, applied by the sender’s mail server, that guarantees the email has not been tampered with since it left its origin. More on this later.
  • X-* – Any header that begins with X- is a non-standard header. However, non-standard headers may serve some very important functions. Sometimes, standards begin as X- non-standard headers and slowly make their way into global adoption.

After the email headers comes a completely empty line, which marks the border between the headers and the email body. Most emails today are transferred using the multipart/alternative content type. The body of an email is sent in two forms: one is a plain text version that can be read in command-line email clients like mutt, and the other is an HTML version that provides richer formatting in graphical email clients, like a web application or desktop email program. It is up to the email client to decide which one to show.

Binary files are typically encoded in Base64 before they can be attached to an email. Base64 uses 64 different characters to represent data: lowercase and uppercase letters, numbers, and the + and / symbols. Attachments just follow as another block in the multipart encoding of the email body. (Because Base64 encodes every 3 bytes of input as 4 characters, attachments are actually about a third larger than the original files when they are sent with an email.)

Here’s an example RFC822 email that demonstrates everything up to now:

Received: by castle from treehouse for
    <lemongrab@castle>; Tue, 25 Jun ...
Message-ID: <haha+man@treehouse>
Date: Tue, 25 Jun 2013 11:29:00 +0000
Subject: Hey!
Return-Path: Neptr <neptr@mybasement>
From: Finn the Human <fth@treehouse>
To: Lemongrab <lemongrab@castle>
Content-Type: multipart/alternative; boundary=front

--front
Content-Type: text/plain; charset=UTF-8

mhm.. mhmm...
You smell like dog buns.

--front
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

<p>mhm.. mhmm...<br />
You smell like dog buns.</p>

--front--

When your mail server receives an email, it has no way of knowing if the sender is really who he or she claims to be. All the server knows is who is currently delivering the email. To help discourage people from forging these email headers, two infrastructural standards were developed: SPF and DKIM.

Both of these standards depend on special pieces of information published in a domain’s DNS records. SPF (Sender Policy Framework) lets a domain define which IP addresses are permitted to deliver email on its behalf. (This works because, in reality, a TCP connection is usually established directly between the origin and destination servers, without any intermediate servers.) Email from a server that isn’t in a domain’s SPF record is considered suspicious. Gmail displays the result of a successful SPF verification by saying an email was Mailed by: the sending domain.

To see a server’s SPF record, perform the following:

$ nslookup -q=txt tumblr.com
> ...
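
An SPF record is just a TXT record; a typical one looks something like this (the addresses here are illustrative):

v=spf1 ip4:192.0.2.0/24 include:_spf.example.com ~all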

DKIM is public-key cryptography applied to email. A server publishes one or more public keys through its DNS records. Each public key is available through a TXT DNS lookup at selector._domainkey.example.com.

$ nslookup -q=txt 20120113._domainkey.gmail.com
> 20120113._domainkey.gmail.com	text = "k=rsa\; p=MIIB
> ...

The origin server declares which headers are “signed headers” and then applies its signature to both the signed headers and the email body. Gmail displays the result of a successful DKIM verification by saying an email was Signed by: the signing domain.

It is important to note that any Internet-connected computer can act as an origin server for email. However, only servers designated by a DNS MX record can receive email for a domain. When you are designing your own email-enabled web applications, keep these authentication mechanisms in mind, both so that your email comes only from trustworthy SMTP origin servers and so that your emails don’t get classified as spam.
