Formatting source code: What universities don’t teach you

Some programmers write code every day without understanding how their most essential tool works: the text editor. There’s a good reason for this. Most compilers and interpreters don’t care how code is formatted, and unless the language you’re working with is sensitive to file encoding or white space, you may never have cause to investigate further. Here are three things every programmer should know about formatting source code:

Carriage Returns, Line Feeds

Have you ever experienced one of these issues?

  • You open somebody’s source code only to find that everything is on one line.
  • At the end of every line of code, you see a ^M character.

Depending on your operating system, you’ve probably encountered at least one of these. The problem arises from a disagreement about how best to represent the beginning of a new line of text. Classic Macintosh systems used the Carriage Return character (Char 13, 0x0D), commonly written \r, to indicate that two pieces of text were on separate lines (the carriage is the part of a typewriter that moves across the page). This conflicted with the Unix convention of using the Line Feed character (Char 10, 0x0A), written \n and often called the newline character.

Microsoft Windows uses both characters wherever a new line of text begins, a convention it inherited from MS-DOS and, before that, from teletype machines that needed one code to return the carriage and another to advance the paper. The order of the characters is always \r\n, never the reverse.

Newer versions of the Macintosh operating system are Unix-like and no longer use a bare \r, but Windows still maintains its two-character convention. To see the difference this makes, compare the following two hex dumps. This one uses the Unix convention (Linux, OS X, etc.):

00000000  59 6f 75 20 67 75 79 73  20 61 72 65 20 73 6f 20
00000010  67 72 65 61 74 2e 0a 48  61 68 61 0a

And this one uses the Windows convention:

00000000  59 6f 75 20 67 75 79 73  20 61 72 65 20 73 6f 20
00000010  67 72 65 61 74 2e 0d 0a  48 61 68 61 0d 0a
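If you want to check which convention a file uses without reading a hex dump, counting the byte patterns is enough. Here is a minimal Python sketch; the function name count_line_endings is made up for this illustration, and the two byte strings are the same text as the dumps above.

```python
def count_line_endings(data: bytes) -> dict:
    """Count CRLF, bare LF, and bare CR line endings in raw bytes."""
    crlf = data.count(b"\r\n")
    lf = data.count(b"\n") - crlf  # LFs that are not part of a CRLF pair
    cr = data.count(b"\r") - crlf  # CRs that are not part of a CRLF pair
    return {"crlf": crlf, "lf": lf, "cr": cr}


unix_text = b"You guys are so great.\nHaha\n"
windows_text = b"You guys are so great.\r\nHaha\r\n"

print(count_line_endings(unix_text))     # {'crlf': 0, 'lf': 2, 'cr': 0}
print(count_line_endings(windows_text))  # {'crlf': 2, 'lf': 0, 'cr': 0}
```

Note that the data is handled as bytes on purpose. Opening a file in Python's text mode translates line endings by default, which would hide exactly the difference you are trying to see.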

The ^M character that you may have seen is caret notation for a control character. Carriage Return’s 7-bit binary representation is 000 1101 (0x0D), while capital M’s is 100 1101 (0x4D); the two differ only in their most significant bit!
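The relationship between 0x0D and 0x4D is a single bit flip, so it can be computed directly. This is a small Python sketch of caret notation for ASCII control characters; caret_notation is a name invented for this example, not a standard library function.

```python
def caret_notation(code: int) -> str:
    """Return the caret notation (such as ^M) for an ASCII control code."""
    if not (0 <= code <= 0x1F or code == 0x7F):
        raise ValueError("not an ASCII control character")
    # Flipping bit 6 (0x40) maps each control code onto a printable character.
    return "^" + chr(code ^ 0x40)


print(format(0x0D, "07b"), format(ord("M"), "07b"))  # 0001101 1001101
print(caret_notation(0x0D))  # ^M  (Carriage Return)
print(caret_notation(0x0A))  # ^J  (Line Feed)
print(caret_notation(0x09))  # ^I  (Horizontal Tab)
```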

Since the Carriage Return convention is no longer popular and the CRLF (Windows) convention is only used by one major, modern operating system, you should stick to the Unix convention when formatting your source code unless you’re implementing an Internet standard that specifies CRLF. In Vim, you can check your line-ending convention with :set fileformat or change it with :set fileformat=unix.
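Outside of Vim, converting to the Unix convention is a two-step byte replacement. The following Python sketch assumes the whole file fits in memory; to_unix_newlines is a hypothetical helper name.

```python
def to_unix_newlines(data: bytes) -> bytes:
    """Convert CRLF and bare CR line endings to LF."""
    # Replace CRLF first so that its CR is not converted a second time.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


print(to_unix_newlines(b"one\r\ntwo\rthree\n"))  # b'one\ntwo\nthree\n'
```

Only run something like this on text files. In a binary file, the bytes 0x0D and 0x0A are just data, and rewriting them will corrupt it.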

No doubt you’ve seen this in a version control diff:

\ No newline at end of file

By convention, a text file ends with a newline after its last line. Some text editors don’t follow this convention. Others will even add the newline upon opening a file, before you make any changes. In Vim, you can toggle this behavior with :set noeol binary or :set eol nobinary.
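Checking for the final newline yourself takes only a few lines. In this Python sketch, both function names are made up for illustration, and treating an empty file as already correct is a choice made for this example, not part of any convention described here.

```python
def ends_with_newline(data: bytes) -> bool:
    """Return True if the data is empty or ends with a Line Feed."""
    return data == b"" or data.endswith(b"\n")


def ensure_final_newline(data: bytes) -> bytes:
    """Append a Line Feed if the last line is missing one."""
    return data if ends_with_newline(data) else data + b"\n"


print(ends_with_newline(b"last line"))     # False
print(ensure_final_newline(b"last line"))  # b'last line\n'
```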

Tabs vs Spaces

Hitting the Tab key usually inserts one horizontal tab character (\t, Char 9, 0x09), but this isn’t always the case. Sometimes your text editor will insert 4 spaces, 2 spaces, or 8 spaces. Furthermore, your text editor might display a tab character as 2, 4, 8, or any number of columns. While liberal application of tabs and spaces can improve the readability of your code, it is all for naught if your reader’s spaces-per-tab ratio is different from yours.
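To see how much the display width matters, you can expand the same tab-indented line at a few different widths. This short Python sketch uses the built-in str.expandtabs; the sample line is invented for the demonstration.

```python
line = "\tx = 1\t# set x"

# The same bytes, rendered the way editors with different tab widths show them.
for width in (2, 4, 8):
    print(repr(line.expandtabs(width)))
```

A tab advances to the next tab stop, not by a fixed number of columns, which is why the comment lands in a different place relative to the code at each width.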

When you edit existing source code, you should make sure to follow the spacing convention of the existing code. If tabs are used, then use tabs. If spaces are used, use spaces. If they are mixed, complain to the author.

In Vim, there is a plugin called vim-sleuth that will detect the spacing style of existing source code and try to match it. It sets several important Vim settings:

  • :set tabstop (ts) – Your spaces-per-tab ratio.
  • :set expandtab (et) – Whether or not spaces should be used instead of tabs.
  • :set shiftwidth (sw) – How many columns each level of indentation shifts, as used by >>, <<, and automatic indentation (and by the Tab key at the start of a line when smarttab is on). If shiftwidth is not a multiple of tabstop and expandtab is off, then spaces will be used to satisfy the remainder.

Occasionally, it may be useful to adjust these yourself in order to more easily read code that has been formatted with both tabs and spaces (ಠ_ಠ). For your own source code, you should use either spaces or tabs exclusively, and never both.
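If you just want to know which style a file uses before you start editing it, a rough check is easy to write. The Python sketch below only looks at leading white space and is not how any particular editor plugin does its detection; indentation_style is a hypothetical name.

```python
def indentation_style(source: str) -> str:
    """Classify leading white space as 'tabs', 'spaces', 'mixed', or 'none'."""
    uses_tabs = uses_spaces = False
    for line in source.splitlines():
        # Everything before the first character that is not a space or tab.
        indent = line[: len(line) - len(line.lstrip(" \t"))]
        uses_tabs = uses_tabs or "\t" in indent
        uses_spaces = uses_spaces or " " in indent
    if uses_tabs and uses_spaces:
        return "mixed"
    if uses_tabs:
        return "tabs"
    if uses_spaces:
        return "spaces"
    return "none"


print(indentation_style("if x:\n\ty = 1\n"))                # tabs
print(indentation_style("if x:\n    y = 1\n"))              # spaces
print(indentation_style("if x:\n\ty = 1\n    z = 2\n"))     # mixed
```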

File Encoding

If this hasn’t happened to you before, you’ve probably at least seen it around the Internet: apostrophes like the one in Roger’s get turned into funky stuff like Rogerâ€™s. File encodings are exactly why this happens.

The apostrophe used in the example above isn’t the ASCII APOSTROPHE, Char 0x27, but the Unicode character RIGHT SINGLE QUOTATION MARK, U+2019. Unicode is a character set that extends far beyond ASCII and provides support for things like Chinese characters and happy faces. Its encodings use different sequences of bits to express different characters.

The only reason why Roger is still readable in the above example is the fact that ASCII only uses 7 bits per character. Obviously, representing thousands of Chinese characters in only 7 bits is impossible, and switching to 16 bits per character would double the size of any ASCII text file. This is why encodings like UTF-8 are variable-length: they use byte values in the 0x80 to 0xFF range, which ASCII leaves unused, to build multi-byte characters, including lead bytes that indicate how many bytes a certain character takes up. The important part is, plain ASCII is always valid in these other encodings.
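You can confirm the ASCII compatibility claim for the encodings discussed in this article with a quick Python check. The list of encodings here is my own choice for illustration (cp1252 is Python's name for Windows-1252); it is not an exhaustive survey.

```python
text = "Roger"
encodings = ("ascii", "utf-8", "latin-1", "cp1252")

# Plain ASCII text produces identical bytes in all four encodings.
encoded = {name: text.encode(name) for name in encodings}
for name, raw in encoded.items():
    print(name, raw.hex())

# Every byte of a multi-byte UTF-8 character sits outside the ASCII range.
quote_bytes = "\u2019".encode("utf-8")
print(all(byte >= 0x80 for byte in quote_bytes))  # True
```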

The issue arises when extended character set encodings are mixed and the programmer forgets to specify which is being used. So, back to the example above. That special quotation mark is a valid Unicode character. Here is the hexdump of its UTF-8 encoding:

00000000  e2 80 99

In binary, it’s:

11100010 10000000 10011001

UTF-8’s variable-length encoding works by indicating the length of the character using the position of the first 0 in the binary encoding of its first byte. Every subsequent byte of the character (a continuation byte) begins with 0b10. In this case, the first 0 follows 3 1s, which indicates that this character is composed of 3 bytes. If you add up the remaining bits (4 in the first byte, 6 in the second byte, and 6 in the third byte), you see that they add up to 16 data bits:

1110.... 10...... 10...... (Control bits)
....0010 ..000000 ..011001 (Data bits)

Reassemble those data bits and you get 00100000 00011001 or 0x2019, the hexadecimal character code I mentioned before!
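The same reassembly can be written out with bit masks. This Python sketch handles only the 3-byte case walked through above and skips the extra validity checks a real decoder performs (such as rejecting overlong encodings); decode_three_byte_utf8 is a name invented for this example.

```python
def decode_three_byte_utf8(seq: bytes) -> int:
    """Return the code point encoded by one 3-byte UTF-8 sequence."""
    if len(seq) != 3:
        raise ValueError("expected exactly 3 bytes")
    lead, cont1, cont2 = seq
    if lead >> 4 != 0b1110:
        raise ValueError("lead byte does not start with 1110")
    if cont1 >> 6 != 0b10 or cont2 >> 6 != 0b10:
        raise ValueError("continuation bytes must start with 10")
    # Keep 4 data bits from the lead byte and 6 from each continuation byte.
    return ((lead & 0x0F) << 12) | ((cont1 & 0x3F) << 6) | (cont2 & 0x3F)


code_point = decode_three_byte_utf8(bytes([0xE2, 0x80, 0x99]))
print(hex(code_point))             # 0x2019
print(format(code_point, "016b"))  # 0010000000011001
```

In practice you would simply call bytes.decode("utf-8"), which gives the same character for this input.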

Many text editors now interpret source code as UTF-8 by default, and only fall back to the ISO-8859-1 standard when they encounter an illegal UTF-8 byte sequence. However, web browsers often do the opposite. Depending on the HTTP Content-Type header and doctype, a web browser may sometimes interpret a document as ISO-8859-1 by default when no character set is explicitly specified (though you should always specify one, either in the charset parameter of the Content-Type header, or better yet, in the HTML itself as a meta tag). The same bits above, when interpreted as an ISO-8859-1 sequence the way browsers do, produce:

â€™

ISO-8859-1 is also known as latin-1. Each character is 1 byte long, and there are no multi-byte characters. Browsers treat the ISO-8859-1 label as Windows-1252, a superset that puts printable characters such as € and ™ in the 0x80 to 0x9F range, where strict ISO-8859-1 has invisible control codes. The Windows-1252 encoding of those three characters is e2 80 99, exactly the same as the UTF-8 encoding of the quotation mark from before.
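You can reproduce the whole mix-up in Python by encoding with one codec and decoding with another. This sketch uses cp1252, Python's name for Windows-1252, because that is what browsers actually apply to content labeled ISO-8859-1; strict latin-1 turns the last two bytes into invisible control characters instead. The variable names are my own.

```python
original = "Roger\u2019s"  # Roger + RIGHT SINGLE QUOTATION MARK + s
raw = original.encode("utf-8")
print(raw.hex())  # 526f676572e2809973

# Decoding the UTF-8 bytes with the wrong codec produces the garbled text.
garbled = raw.decode("cp1252")
print(garbled == "Roger\u00e2\u20ac\u2122s")  # True

# Strict ISO-8859-1 maps 0x80 and 0x99 to control characters instead.
print(repr(raw.decode("latin-1")))

# As long as no bytes were lost, reversing the two steps repairs the text.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired == original)  # True
```

The repair step only works when every garbled character still maps back to its original byte. Once a tool has replaced an unknown byte with a question mark or dropped it, the original text cannot be recovered this way.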