Code.RogerHub: The programming blog at RogerHub

Writing essays in VIM and LaTeX
https://rogerhub.com/~r/code.rogerhub/infrastructure/562/writing-essays-in-vim-and-latex/
Fri, 27 Mar 2015 23:03:45 +0000

I’m only taking one humanities course this semester, so I don’t have that many essays to write. But for the few assignments in that course, I’m planning to use VIM and LaTeX to write them. I have Pages.app installed on my laptop, and I already use Google Docs for some things, but I’m still more comfortable writing in VIM, even when I’m not writing code.

Smart quotes

LaTeX handles smart quotes like you’d expect it to. If there are single or double quotes in the text of your document, they’ll be turned into proper left and right quotes in the final PDF.

Spell checking

I just use VIM’s built-in spell-checking feature. Make sure you use it with MacVim or GVIM, or else the highlighting looks terrible. There’s also a built-in way to bring up the list of correction suggestions: put the cursor on a misspelled word and press z=.
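
A quick sketch of how that looks in practice (the file name is just an example):

# open the document with spell checking enabled for US English
vim -c 'setlocal spell spelllang=en_us' essay.tex
# inside vim: ]s jumps to the next misspelled word, z= shows suggestions,
# and zg adds the word under the cursor to the spell file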

Rich formatting

You can do bold, italics, and lists in LaTeX. I like it even better than WYSIWYG editors, since you can use line manipulation and fancy macros in VIM too.

Instant feedback

You can combine inotify-tools with your LaTeX compiler to get essentially instant feedback. Every time you save the file, it re-compiles your LaTeX, and Preview.app will reload the PDF. I’m just sharing my home folder through VirtualBox shared folders (configured in Vagrant), so I can work with the files directly on the VM.
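
Here’s a minimal watch loop, assuming inotify-tools and pdflatex are installed (the file name is again just an example):

# rebuild the PDF every time the file is written to
while inotifywait -e close_write essay.tex; do
  pdflatex -interaction=nonstopmode essay.tex
done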

MongoDB power usage on laptop
https://rogerhub.com/~r/code.rogerhub/infrastructure/558/mongodb-power-usage-on-laptop/
Fri, 27 Mar 2015 22:51:55 +0000

I’m setting up a Chef-based Ubuntu VM on my laptop, so I can get sandboxing and all the usual Ubuntu tools. Most things work on OS X too, but it’s a second-class citizen compared to poster child Ubuntu. I have Mongo and a few other services set up there for school stuff, and I noticed that Mongo regularly uses a significant amount of CPU time, even when it isn’t handling any requests.

I attached strace to it with strace -f -p 1056, and here’s what I saw:

[pid 1118] select(12, [11], NULL, NULL, {0, 10000} <unfinished ...>
[pid 1056] select(11, [9 10], NULL, NULL, {0, 10000} <unfinished ...>
[pid 1118] <... select resumed> ) = 0 (Timeout)
[pid 1056] <... select resumed> ) = 0 (Timeout)
[pid 1118] select(12, [11], NULL, NULL, {0, 10000} <unfinished ...>
[pid 1056] select(11, [9 10], NULL, NULL, {0, 10000} <unfinished ...>
[pid 1112] <... nanosleep resumed> 0x7fd3895319a0) = 0
[pid 1112] nanosleep({0, 34000000}, <unfinished ...>
[pid 1118] <... select resumed> ) = 0 (Timeout)
[pid 1056] <... select resumed> ) = 0 (Timeout)
[pid 1118] select(12, [11], NULL, NULL, {0, 10000} <unfinished ...>
[pid 1056] select(11, [9 10], NULL, NULL, {0, 10000} <unfinished ...>
[pid 1118] <... select resumed> ) = 0 (Timeout)
[pid 1056] <... select resumed> ) = 0 (Timeout)
[pid 1118] select(12, [11], NULL, NULL, {0, 10000} <unfinished ...>
[pid 1056] select(11, [9 10], NULL, NULL, {0, 10000} <unfinished ...>
[pid 1118] <... select resumed> ) = 0 (Timeout)
[pid 1056] <... select resumed> ) = 0 (Timeout)
[pid 1118] select(12, [11], NULL, NULL, {0, 10000} <unfinished ...>
[pid 1056] select(11, [9 10], NULL, NULL, {0, 10000} <unfinished ...>
[pid 1112] <... nanosleep resumed> 0x7fd3895319a0) = 0

I looked online for an explanation and found this bug report about MongoDB power usage. Apparently, Mongo uses select() with short timeouts as a timekeeping mechanism everywhere it needs wall-clock time. Somebody proposed alternative methods that don’t require this kind of tight looping, but it looks like the Mongo team is busy with other things right now and won’t consider the patch.

So I added a clause to my Chef config to keep the MongoDB service from starting automatically on boot. I needed to specify the Upstart service provider explicitly in the configuration, since the mongodb package installs both a SysV init script and an Upstart job.
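
Outside of Chef, the equivalent by hand on an Upstart-based Ubuntu looks roughly like this (a sketch; mongodb is the service name the Ubuntu package uses):

# tell Upstart not to start the job automatically
echo manual | sudo tee /etc/init/mongodb.override
# also disable the SysV init script that the package installs
sudo update-rc.d mongodb disable
# stop the instance that is currently running
sudo service mongodb stop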

Comma mode for vim
https://rogerhub.com/~r/code.rogerhub/vim/555/comma-mode-for-vim/
Sun, 05 Oct 2014 09:45:48 +0000

Most keyboards have a Control and a Meta key. Macs also have the Command key. But even with all the combinations of Control, Meta, and Command with the letters and numbers, it’s sometimes hard to find enough room for keyboard shortcuts that don’t conflict with or override each other. This is especially true in vim, where keyboard shortcuts are everything. Most of the good ones are taken already.

My keybindings

I adapted an idea from tinymode.vim and created another key modifier: the comma. Here are some of the keybindings I’ve set:

  • ,v creates a vertical split
  • ,t opens a new tab
  • ,1 toggles line wrapping
  • ,2 opens NERDTree
  • ,0 enables spellcheck and textwidth
  • ,q quits the current buffer
  • ,b opens CtrlP in buffer mode
  • ,c opens the cd prompt
  • ,f opens the filetype prompt
  • ,U toggles line numbers

The comma works just like Control or Meta. You can hold it down and press multiple keyboard shortcuts. For example, multiple q’s can be chained to close multiple buffers in succession. Or t and v can be chained to create a new 2-pane tab.

Configuration

These six lines form the core of comma mode:

nmap , :set timeoutlen=86400000<CR><SID>ldr
vmap , :set timeoutlen=86400000<CR><SID>ldr
nn <script> <SID>ldr, <SID>ldr
vn <script> <SID>ldr, <SID>ldr
nmap <SID>ldr :set timeoutlen=1000<CR>
vmap <SID>ldr :set timeoutlen=1000<CR>

Pressing the comma key will 1) set the key combination delay to an absurdly long time, and 2) set up the <SID>ldr keyword for the following comma command. Vim provides <SID> as a script-local unique identifier to help you avoid naming conflicts.

Chainable commands

Once comma mode is set up, you can add chainable commands. These commands can be chained together with other comma commands in succession.

nn <script> <SID>ldr1 :set wrap!<CR><SID>ldr
vn <script> <SID>ldr1 :set wrap!<CR><SID>ldr

Everything after the first <SID>ldr is the comma keybinding. In this case, just the number 1. An arbitrary command is run, and then the line ends with <SID>ldr to set up chaining for the next command.

Non-chainable commands

You can also add non-chainable commands. These are useful for comma shortcuts that open prompts, which you would never use with other comma shortcuts.

nn <script> <SID>ldrc :set timeoutlen=1000<CR>:cd 
vn <script> <SID>ldrc :set timeoutlen=1000<CR>:cd

There’s an invisible space at the end of these two lines, so the prompt is ready to accept a path. Non-chainable commands reset the timeoutlen to a default 1000. This is the same behavior as the null binding (the last two lines of the Configuration section), which runs when you run a non-existent comma command, or press ESC in comma mode.

Conclusion

Comma mode has been the #1 most productivity-boosting feature I’ve added to vim. Mapping semicolon to colon is a start, but it only takes :q from 4 keystrokes down to 3 (including the Enter key). With comma mode, it’s just 2 keystrokes, which is the way keyboard shortcuts should be.

Taking advantage of cloud VM-driven development
https://rogerhub.com/~r/code.rogerhub/infrastructure/538/taking-advantage-of-cloud-vm-driven-development/
Thu, 10 Jul 2014 06:39:48 +0000

Most people write about cloud computing as it relates to their service infrastructure. It’s exciting to hear about how Netflix and Dropbox et al. use AWS to support their operations, but those large-scale ideas don’t mean much for the average developer. Most people don’t have the budget or the need for enormous computing power or highly available hosted data storage like S3, but it turns out that cloud-based VMs can be highly useful to the average developer in a different way.

Sometimes, you see something like this one-liner installation script for Heroku Toolbelt, and you just get nervous:

wget -qO- https://toolbelt.heroku.com/install-ubuntu.sh | sh

Not only are they asking you to run a shell script downloaded over the Internet, but the script also asks for root privileges to install packages. Or maybe you’re reading a blog post about some HA database clustering software and you want to try it out yourself, but running 3 virtual machines on your puny laptop is out of the question.

To get around this issue, I’ve been using DigitalOcean machines for when I want to test something out but don’t want to go to the trouble of maintaining a development server or virtual machines. Virtualized cloud servers are great for this because:

  • They’re DIRT CHEAP. DO’s smallest machine costs $0.007 an hour. Even if you use it for 2 hours, it rounds down to 1 cent.
  • The internet connection is usually a lot better than whatever you’re using. Plus, most cloud providers have local mirrors for package manager stuff, which makes installing packages super fast.
  • Burstable CPU means that you can get an unfair amount of processing power for a short time at the beginning, which comes in handy for initially installing and downloading all the stuff you’ll want to have on your machine.

I use the tugboat client (a Ruby CLI app) to interface with the DigitalOcean API. To try out MariaDB Galera clustering, I just opened up three terminals and had three SSH sessions going. For source builds that have a dozen or more miscellaneous dependencies, I usually prepare a simple build script that I can upload and run on a cloud VM whenever I need it. When I’m done with a machine, I’ll shut it down until I need one again a few days later.
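
A typical tugboat session looks roughly like this (from memory, so double-check against tugboat help; the droplet name is arbitrary, and tugboat prompts for API credentials, sizes, and regions during setup):

# one-time setup: install the gem and store DigitalOcean credentials
gem install tugboat
tugboat authorize
# spin up a droplet, work on it over SSH, then tear it down
tugboat create scratchbox
tugboat ssh scratchbox
tugboat halt scratchbox
tugboat destroy scratchbox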

Running development virtual machines in the cloud might not drastically change your workflow, but it opens up a lot of opportunities for experimentation and gives you access to serious computing resources when you want them. So load a couple of dollars onto a fresh DigitalOcean account and boot up some VMs!

Getting a python-like shell for Perl
https://rogerhub.com/~r/code.rogerhub/infrastructure/541/getting-a-python-like-shell-for-perl/
Wed, 26 Mar 2014 22:01:48 +0000

When learning a new language, I find it really helpful to be able to test out new syntax and language constructs without going to the trouble of creating a file and all the boilerplate that comes with it. Perl doesn’t have this capability built in, as far as I know, but there’s a great CPAN module, Perl::Shell, that claims to do exactly that. It’s a short little module (about 100 SLOC), and it gives you a familiar prompt:

Perl 5.14.2 (Tue Feb  4 23:09:53 UTC 2014) [linux panlong 2.6.42-37-generic #58-ubuntu smp thu jan 24 15:28:10 utc 2013 x86_64 x86_64 x86_64 gnulinux ]
Type "help;", "copyright;", or "license;" for more information.
>>> 

It’s not very obvious how to set it up, though. I installed the module and its dependencies via cpanm, and then created this little snippet in a file called perlthon.pl:

#!/usr/bin/perl
use 5.010;
use strict;
use warnings;

use Perl::Shell;
Perl::Shell::shell();

Then, the shell started right up.
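
For reference, the whole setup comes down to two commands (assuming cpanminus is already installed):

# install Perl::Shell from CPAN, then launch the REPL wrapper above
cpanm Perl::Shell
perl perlthon.pl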

Exporting comments from Facebook Comments to Disqus
https://rogerhub.com/~r/code.rogerhub/programming/528/exporting-comments-from-facebook-comments-to-disqus/
Sun, 16 Mar 2014 05:37:32 +0000

About 14 months ago, The Daily Californian switched from Disqus to Facebook Comments in order to clean up the comments section of the website. But recently the decision was reversed, and it was decided that Disqus would make a comeback. I worked with one of my Online Developers to facilitate the process.

One of the stipulations of the Facebook-to-Disqus transition was that the Online Team would transfer a year’s worth of Facebook comments over to Disqus so that they would remain on the website. At the time of writing, Facebook doesn’t natively support any kind of data export for its commenting platform, and I don’t expect that it ever will. However, it provides a reliable API to its comments database. With a bit of ingenuity and persistence, we were able to import over 99% of the existing Facebook comments into Disqus.

Overview of the process

Disqus supports imports in the form of custom WXR files. These are WordPress-compatible XML files that contain metadata about posts (title, date, preview, ID, etc.) and comments (name, date, IP address, content, etc.).

The Daily Cal uses WordPress and the official Disqus WordPress plugin. The plugin identifies threads with a combination of WordPress’s internal post ID and a short permalink. Thread identifiers look like this:

var disqus_identifier = '528 https://rogerhub.com/~r/code.rogerhub/?p=528';

This one is taken right from the source code of this post (you can see for yourself).

Facebook, on the other hand, identifies threads by the content page’s URL. After all, their system was created for arbitrary content, not just blogs. The Facebook Open Graph API provides a good amount of information about comments: enough to identify multiple comments posted by a single user, along with accurate timestamps and reply relationships. There isn’t any personal information like IP addresses, but names are provided.

The overall process looked like this:

  1. On dailycal.org, we needed an API endpoint to grab page URLs along with other information about threads on the site.
  2. For each of these URLs, we check Facebook’s Open Graph API for comments that were posted on that URL. If there are any, then we put them into our local database.
  3. After we are done processing comments for all of the articles ever published on dailycal.org, we can export them to Disqus-compatible WXR and upload them to Disqus.

This seems like a pretty straightforward data hacking project, but there were a couple of issues that we ran into.

Nitty-gritty details

The primary API endpoint for Facebook’s Comment API is graph.facebook.com/comments/. This takes a couple of GET parameters:

  • limit — A maximum number of comments to return
  • ids — A comma-delimited list of article URLs
  • fields — For getting comment replies

The API supports pagination with cursors (next/prev URLs), but to keep things simple, we just hardcoded a limit parameter of 9999. By default, the endpoint returns only top-level comments. To get comment replies, you need to add a fields=comments parameter to the request.

You can make a few sample requests to get a feel for the structure of the JSON data returned.
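
For example, fetching the comments for a single article looks something like this (the article URL is made up, and depending on the API version you may also need an access token):

# -G turns the -d/--data-urlencode parameters into a GET query string
curl -G 'https://graph.facebook.com/comments/' \
     --data-urlencode 'ids=http://www.dailycal.org/sample-article/' \
     -d 'limit=9999' -d 'fields=comments'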

Disqus supports a well-documented XML-based import format. In our case, we decided that names were sufficient identification, although Disqus will support some social login tagging in the import format as well. The format specifies a place for the article content, which is an odd request, since article content is usually quite massive. We decided to supply just an excerpt of the article content rather than the entire page.

There were a few more precautions we took before we started development. In order to lower suspicion around our scraping activity on Facebook as well as our own web application firewall (WAF), we grabbed the user agent of a typical Google Chrome client running on Windows 7 x86-64, and used that for all of our requests. We also created a couple of test import “sites” on Disqus, since the final destination of our generated WXR was the Disqus site that we used a year ago before the switch to Facebook. There isn’t any way to copy comments or clone a Disqus site, so we didn’t want to make any mistakes.

Unicode support and escape sequences

The first version of our program had terrible support for any kind of non-ASCII character. It’s not that our commenters were all typing in Arabic or something (although we actually did have a couple of comments in Arabic). Smart quotes show up in plain English text, and they broke the process just the same.

Facebook’s API spits out JSON data and uses JSON’s string encoding. For example, the left double quotation mark, otherwise known as &ldquo; in HTML/XML, is encoded as \u201c using JSON’s unicode escape syntax. However, the JSON data also contains HTML-encoded entities like &amp;. (Update: It appears that Facebook has corrected this issue.)

Python’s JSON library takes care of JSON escape sequences as it decodes the string into a native dictionary. On top of that, the script applies HTML entity decoding to the result, in case any straggling escape sequences are left over. Since Disqus’s WXR format suggests that you put the comment content inside a CDATA block, the only thing you need to escape is the CDATA terminator, ]]>. You can do this by splitting it across two CDATA sections (e.g. ]]]><![CDATA[]>).
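
The substitution itself is tiny. Our script did it in Python, but expressed as a sed one-liner over a file of comment text (file name is just an example), it would be:

# split every CDATA terminator "]]>" across two CDATA sections
sed 's/]]>/]]]><![CDATA[]>/g' comments.txt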

HTTP exceptions

Our API endpoint would time out or throw an error every once in a while. To make the scraper more robust, we set up handlers for HTTP error responses. The scraper would retry an API request at least 5 times before giving up. If none of the attempts succeed, the URL is logged to the database for further debugging.
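
Our retry logic lived in the Python scraper, but the same idea is easy to demonstrate from the shell, since curl has retry flags built in (the URL here is a placeholder):

# retry transient failures up to 5 times, waiting 2 seconds between attempts
curl --fail --retry 5 --retry-delay 2 "$THREADS_ENDPOINT_URL"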

Commenter email addresses

Every commenter listed in the WXR needs an email address. Comments with the same email address are tied together, and if somebody ever registers a Disqus account with that email address, they can claim the comments as their own (and edit or delete them). Facebook provides a unique ID that associates multiple comments by the same person, but since Facebook Comments also allows AOL and Yahoo! logins, not every comment has such an ID. Our script used the Facebook-provided ID when it was present and generated a random one outside of Facebook’s typical range when it wasn’t. All of the email addresses ended with @dailycal.org, which meant that we would retain control over registration verification in case we ever needed it.

Edge cases

Disqus requires that comments be at least 2 characters long. There were a couple of Facebook comments that consisted of just a single character: “B” or “G” or the like. These had to be filtered out before the XML export step.

We also ran into a case where a visitor commented “B<newline>” on the Facebook comments. For Disqus, this still counts as one character, since the CDATA body is stripped of leading and trailing whitespace before processing. The first version of our script didn’t strip whitespace before checking the length, so it failed to filter out this erroneous comment.

Timezone issues

After a couple of successful trials, we created a test site in Disqus and imported a sample of the generated WXR. Everything looked good until we went and cross-referenced Disqus’s data with the Facebook comments that were displayed on the site. The comment times appeared to be around 5 hours off!

Here in California, we’re at GMT-0800, so there was no obvious explanation for why the comment times were shifted. The WXR specified GMT times, which we verified were correct; the bug seemed to be on Disqus’s end. We contacted Disqus support and posted on their developer mailing list, but after about a week, we decided it would be easiest to just counteract the shift with an offset in the opposite direction.

import datetime
time_offset = datetime.timedelta(0, -5*60*60)
comment_date = (datetime.datetime.strptime(
    comment['created_time'], "%Y-%m-%dT%H:%M:%S+0000"
    ) + time_offset).strftime("%Y-%m-%d %H:%M:%S")

Conclusion

After a dozen successful test runs, we pulled the trigger and unloaded the WXR onto the live Disqus site. The import process finished within 5 minutes, and everything worked without a hitch.

How to write fast-rendering HTML and CSS
https://rogerhub.com/~r/code.rogerhub/programming/521/how-to-write-fast-rendering-html-and-css/
Sun, 24 Nov 2013 06:04:41 +0000

RogerHub’s most popular page contains over 50,000 HTML nodes. It’s no easy task for a browser to chew through all of that HTML and render it quickly. Still, most modern browsers, including mobile and tablet browsers, can now render that particular page without any performance issues. When it comes to website performance, most people are concerned with app server response time or JavaScript performance. There really isn’t a lot you can do to make gigantic pages load more quickly, but here are some tricks I learned along the way.

Cut down on the CSS3

Drop shadows and gradients look great when you use them correctly, but when your document is 600,000px tall, they create serious rendering lag no matter how modern your hardware is. You get the fastest rendering performance with simple solid colors. Aesthetics are a small sacrifice when you’re trying to squeeze more speed out of your markup.

Hide unnecessary things

I keep all of my calculator’s comments on a single page because 1) that’s the way it has always been, and 2) SEO. However, many users never even look at those comments. Browser performance improves if you simply instruct the browser not to display most of the markup until it is requested, which brings me to the next point.

Limit the number of tree children

Rather than applying a universal .comment-hidden class to all older hidden comments, put hidden elements under a single unifying parent and apply styles to the parent instead. It’s much faster to style one parent element than a thousand children.

Fake it

The comments on RogerHub no longer support Gravatar-based photos, since literally hundreds of unique avatars were being requested and rendered on each page load. Because I didn’t want to take the images out entirely, everybody now gets a generic anonymous avatar. Much of the original comment markup has also been removed, leaving only the bare necessities required to apply styles properly.

Use simple, non-nested classes

I don’t have any hard data behind this practice, but intuitively I feel that simple unnested classes are the fastest way to go about styling an arbitrary number of elements. Avoid tag selectors, deeply nested selectors, wildcard selectors, attribute/value selectors, etc.

Archiving and compressing a dynamic web application
https://rogerhub.com/~r/code.rogerhub/programming/508/archiving-and-compressing-a-dynamic-web-application/
Sun, 01 Sep 2013 05:04:22 +0000

From 1999 to mid-2011, the Daily Cal used an in-house CMS to run its website. It contains around 65,000 individual articles and thousands of images. But ever since we moved to WordPress, the old system has been collecting dust in the corner. It was about time that all of the old CMS content was archived as static HTML, so that it can be served indefinitely as server software evolves. To accomplish this, I set up a Linux virtual machine with my trusty vagrant up utility on my spare home server.

Retrieving the application data

The CMS’s production server did not have enough free disk space to create a tarball copy of the application. But with on the order of 10,000 files involved, a recursive copy via scp would have been too slow. To speed up the transfer, I applied a little GNU/Linux philosophy and piped a few things together:

$ screen
$ ssh root@domain "tar -c /srv/archive" | tar xvf -

I decided that compression was not going to be very helpful, because most of the data was JPEG and PNG images, which are not very compressible. Enabling compression would just slow things down, since the bottleneck would become the CPU rather than the network.

Preparing the application

The CMS had very few dependencies, which is not surprising given the state of PHP 10 years ago. I set up a simple nginx+php-fpm+mysql configuration with a single PHP worker thread. The crawl operation would be executed in serial anyway, so multiple workers would not be useful.

I also added an entry to the VM’s hosts file for the production server’s hostname. The hostname was hardcoded in a few places, and I didn’t want the crawl operation sending requests out to the Internet. Additionally, I set up a secondary web server configuration that served the generated HTML from the output directory and static assets from the application data, so that I could preview the results as they were being generated.

Crawling the application

Generating the static pages was the hardest part of the archival process, and it took me around 5 retries to get it right. The primary purpose of the whole operation was to remove the dynamic portions of the web application, which meant that every conceivable page request had to be run against the application and its output saved as a static file. I picked wget as my archival tool.

Articles in the CMS were stored in one of two places. However, the format of the URL was luckily the same for both. I dumped article IDs from the MySQL database and created a seed file of article URLs:

$ echo "SELECT article_id FROM dailycal.article;" | mysql -u root > article_ids
$ echo "SELECT id FROM dailycal.h_articles;" | mysql -u root  >> article_ids
$ sed -e 's/^/http:\/\/archive.dailycal.org\/article.php?id=/' -i article_ids

I didn’t see much point in setting a root password for the local MySQL installation, since this was a single-use VM anyway. Sed ate through the 65,326 article IDs in seconds. I then created a second seed file containing just the URL of the application root, from where (nearly) all other pages would be crawlable.

On the crawler’s final run, I set the following command line switches (the assembled command is shown after the lists below):

  • --trust-server-names – Sets the output file name according to the last request in a redirection chain. By default, wget uses the first request.
  • --append-output=generator.log – Outputs progress information to a file, so that I can run the main process in screen and monitor it with tail in follow mode.
  • --input-file=source.txt – Specifies the seed file of URLs.
  • --timestamping – Sets the file modification time according to HTTP headers, if available.
  • --no-host-directories – Disables the creation of per-host directories.
  • --default-page=index.php – Defines the name for request paths that end in a slash.
  • --html-extension – Appends the html file extension to all generated pages, even if another extension already exists.
  • --domains archive.dailycal.org – Restricts the crawl to only the application domain.

Additionally, I set the following switches to crawl through the links on the application’s root page.

  • -r – Enables recursive downloading.
  • --mirror – Sets infinite recursion depth and other useful options.
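
Assembled, the switches above amount to something like this for the article pass (the root-page pass used the second seed file and added -r and --mirror on top):

wget --trust-server-names --append-output=generator.log \
     --input-file=source.txt --timestamping --no-host-directories \
     --default-page=index.php --html-extension \
     --domains archive.dailycal.org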

In total, 543MB of non-article HTML and 2.1GB of article HTML were generated. These are reasonable sizes, given how many URLs were crawled in total, but they were still a bit unwieldy to store. I looked for a solution.

Serving from a compressed archive

I knew a couple of data facts about the generated HTML:

  1. There was tons of redundancy. The articles shared much of their header and footer markup.
  2. Virtually all of the data consisted of printable characters and whitespace, which means considerably less than 8 bits of unique information per byte.

Both of these factors made the HTML a good candidate for archive compression. My first thought was tar+gzip, but gzip compresses the tar stream as a whole, not individual files. To extract a single file, you’d need to decompress all the data up to that file; a request for the last file in the archive could take 15 to 20 seconds!

Luckily, the zip file format maintains an index of individual files and compresses them individually, which means that single-file extraction is nearly instantaneous no matter where in the archive the file is located. I opted to use fuse-zip, a FUSE filesystem that lets you mount zip files onto the file system. Fully compressed, the 543MB of pages became a 92MB zip archive (83% deflation), and the 2.1GB of articles became a 407MB zip archive (81% deflation).
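
The commands involved are short. A sketch with made-up file and mount point names:

# build the archive from the generated HTML
zip -r articles.zip articles/
# mount it with fuse-zip so the web server can read individual files out of it
mkdir -p /srv/archive/articles
fuse-zip articles.zip /srv/archive/articles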

After everything was finished, I uploaded the newly created HTML archives to a new production server and shut down the old CMS. From there, a decade’s worth of archived articles can be maintained indefinitely for the future.

Self-contained build environments with Vagrant
https://rogerhub.com/~r/code.rogerhub/infrastructure/504/self-contained-build-environments-with-vagrant/
Tue, 20 Aug 2013 00:22:54 +0000

Vagrant is a nifty piece of Ruby software that lets you set up virtual machines with an unparalleled amount of automation. It interfaces with a VM provider like VirtualBox and helps you set up and tear down VMs as you need them. I like it better than Juju because there isn’t as much hand-holding involved, and I like it better than vanilla Puppet because I don’t regularly deploy a thousand VMs at a time. At the Daily Cal, I’ve used Vagrant to help developers set up their own build environments for our software, where they can write code and test features in isolation. I also use it as a general-purpose VM manager on my home file server, so I can build and test server software in a sandbox.

You can run Vagrant on your laptop, but I think a laptop is the wrong piece of hardware for the job. Long-running VM batch jobs and build-environment VMs should run on headless servers, where you don’t have to worry about excessive heat, power consumption, heavy I/O, or keeping your laptop awake so it doesn’t suspend. My server at home is set up with:

  • 1000GB of disk space backed by RAID1
  • Loud 3000RPM fans I bought off a Chinese guy four years ago
  • Repurposed consumer-grade CPU and memory (4GB) from an old desktop PC

You don’t need great hardware to run a couple of Linux VMs. Since my server is basically hidden in a corner, the noisy fans are not a problem and actually do a great job of keeping everything cool under load. RAID mirroring will (I’m hoping) provide high availability, and since the server’s data is easily replaceable, I don’t need to worry about backups. Setting up your own server is usually cheaper than persistent storage on public clouds like AWS, but your mileage may vary.

Vagrant configuration is a single Ruby file named Vagrantfile in the working directory of your vagrant process. My basic Vagrantfile just sets up a virtual machine from Vagrant’s preconfigured Ubuntu 12.04 LTS image. They offer other preconfigured images, but this is the one I’m most familiar with.

# Vagrantfile

Vagrant.configure("2") do |config|
  # Every Vagrant virtual environment requires a box to build off of.
  config.vm.box = "precise32"

  # The url from where the 'config.vm.box' box will be fetched if it
  # doesn't already exist on the user's system.
  config.vm.box_url = "http://files.vagrantup.com/precise32.box"

  config.vm.network :forwarded_port, guest: 8080, host: 8080

  # Enable public network access from the VM. This is required so that
  # the machine can access the Internet and download required packages.
  config.vm.network :public_network

end

For long-running batch jobs, I like keeping a CPU execution cap on my VMs so that they don’t overwork the system. The cap keeps the temperature down and prevents the VM from interfering with other server processes. You can add an execution cap (VirtualBox only) by appending the following before the end of your primary configuration block:

# Adding a CPU Execution Cap

Vagrant.configure("2") do |config|
  ...

  config.vm.provider "virtualbox" do |v|
    v.customize ["modifyvm", :id, "--cpuexecutioncap", "40"]
  end

end

After setting up Vagrant’s configuration, create a new directory containing only the Vagrantfile and run vagrant up to set up the VM. Other useful commands include:

  • vagrant ssh — Opens a shell session to the VM
  • vagrant halt — Halts the VM gracefully (Vagrant will connect via SSH)
  • vagrant status — Checks the current status of the VM
  • vagrant destroy — Destroys the VM

Finally, to set up the build environment automatically every time you create a new Vagrant VM, you can write provisioners. Vagrant supports complex provisioning frameworks like Puppet and Chef, but you can also write a provisioner that’s just a shell script. To do so, add the following inside your Vagrantfile:

Vagrant.configure("2") do |config|
  ...

  config.vm.provision :shell, :path => "bootstrap.sh"
end

Then just stick your provisioner next to your Vagrantfile, and it will execute when the VM is first created (and whenever you run vagrant provision). You can write commands to fetch package lists and upgrade system software, or to install build dependencies and check out source code. By default, Vagrant’s current working directory is also mounted on the VM guest as a folder named /vagrant in the file system root. You can refer to other provisioner dependencies this way.
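
A minimal bootstrap.sh might look something like this (the packages are only examples of what a build environment might need):

#!/bin/sh
# shell provisioners run as root by default
# refresh package lists, then install typical build dependencies
apt-get update
apt-get install -y build-essential git curl
# anything else you need is reachable through the /vagrant shared folder,
# which maps to the directory containing the Vagrantfile on the host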

Vagrant uses a single public/private keypair for all of its default images. The private key can usually be found in your home directory as ~/.vagrant.d/insecure_private_key. You can add it to your ssh-agent and open your own SSH connections to your VM without Vagrant’s help.
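
For example (2222 is VirtualBox’s usual default forwarded SSH port; check vagrant ssh-config if yours differs):

# add Vagrant's shared insecure key to the agent
ssh-add ~/.vagrant.d/insecure_private_key
# connect directly as the vagrant user over the forwarded SSH port
ssh -p 2222 vagrant@127.0.0.1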

Even if you accidentally mess up your Vagrant configuration, you can use VirtualBox’s built-in command-line tools to fix boot configuration issues or ssh daemon issues.

$ VBoxManage list vms
...
$ VBoxManage controlvm <name|uuid> pause|reset|poweroff|etc
...
$ VBoxHeadless -startvm <name|uuid> --vnc
...
(connect via VNC)

The great thing about Vagrant’s VM-provider abstraction layer is that you can grab the VM images from VirtualBox and boot them on another server with VirtualBox installed, without Vagrant at all. Vagrant is an excellent support tool for programmers (and, combined with SSH tunneling, it is great for web developers as well). If you don’t already have some sort of VM infrastructure supporting you, you should look into the possibilities with Vagrant.

elementary OS, a distribution like no other
https://rogerhub.com/~r/code.rogerhub/infrastructure/488/elementary-os-a-distribution-like-no-other/
Sun, 18 Aug 2013 04:55:48 +0000

There are a surprising number of people who hate elementary OS. They say that elementary is technically just a Linux distribution, not an OS. They say that it is too similar to OS X. They say that the developers are in over their heads. All of these things may be true, but I do not care. I am sick of ascetic desktop environments without animations. I am tired of not having a compositor. I don’t need a dozen GUI applications to hold my hand, but I hate having to fix things that don’t work out of the box. You are not Richard Stallman. Face it: modern people, whether or not they are also computer programmers, do not live in the terminal, and most of us aren’t using laptops with hardware built in 2004. And finally, I am sick of blue and black window decorations that look like they were designed by a 12-year-old.

Elementary OS Luna is the first Linux distribution I’ve used where I don’t feel like changing a thing. The desktop environment defaults are excellent, and all of Ubuntu’s hardware support, community PPAs, and familiar package manager are available. At the same time, there is a lot of graphical magic, window animation, and attention to detail that is quite similar to OS X. There is a minimal amount of hand-holding, and it’s easy to get used to the desktop because of the intuitive keyboard shortcuts and great application integration.

You can’t tell from a screenshot alone, but elementary OS is more than just a desktop environment. The distribution comes packaged with a bunch of custom-built applications, like the custom file manager and terminal app. Other apps, like an IRC client, a social media client, and a system search tool, are available in community PPAs. I do most of my work through the web browser, ssh, or vim. Important data on my laptop is limited to personal files on my Google Drive and a directory of projects in active development that’s regularly backed up to my personal file server. Programs can be painlessly installed from the package manager, and configuration files are symlinked from my Google Drive. I’m not very attached to the current state of my laptop at any given moment, because all the data on it is replaceable, with the exception of OS tweaks. I don’t like having to install a dozen customization packages to get my workflow the way I like it, so the out-of-box experience is very important to me.

I would say that if you’re a regular Linux user, you should at least give elementary OS Luna a try, even if it’s on a friend’s machine or in a VM. You may be surprised. I was.
