Grabbing an apache-generated directory listing

So it turns out that one of my professors, whose lectures I attend less frequently than I should, chooses to play classical music at the beginning of lecture when students are walking in. A couple of people started a thread on Piazza back when the course began to identify as many of these tracks as they could. On request, he posted a ripped vinyl on the course web page, which of course, is described with the default apache2 directory listing page, MIME-type icons and everything.

Short of a wget -r, it’s embarrassingly difficult to grab directories that are laid out like this from the command line. It isn’t such a big deal with only 14 files, but this number could easily scale up to a hundred, in which case you’d probably decide a programmatic solution would be worth it. For fun, I came up with the following:

for f in $(curl "http://www.cs.berkeley.edu/.../" 2>/dev/null | \
grep 'href="(.*?)\.m4a"' -o -P | cut -d '"' -f 2); \
do wget "http://www.cs.berkeley.edu/.../$f"; done

The three lines are split by a single backslash character (\) at the end of each line. This indicates to bash that the lines are meant to be treated as a single-line command (since I typed it as such in the first place).

The stuff inside the $( ... ) consists of a curl [url] 2>/dev/null piped into grep. The 2>/dev/null redirects the Standard Error stream to a black hole so that it isn’t displayed on the screen. This is to prevent curl from showing any information about its progress. (Curl also supports a --silent command-line switch that does the same thing.)

The grep simply searches for URL’s that link to a *.m4a file. The (.*?) syntax has special meaning in PERL-compatible grep, which I am invoking here with the -P switch. PERL supports a non-greedy syntax that translates to telling regular expression wildcards like .* to match as few characters as possible. This syntax is invoked, in this case, with the question mark. The parentheses in this case are unnecessary.

The -o command line switch tells grep only to print out the matching parts of the input, rather than their entire lines. The rest of the code just loops through the URL’s and prepends the absolute address of these files to the path on its way to wget.