Using locate to quickly change directories

You’ve probably got a bunch of directories and subdirectories in your home folder that are organized logically, rather than a mess of top-level directories scattered all over the place. Although the former is cleaner and better organized, it also takes more time to get to where you’d like to be. Luckily, linux comes with /usr/bin/locate to help you find files. I made the following for my bashrc:

goto () { cd "$(locate -i "$@" | grep -P "/home/roger/(Dropbox|Documents|Downloads|Development|Desktop)" | awk '{ print length(), $0 | "sort -n" }' | head -n 1 | cut -d " " -f2-)"; }

I’ll go through it part by part:

  • locate -i "$@" – Locate uses the database generated by updatedb to find files. The -i flag tells locate to ignore case.
  • grep -P "..." – I only want to select directories that are in one of the places I store my stuff. This also means that results from web browser caches and temp files will be ignored. You should obviously change this regular expression. The -P flag specifies PCRE (Perl-compatible regular expressions), since I am most comfortable with those.
  • awk '{ print length(), $0 | "sort -n" }' – Awk is a mini string manipulation language that you can use on the command line. In this case, it just prefixes each string with its length and sorts them.
  • head -n 1 – After the sorting, I just want the shortest result, so I grab that one.
  • cut -d " " -f2- – Now, I get rid of the length and keep everything else, which is the path I wanted. The -d flag tells cut to use a space as the delimiter (the default is a tab character).
  • All of this is wrapped in a cd "$( ... )";. Bash will execute the contents of the $( ... ) and feed the textual result into cd as the argument.

It isn’t as fast as Apple’s spotlight search, but the difference is negligible. For greater performance, you can customize the system-wide behavior of updatedb to search in fewer directories.

Anti-fraud Detection in Best of Berkeley

Near the end of Spring Break, I helped build the back-end for the Daily Cal’s Best of Berkeley voting website. The awards are given to restaurants and organizations chosen via public voting over the period of a week. Somewhere during the development, we decided it’d be more effective to implement fraud detection rather than prevention. The latter would only encourage evasion and resistance while with the former, we could sit idle and track fraud as it occurred. It made for some interesting statistics too.

graph2

This first one’s simple. One of the candidates earned a highly suspicious number of submissions where they were the only choice selected in any of the 39 categories and 195 total candidates. Our fraud-recognition system aggregated sets of submission choices and raised alerts when abnormal numbers of identical choices appeared. The graph shows the frequency of submissions where only this candidate was selected and demonstrates the regularity and nonrandomness of these fraudulent entries.

Combined with data from tracking cookies and user agents, it’s safe to say that these submissions could be cast out.

graph1

The system also calculated and analyzed the elapsed time between successive submissions. It alerted both for abnormal volume, when a large number of submissions were received in a small time, and for abnormal regularity, when submissions came in at abnormally regular intervals. From the graph, it looks like it takes about 10.5 to 12 seconds for the whole process: reload, scroll, check vote, scroll, submit.

The calculations for this alert were a bit trickier than I expected. At first, I thought of using a queue where old entries would be discarded:

s := queue()
for each sort_by_time(submission):
  s.add(submission)
  s.remove_entries_before(5 minutes ago)
  if s.length > threshold:
    send_alert
    s->clear

This doesn’t work very well. Each time the queue length exceeded the threshold, it would flush the queue and notify that threshold submissions were detected in abnormal volume. So, I added a minimum time-out before another alert would be raised.

s := queue()
last_alert := 0
for each sort_by_time(submission):
  s.add(submission)
  s.remove_entries_before(5 minutes ago)
  if s.length > threshold and now - last_alert > threshold:
    send_alert
    s->clear
    last_alert := now

The regularity detector performed a similar task, except it would store each of the time differences in a list, sort it, and then run with a smaller threshold (around 0.3 seconds). Ideally, observations about true randomness suggest that these bars should be more or less horizontal, but this is hardly the case. After these fraudulent entries were removed, this particular candidate was left with a paltry 70-some votes, about 5% of its pre-filtered count.

Grabbing an apache-generated directory listing

So it turns out that one of my professors, whose lectures I attend less frequently than I should, chooses to play classical music at the beginning of lecture when students are walking in. A couple of people started a thread on Piazza back when the course began to identify as many of these tracks as they could. On request, he posted a ripped vinyl on the course web page, which of course, is described with the default apache2 directory listing page, MIME-type icons and everything.

Short of a wget -r, it’s embarrassingly difficult to grab directories that are laid out like this from the command line. It isn’t such a big deal with only 14 files, but this number could easily scale up to a hundred, in which case you’d probably decide a programmatic solution would be worth it. For fun, I came up with the following:

for f in $(curl "http://www.cs.berkeley.edu/.../" 2>/dev/null | \
grep 'href="(.*?)\.m4a"' -o -P | cut -d '"' -f 2); \
do wget "http://www.cs.berkeley.edu/.../$f"; done

The three lines are split by a single backslash character (\) at the end of each line. This indicates to bash that the lines are meant to be treated as a single-line command (since I typed it as such in the first place).

The stuff inside the $( ... ) consists of a curl [url] 2>/dev/null piped into grep. The 2>/dev/null redirects the Standard Error stream to a black hole so that it isn’t displayed on the screen. This is to prevent curl from showing any information about its progress. (Curl also supports a --silent command-line switch that does the same thing.)

The grep simply searches for URL’s that link to a *.m4a file. The (.*?) syntax has special meaning in PERL-compatible grep, which I am invoking here with the -P switch. PERL supports a non-greedy syntax that translates to telling regular expression wildcards like .* to match as few characters as possible. This syntax is invoked, in this case, with the question mark. The parentheses in this case are unnecessary.

The -o command line switch tells grep only to print out the matching parts of the input, rather than their entire lines. The rest of the code just loops through the URL’s and prepends the absolute address of these files to the path on its way to wget.

How to tell if your system is big endian or little endian

I was on MDN when I noticed that Chrome’s V8 (at least) was little-endian, meaning that a hexadecimal number you’d write as 0xCAFEBABE is actually stored as BE BA FE CA if you read the bytes off memory. It made me wonder how you could most easily determine if your system is little or big endian. I then came upon this gem:

echo -n I | od -to2 | head -n1 | cut -f2 -d" " | cut -c6

I’ll go through it part by part:

  • echo -n I – You’re already familiar with echo. The -n switch just suppresses the newline \n character that, by default, follows the output. This part just prints the letter I.
  • od -to2 – If you look at man ascii, you’ll notice that in the 7-bit octal range (00 to 0177), the letter I is 0111, which has the convenient property of being all 1’s. od is a program that lets you inspect, in human-readable format, what may or may not be printable characters. The -to2 switch tells it to print 2 bytes in octal. (The -to1 switch would not work because both little-endian and big-endian machines would indistinguishably print 111.) Little-endian machines will give you 0000000 000111 while big-endian machines will give you 0000000 111000. Great! Let’s simplify this result down a bit.
  • head -n1 – Here’s an easy one. We’re just grabbing the first line of stdin and spitting it back out.
  • cut -f2 -d" " – Another good one to know. Cut is string.split(). The -f2 switch tells it that the second field is desired, and the -d" " switch gives it a single space character as a delimeter.
  • cut -c6 – Finally, here’s another cut. This one just grabs the 6th character. The 4th or the 5th would also suffice.

You could take this one-liner a step further and add conditional printing of “big endian” or “little endian”, but by point #2, it’s pretty much there.

A better better calculator with bc

GNU/bc is a command-line calculator for linux. It does variables [a-z0-9_]+ and +-*/^ and basic math, all through the terminal, and while it’s somewhat useful by default, it can be made to be a lot better. Reading man bc, you see that the program reads an environmental variable, BC_ENV_ARGS, which refers to a list of command line arguments implicitly appended to every invocation of bc. Quoting the source, those files “typically contain function definitions for functions the user wants defined every time bc is run”.

alias bc='bc -l'

First off, you’ll want to alias bc with the -l command line argument, which loads a bunch of math functions including the following:

s(x) Sine
c(x) Cosine
a(x) Arctangent
l(x) Natural logarithm
e(x) Exponential function
j(n, x) Bessel function

They clearly defined this minimal set with the expectation that you’d extend them, so of course, I did. In ~/.bashrc, I exportd BC_ENV_ARGS=~/.bcrc, and started the following in ~/.bcrc:

/* Config for bc */
scale=39

Scale is a special variable in bc. It determines the number of significant figures that bc will carry in its calculations, and 39 worked well for me. Then I defined some physical constants in the form of k_[a-z]+:

print "\n"
print "Usage: k_[?] with one of: c, g, atm (atmospheric), h, hbar, mu,\n"
print "       ep(silon), e (elementary charge), coulomb, me (electron),\n"
print "       mp, n (avogadro's), b (boltzmann), r (gas), si(gma)\n\n"

k_c = 299792458 /* Speed of Light */
k_g = 6.67384 * 10^-11 /* Universal Gravitation */
k_atm = 100325 /* Atmospheric pressure */
k_h = 6.62606957 * 10^-34 /* Planck's constant */
k_hbar = 1.054571726 * 10^-34 /* H Bar */
k_mu = 1.256637061 * 10^-6 /* Vacuum permeability */
k_ep = 8.854187817 * 10^-12 /* Vacuum permittivity */
k_epsilon = 8.854187817 * 10^-12 /* Vacuum permittivity */
k_e = 1.602176565 * 10^-19 /* Elementary charge */
k_coulomb = 8.987551787 * 10^9 /* Coulomb's constant */
k_me = 9.10938294 * 10^-31 /* Rest mass of an electron */
k_mp = 1.672621777 * 10^-27 /* Rest mass of a proton */
k_n = 6.02214129 * 10^23 /* Avogadro's number */
k_b = 1.3806488 * 10^-23 /* Boltzmann's constant */
k_r = 8.3144621 /* Ideal gas constant */
k_si = 5.670373 * 10^-8 /* Stefan-Boltzmann constant */
k_sigma = 5.670373 * 10^-8 /* Stefan-Boltzmann constant */

pi = 3.1415926535897932384626433832795028841968 /* pi */

And then these:

print "Usage: s(x) c(x) t(x) as(x) ac(x) at(x)\n"
print "       csc(x) sec(x) cot(x) e(x)\n\n"
define t(x) { return s(x)/c(x); }
define as(x) { return 2*a(x/(1+sqrt(1-x^2))); }
define ac(x) { return 2*a(sqrt(1-x^2)/(1+x)); }
define at(x) { return a(x); }
define csc(x) { return 1/s(x); }
define sec(x) { return 1/c(x); }
define cot(x) { return c(x)/s(x); }

I printed the default mathlib functions in the Usage for completeness. Stuff like sqrt() works too. I use bc primarily for physics homework, since built-in GUI calculators often have better base-conversion and visualization abilities for programming.