Anti-fraud Detection in Best of Berkeley

Near the end of Spring Break, I helped build the back-end for the Daily Cal’s Best of Berkeley voting website. The awards are given to restaurants and organizations chosen by public vote over the course of a week. Somewhere during development, we decided it’d be more effective to implement fraud detection rather than prevention. Prevention would only encourage evasion and resistance, whereas with detection we could sit back and track fraud as it occurred. It made for some interesting statistics too.

This first one’s simple. One of the candidates earned a highly suspicious number of submissions in which it was the only choice selected across all 39 categories and 195 candidates. Our fraud-recognition system aggregated sets of submission choices and raised alerts when abnormal numbers of identical choices appeared. The graph shows the frequency of submissions where only this candidate was selected and demonstrates the regularity and nonrandomness of these fraudulent entries.

Combined with data from tracking cookies and user agents, it’s safe to say that these submissions could be cast out.
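
The aggregation of identical choice sets described above can be sketched in a few lines of Python. This is only an illustration, not the code we ran: the function name `find_repeated_ballots`, the `min_count` cut-off, and the sample data are all made up for the example, and it assumes each submission is a mapping from category to chosen candidate.

```python
from collections import Counter

def find_repeated_ballots(submissions, min_count):
    """Return (ballot, count) pairs seen at least min_count times, most common first."""
    # Freeze each ballot so identical sets of choices hash to the same key.
    counts = Counter(frozenset(choices.items()) for choices in submissions)
    return [(dict(ballot), n) for ballot, n in counts.most_common() if n >= min_count]

# Hypothetical data: each submission maps category -> chosen candidate.
submissions = [
    {"Best Pizza": "Candidate A"},
    {"Best Pizza": "Candidate A"},
    {"Best Pizza": "Candidate A"},
    {"Best Pizza": "Candidate B", "Best Coffee": "Candidate C"},
]
suspicious = find_repeated_ballots(submissions, min_count=3)
```

A ballot that picks a single candidate and nothing else is not suspicious on its own; it is the count of identical ballots relative to everything else that stands out.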

The system also calculated and analyzed the elapsed time between successive submissions. It raised alerts both for abnormal volume, when a large number of submissions arrived in a short time, and for abnormal regularity, when submissions arrived at suspiciously even intervals. From the graph, it looks like it takes about 10.5 to 12 seconds for the whole process: reload, scroll, check vote, scroll, submit.

The calculations for this alert were a bit trickier than I expected. At first, I thought of using a queue where old entries would be discarded:

s := queue()
for each submission in sort_by_time(submissions):
  s.add(submission)
  s.remove_entries_before(submission.time - 5 minutes)
  if s.length > threshold:
    send_alert()
    s.clear()

This didn’t work very well. Because the queue was flushed each time its length exceeded the threshold, sustained fraud raised a fresh alert every time another threshold’s worth of submissions came in. So, I added a minimum time-out before another alert could be raised:

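s := queue()
last_alert := 0
for each submission in sort_by_time(submissions):
  s.add(submission)
  s.remove_entries_before(submission.time - 5 minutes)
  # timeout is the minimum gap between alerts, separate from the count threshold
  if s.length > threshold and submission.time - last_alert > timeout:
    send_alert()
    s.clear()
    last_alert := submission.time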
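
For anyone who wants to play with this, here is a rough runnable version of the pseudocode in Python. It is a sketch rather than what ran in production: the function name and the default values for `window`, `threshold`, and `cooldown` are assumptions chosen for the example, and timestamps are taken to be plain seconds.

```python
from collections import deque

def volume_alerts(timestamps, window=300.0, threshold=50, cooldown=600.0):
    """Return the timestamps at which a volume alert would fire."""
    recent = deque()              # submissions inside the current window
    alerts = []
    last_alert = float("-inf")    # no alert has fired yet
    for t in sorted(timestamps):
        recent.append(t)
        # Drop entries that fell out of the window ending at t.
        while recent[0] < t - window:
            recent.popleft()
        # Alert only if the window is over-full AND the cooldown has passed.
        if len(recent) > threshold and t - last_alert > cooldown:
            alerts.append(t)
            recent.clear()
            last_alert = t
    return alerts

# Hypothetical burst: one submission per second for 200 seconds.
burst = [float(i) for i in range(200)]
with_cooldown = volume_alerts(burst)
without_cooldown = volume_alerts(burst, cooldown=0.0)
```

With the cooldown, the burst raises a single alert; with `cooldown=0.0` it behaves like the first version and alerts again each time the queue refills.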

The regularity detector worked similarly, except that it stored the time differences between successive submissions in a list, sorted the list, and then ran the same check with a much smaller threshold (around 0.3 seconds). If submissions had arrived independently at random, the bars in the graph would be more or less level, but that is hardly the case. After these fraudulent entries were removed, this particular candidate was left with a paltry 70-some votes, about 5% of its pre-filtered count.
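
One way to read the regularity check is as a sliding window over the sorted time differences: find the largest group of gaps that all lie within about 0.3 seconds of each other. The Python below is a sketch under that reading, not the original implementation; the function name, the `tolerance` parameter, and the sample timestamps are invented for illustration.

```python
def densest_gap_cluster(timestamps, tolerance=0.3):
    """Return (count, smallest_gap, largest_gap) for the largest group of
    inter-submission gaps that lie within `tolerance` seconds of each other."""
    times = sorted(timestamps)
    # Time differences between successive submissions, sorted by size.
    gaps = sorted(b - a for a, b in zip(times, times[1:]))
    best = (0, None, None)
    start = 0
    for end, gap in enumerate(gaps):
        # Shrink the window until every gap in it is within tolerance.
        while gap - gaps[start] > tolerance:
            start += 1
        count = end - start + 1
        if count > best[0]:
            best = (count, gaps[start], gap)
    return best

# Hypothetical timestamps: a script voting roughly every 11 seconds,
# followed by two unrelated submissions.
times = [0, 11, 22.25, 33.25, 44.5, 100, 250]
cluster = densest_gap_cluster(times)
```

An alert would then fire when the cluster's count exceeds whatever number of near-identical gaps is considered abnormal for the traffic at hand.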