Code.RogerHub » data https://rogerhub.com/~r/code.rogerhub The programming blog at RogerHub Fri, 27 Mar 2015 23:04:04 +0000 en-US hourly 1 http://wordpress.org/?v=4.2.2 Archiving and compressing a dynamic web application https://rogerhub.com/~r/code.rogerhub/programming/508/archiving-and-compressing-a-dynamic-web-application/ https://rogerhub.com/~r/code.rogerhub/programming/508/archiving-and-compressing-a-dynamic-web-application/#comments Sun, 01 Sep 2013 05:04:22 +0000 https://rogerhub.com/~r/code.rogerhub/?p=508 From 1999 to mid-2011, the Daily Cal used an in-house CMS to run the website. It contains around 65,000 individual articles and thousands of images. But ever since we moved to WordPress, the old system has been collecting dust in the corner. It was about time that all of the old CMS content was archived as static HTML so that it could be served indefinitely in the future as server software evolves. To accomplish this, I set up a linux virtual machine with my trusty vagrant up utility on my spare home server.

Retrieving the application data

The CMS’s production server did not actually have enough free disk space to create an tarball copy of the application. But since there were on the order of 10,000 files involved, a recursive copy via scp would be too slow. In order to speed up the transfer process, I used a little gnu/linux philosophy and piped a few things together:

$ screen
$ ssh root@domain "tar -c /srv/archive" | tar xvf -

I decided that compression was not going to be very helpful because most of the data was jpeg and png images, which are not very compressible. Enabling compression would just slow things down, since the bottleneck would become the CPU rather than the network.

Preparing the application

The CMS had very few dependencies, which is not surprising given the state of PHP 10 years ago. I set up a simple nginx+php-fpm+mysql configuration with a single PHP worker thread. The crawl operation would be executed in serial anyway, so multiple workers would not be useful.

I also added an entry to the VM’s hosts file for the hostname of the production server. The server’s hostname was hardcoded in a few places, and I didn’t want the crawl operation sending requests out to Internet. Additionally, I set up a secondary web server configuration that served generated HTML from the output directory and static assets from the application data, so that I could preview the results as they were being generated.

Crawling the application

Generating the static pages was the hardest part of the archival process, and it took me around 5 retries to get it right. The primary purpose of the whole archival operation was to remove the dynamic portions of the web application. This meant that static versions of every single conceivable page request had to be run against the application and saved. I picked wget as my archival tool.

Articles in the CMS were stored in one of two places. However, the format of the URL was luckily the same for both. I dumped article IDs from the MySQL database and created a seed file of article URLs:

$ echo "SELECT article_id FROM dailycal.article;" | mysql -u root > article_ids
$ echo "SELECT id FROM dailycal.h_articles;" | mysql -u root  >> article_ids
$ sed -e 's/^/http:\/\/archive.dailycal.org\/article.php?id=/' -i article_ids

I didn’t see much point in setting a root password for the local MySQL installation, since this was a single-use VM anyway. Sed ate through the 65,326 article IDs in seconds. I then created a second seed file containing just the URL of the application root, from where (nearly) all other pages would be crawlable.

On the crawler’s final run, I set the following command line switches:

  • --trust-server-names – Sets the output file name according to the last request in a redirection chain. By default, wget uses the first request.
  • --append-output=generator.log – Outputs progress information to a file, so that I can run the main process in screen and monitor it with tail in follow mode.
  • --input-file=source.txt – Specifies the seed file of URLs.
  • --timestamping – Sets the file modification time according to HTTP headers, if available.
  • --no-host-directories – Disables the creation of per-host directories.
  • --default-page=index.php – Defines the name for request paths that end in a slash.
  • --html-extension – Appends the html file extension to all generated pages, even if another extension already exists.
  • --domains archive.dailycal.org – Restricts the crawl to only the application domain.

Additionally, I set the following switches to crawl through the links on the application’s root page.

  • -r – Enables recursive downloading.
  • --mirror – Sets infinite recursion depth and other useful options.

In total, 543MB of non-article HTML and 2.1GB of article HTML were generated. These are reasonable sizes, given how many URLs were crawled in total, but they were still a bit unwieldy to store. I looked for a solution.

Serving from a compressed archive

I knew a couple of data facts about the generated HTML:

  1. There was tons of redundancy. The articles shared much of their header and footer markup.
  2. Virtually all of the data consisted of printable characters and whitespace, which means considerably less unique information than 8 bits per byte.

Both of these factors made the HTML a good candidate for archive compression. My first thought was tar+gzip, but tar+gzip compression works on blocks, not files. To extract a single file, you’d need to parse all the data up till that file. A request for the last file in the archive could take 15 to 20 seconds!

Luckily, the zip file format maintains an index of individual files and compresses them individually, which means that single file extraction is instantaneous no matter where in the archive it is located. I opted to use fuse-zip, an extension for FUSE that lets you mount zip files onto the file system. Fully compressed, the 543MB of pages became a 92MB zip archive (83% deflation), and the 2.1GB of articles became a 407MB zip archive (81% deflation).

After everything was finished, I uploaded the newly created HTML archives to a new production server and shut down the old CMS. From there, a decade’s worth of archived articles can be maintained indefinitely for the future.

]]>
https://rogerhub.com/~r/code.rogerhub/programming/508/archiving-and-compressing-a-dynamic-web-application/feed/ 0
Anti-fraud Detection in Best of Berkeley https://rogerhub.com/~r/code.rogerhub/programming/135/anti-fraud-detection-in-best-of-berkeley/ https://rogerhub.com/~r/code.rogerhub/programming/135/anti-fraud-detection-in-best-of-berkeley/#comments Thu, 11 Apr 2013 08:02:53 +0000 https://rogerhub.com/~r/code.rogerhub/?p=135 Near the end of Spring Break, I helped build the back-end for the Daily Cal’s Best of Berkeley voting website. The awards are given to restaurants and organizations chosen via public voting over the period of a week. Somewhere during the development, we decided it’d be more effective to implement fraud detection rather than prevention. The latter would only encourage evasion and resistance while with the former, we could sit idle and track fraud as it occurred. It made for some interesting statistics too.

graph2

This first one’s simple. One of the candidates earned a highly suspicious number of submissions where they were the only choice selected in any of the 39 categories and 195 total candidates. Our fraud-recognition system aggregated sets of submission choices and raised alerts when abnormal numbers of identical choices appeared. The graph shows the frequency of submissions where only this candidate was selected and demonstrates the regularity and nonrandomness of these fraudulent entries.

Combined with data from tracking cookies and user agents, it’s safe to say that these submissions could be cast out.

graph1

The system also calculated and analyzed the elapsed time between successive submissions. It alerted both for abnormal volume, when a large number of submissions were received in a small time, and for abnormal regularity, when submissions came in at abnormally regular intervals. From the graph, it looks like it takes about 10.5 to 12 seconds for the whole process: reload, scroll, check vote, scroll, submit.

The calculations for this alert were a bit trickier than I expected. At first, I thought of using a queue where old entries would be discarded:

s := queue()
for each sort_by_time(submission):
  s.add(submission)
  s.remove_entries_before(5 minutes ago)
  if s.length > threshold:
    send_alert
    s->clear

This doesn’t work very well. Each time the queue length exceeded the threshold, it would flush the queue and notify that threshold submissions were detected in abnormal volume. So, I added a minimum time-out before another alert would be raised.

s := queue()
last_alert := 0
for each sort_by_time(submission):
  s.add(submission)
  s.remove_entries_before(5 minutes ago)
  if s.length > threshold and now - last_alert > threshold:
    send_alert
    s->clear
    last_alert := now

The regularity detector performed a similar task, except it would store each of the time differences in a list, sort it, and then run with a smaller threshold (around 0.3 seconds). Ideally, observations about true randomness suggest that these bars should be more or less horizontal, but this is hardly the case. After these fraudulent entries were removed, this particular candidate was left with a paltry 70-some votes, about 5% of its pre-filtered count.

]]>
https://rogerhub.com/~r/code.rogerhub/programming/135/anti-fraud-detection-in-best-of-berkeley/feed/ 0