Code.RogerHub » Programming
The programming blog at RogerHub

Exporting comments from Facebook Comments to Disqus
Sun, 16 Mar 2014

About 14 months ago, The Daily Californian switched from Disqus to Facebook Comments in order to clean up the comments section of the website. But recently, the decision was reversed, and it was decided that Disqus would make a comeback. I worked with one of my Online Developers to facilitate the process.

One of the stipulations of the Facebook-to-Disqus transition was that the Online Team would transfer a year’s worth of Facebook comments over to Disqus so that they would remain on the website. At the time of writing, Facebook doesn’t natively support any kind of data export for its commenting platform, and I don’t expect that it ever will. However, it provides a reliable API to its comments database. With a bit of ingenuity and persistence, we were able to successfully import over 99% of the existing Facebook comments into Disqus.

Overview of the process

Disqus supports imports in the form of custom WXR files. These are WordPress-compatible XML files that describe posts (title, date, excerpt, ID, etc.) and comments (name, date, IP address, content, etc.).

The Daily Cal uses WordPress and the official Disqus WordPress plugin. The plugin identifies threads with a combination of WordPress’s internal post ID and a short permalink. Thread identifiers look like this:

var disqus_identifier = '528 https://rogerhub.com/~r/code.rogerhub/?p=528';

This one is taken right from the source code of this post (you can see for yourself).

Facebook, on the other hand, identifies threads by the content page’s URL. After all, their system was created for arbitrary content, not just blogs. The Facebook Open Graph API provides a good amount of information about comments: enough to identify multiple comments posted by a single user, along with accurate timestamps and reply relationships. There isn’t any personal information like IP addresses, but names are provided.

The overall process looked like this:

  1. On dailycal.org, we needed an API endpoint to grab page URLs along with other information about threads on the site.
  2. For each of these URLs, we check Facebook’s Open Graph API for comments that were posted on that URL. If there are any, then we put them into our local database.
  3. After we are done processing comments for all of the articles ever published on dailycal.org, we can export them to Disqus-compatible WXR and upload them to Disqus, as sketched below.
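
In skeletal Python, the pipeline amounts to a loop like this (a sketch only; every helper named here is a placeholder for our actual endpoint and API calls):

for url in fetch_article_urls():       # step 1: our dailycal.org endpoint
    comments = fetch_comments(url)     # step 2: Facebook's Open Graph API
    if comments:
        store_comments(url, comments)  # save into the local database
write_wxr()                            # step 3: export everything for Disqus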

This seems like a pretty straightforward data hacking project, but there were a couple of issues that we ran into.

Nitty-gritty details

The primary endpoint for Facebook’s Comments API is graph.facebook.com/comments/. It takes a couple of GET parameters:

  • limit — A maximum number of comments to return
  • ids — A comma-delimited list of article URLs
  • fields — For getting comment replies

The API supports pagination with cursors (next/prev URLs), but to keep things simple, we just hardcoded a limit parameter of 9999. By default, the endpoint returns only top-level comments. To get comment replies, you need to add a fields=comments parameter to the request.

You can make a few sample requests to get a feel for the structure of the JSON data returned.
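
For instance, a quick probe with Python’s requests library might look like this (a sketch; the article URL is made up, and only the endpoint and parameters described above come from the API):

import requests

# ask for up to 9999 top-level comments, plus their replies,
# for each article URL listed in "ids"
response = requests.get("https://graph.facebook.com/comments/", params={
    "ids": "http://www.dailycal.org/sample-article/",
    "limit": 9999,
    "fields": "comments",
})
data = response.json()  # a dictionary keyed by article URL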

Disqus supports a well-documented XML-based import format. In our case, we decided that names were sufficient identification, although Disqus will support some social login tagging in the import format as well. The format specifies a place for the article content, which is an odd request, since article content is usually quite massive. We decided to supply just an excerpt of the article content rather than the entire page.

There were a few more precautions we took before we started development. In order to avoid arousing suspicion from Facebook’s anti-scraping defenses, as well as from our own web application firewall (WAF), we grabbed the user agent of a typical Google Chrome client running on Windows 7 x86-64 and used it for all of our requests. We also created a couple of test import “sites” on Disqus, since the final destination of our generated WXR was the Disqus site we had used a year earlier, before the switch to Facebook. There isn’t any way to copy comments or clone a Disqus site, so we didn’t want to make any mistakes.

Unicode support and escape sequences

The first version of our program had terrible support for non-ASCII characters. It’s not that our commenters were all typing in Arabic (although we did have a couple of comments that really were in Arabic): even smart quotes, which show up in perfectly ordinary English, were enough to break the process.

Facebook’s API spits out JSON data and uses JSON’s string encoding. For example, the left double quotation mark, otherwise known as &ldquo; in HTML/XML, is encoded as \u201c using JSON’s Unicode escape syntax. However, the JSON data also contains HTML-encoded entities like &amp;. (Update: It appears that Facebook has corrected this issue.)

Python’s JSON library takes care of JSON escape sequences as it decodes the string into a native dictionary. However, the script also applies HTML entity decoding to that result, in case there are any straggling escape sequences left. Since Disqus’s WXR format suggests that you put the comment content in a CDATA block, the only thing you need to escape is the CDATA ending sequence, ]]>. You can do this by splitting it across two CDATA sections (i.e. replacing ]]> with ]]]]><![CDATA[>).
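
Here is roughly what that escaping step looks like in Python (a sketch; the function name is ours):

def cdata(text):
    # split any "]]>" in the comment body across two CDATA sections
    # so it can't terminate the enclosing block early
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"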

HTTP exceptions

Our API endpoint would time out or throw an error every once in a while. To make our scraper more robust, we set up handlers for HTTP error responses. The scraper would retry a failed API request up to 5 times before giving up. If none of the attempts succeeded, the URL was logged to the database for further debugging.
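
The retry wrapper was simple. Here is a sketch, assuming the requests library (log_failed_url stands in for our actual logging code):

import time
import requests

MAX_ATTEMPTS = 5

def fetch_json(url, params):
    for attempt in range(MAX_ATTEMPTS):
        try:
            response = requests.get(url, params=params, timeout=30)
            response.raise_for_status()  # treat HTTP error responses as failures
            return response.json()
        except requests.RequestException:
            time.sleep(2 ** attempt)  # brief, growing pause between attempts
    log_failed_url(url)  # hypothetical helper: record the URL for debugging
    return None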

Commenter email addresses

Every commenter listed in the WXR needs an email address. Comments with the same email address will be tied together, and if somebody ever registers a Disqus account with that email address, they can claim the comments as their own (and edit/delete them). Facebook provides a unique ID that will associate multiple comments by the same person. But since Facebook comments also allows AOL and Yahoo! logins, not every comment has such an ID. Our script used the Facebook-provided ID when it was present, and generated a random one outside of Facebook’s typical range when it wasn’t. All of the emails ended with @dailycal.org, which meant that we would retain control over the registration verification, in case we needed it.
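
Sketched in Python, the email generation step looked something like this (the "from"/"id" fields reflect the Graph API’s comment structure as we used it; the numeric range is an assumption):

import random

# an arbitrary range well above typical Facebook user IDs at the time
FAKE_ID_FLOOR = 10 ** 17

def commenter_email(comment):
    uid = comment.get("from", {}).get("id")
    if uid is None:
        # AOL and Yahoo! commenters have no Facebook ID, so invent one
        uid = random.randint(FAKE_ID_FLOOR, 10 * FAKE_ID_FLOOR)
    return "%s@dailycal.org" % uid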

Edge cases

Disqus requires that comments be at least 2 characters long. There were a couple of Facebook comments that consisted of just a single character: “B” or “G” or the like. These had to be filtered out before the XML export process.

We also ran into a case where a visitor commented “B<newline>” in the Facebook comments. For Disqus, this still counts as one character, since the CDATA body is stripped of leading and trailing whitespace before processing. The first version of our script didn’t strip whitespace before checking the length, so it failed to filter out this erroneous comment.
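
The corrected filter strips whitespace before measuring the length:

MIN_COMMENT_LENGTH = 2  # Disqus's minimum

def is_importable(body):
    # Disqus trims the CDATA body before checking length, so we must too
    return len(body.strip()) >= MIN_COMMENT_LENGTH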

Timezone issues

After a couple of successful trials, we created a test site in Disqus and imported a sample of the generated WXR. Everything looked good until we went and cross-referenced Disqus’s data with the Facebook comments that were displayed on the site. The comment times appeared to be around 5 hours off!

Here in California, we’re GMT -0800, so there wasn’t a clear explanation for why the comment times were off. The WXR specified GMT times, which we verified were correct. The bug seemed to be on Disqus’s end. We contacted Disqus support and posted on their developers’ mailing list, but after around a week, we decided that it would be easiest to just counteract the delay with an offset in the opposite direction.

import datetime

# shift each timestamp five hours earlier to counteract the import offset
time_offset = datetime.timedelta(hours=-5)
comment_date = (datetime.datetime.strptime(
    comment['created_time'], "%Y-%m-%dT%H:%M:%S+0000"
    ) + time_offset).strftime("%Y-%m-%d %H:%M:%S")

Conclusion

After a dozen successful test runs, we pulled the trigger and unloaded the WXR onto the live Disqus site. The import process finished within 5 minutes, and everything worked without a hitch.

How to write fast-rendering HTML and CSS
Sun, 24 Nov 2013

RogerHub’s most popular page contains over 50,000 HTML nodes. It’s no easy task for a browser to chew through all of that HTML and render it quickly. However, most modern browsers are now able to render that particular page without any performance issues, including mobile and tablet web browsers. When it comes to website performance, most people are concerned with app server response time or JavaScript performance. There really aren’t a lot of things that you can do to make gigantic pages load more quickly, but here are some tricks I learned along the way.

Cut down on the CSS3

Drop shadows and gradients look great when you use them correctly, but when your document is 600,000px tall, they create serious browser lag no matter how modern your hardware may be. You get the fastest rendering performance with simple solid colors. Aesthetics are a small sacrifice when you’re trying to squeeze more speed out of your markup.

Hide unnecessary things

I keep all of my calculator’s comments on a single page because 1) that’s the way it has always been, and 2) SEO. However, many users never even look at those comments. It improves browser performance if you simply instruct the browser not to display most of that markup until it is requested, which brings me to the next point.

Limit the number of tree children

Rather than applying a universal .comment-hidden class to all older hidden comments, put hidden elements under a single unifying parent and apply styles to the parent instead. It’s much faster to style one parent element than a thousand children.

Fake it

The comments on RogerHub no longer support Gravatar-based photos, since literally hundreds of unique avatars were being requested and rendered on each page load. Since I didn’t want to take out the images entirely, everybody now gets a generic anonymous avatar. Much of the original comment markup has also been taken out, leaving only the bare necessities required to properly apply styles.

Use simple, non-nested classes

I don’t have any hard data behind this practice, but intuitively I feel that simple unnested classes are the fastest way to go about styling an arbitrary number of elements. Avoid tag selectors, deeply nested selectors, wildcard selectors, attribute/value selectors, etc.

Archiving and compressing a dynamic web application
Sun, 01 Sep 2013

From 1999 to mid-2011, the Daily Cal used an in-house CMS to run the website. It contains around 65,000 individual articles and thousands of images. But ever since we moved to WordPress, the old system has been collecting dust in the corner. It was about time that all of the old CMS content was archived as static HTML so that it could be served indefinitely in the future as server software evolves. To accomplish this, I set up a Linux virtual machine with my trusty vagrant up utility on my spare home server.

Retrieving the application data

The CMS’s production server did not actually have enough free disk space to create a tarball copy of the application. But since there were on the order of 10,000 files involved, a recursive copy via scp would be too slow. In order to speed up the transfer process, I used a little GNU/Linux philosophy and piped a few things together:

$ screen
$ ssh root@domain "tar -c /srv/archive" | tar xvf -

I decided that compression was not going to be very helpful because most of the data was jpeg and png images, which are not very compressible. Enabling compression would just slow things down, since the bottleneck would become the CPU rather than the network.

Preparing the application

The CMS had very few dependencies, which is not surprising given the state of PHP 10 years ago. I set up a simple nginx+php-fpm+mysql configuration with a single PHP worker thread. The crawl operation would be executed in serial anyway, so multiple workers would not be useful.

I also added an entry to the VM’s hosts file for the hostname of the production server. The server’s hostname was hardcoded in a few places, and I didn’t want the crawl operation sending requests out to the Internet. Additionally, I set up a secondary web server configuration that served generated HTML from the output directory and static assets from the application data, so that I could preview the results as they were being generated.

Crawling the application

Generating the static pages was the hardest part of the archival process, and it took me around 5 retries to get it right. The primary purpose of the whole archival operation was to remove the dynamic portions of the web application. This meant that static versions of every single conceivable page request had to be run against the application and saved. I picked wget as my archival tool.

Articles in the CMS were stored in one of two places. However, the format of the URL was luckily the same for both. I dumped article IDs from the MySQL database and created a seed file of article URLs:

$ echo "SELECT article_id FROM dailycal.article;" | mysql -u root > article_ids
$ echo "SELECT id FROM dailycal.h_articles;" | mysql -u root  >> article_ids
$ sed -e 's/^/http:\/\/archive.dailycal.org\/article.php?id=/' -i article_ids

I didn’t see much point in setting a root password for the local MySQL installation, since this was a single-use VM anyway. Sed ate through the 65,326 article IDs in seconds. I then created a second seed file containing just the URL of the application root, from where (nearly) all other pages would be crawlable.

On the crawler’s final run, I set the following command line switches:

  • --trust-server-names – Sets the output file name according to the last request in a redirection chain. By default, wget uses the first request.
  • --append-output=generator.log – Outputs progress information to a file, so that I can run the main process in screen and monitor it with tail in follow mode.
  • --input-file=source.txt – Specifies the seed file of URLs.
  • --timestamping – Sets the file modification time according to HTTP headers, if available.
  • --no-host-directories – Disables the creation of per-host directories.
  • --default-page=index.php – Defines the name for request paths that end in a slash.
  • --html-extension – Appends the html file extension to all generated pages, even if another extension already exists.
  • --domains archive.dailycal.org – Restricts the crawl to only the application domain.

Additionally, I set the following switches to crawl through the links on the application’s root page.

  • -r – Enables recursive downloading.
  • --mirror – Sets infinite recursion depth and other useful options.

In total, 543MB of non-article HTML and 2.1GB of article HTML were generated. These are reasonable sizes, given how many URLs were crawled in total, but they were still a bit unwieldy to store. I looked for a solution.

Serving from a compressed archive

I knew a couple of things about the generated HTML:

  1. There was tons of redundancy. The articles shared much of their header and footer markup.
  2. Virtually all of the data consisted of printable characters and whitespace, which means considerably less unique information than 8 bits per byte.

Both of these factors made the HTML a good candidate for archive compression. My first thought was tar+gzip, but gzip compresses the whole archive as one stream, not file by file. To extract a single file, you’d need to decompress all of the data that comes before it. A request for the last file in the archive could take 15 to 20 seconds!

Luckily, the zip file format maintains an index of individual files and compresses them individually, which means that single file extraction is instantaneous no matter where in the archive it is located. I opted to use fuse-zip, an extension for FUSE that lets you mount zip files onto the file system. Fully compressed, the 543MB of pages became a 92MB zip archive (83% deflation), and the 2.1GB of articles became a 407MB zip archive (81% deflation).
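
Python’s standard zipfile module illustrates the difference: pulling one member out of the archive is a direct lookup that inflates only that member’s bytes (the file names here are made up):

import zipfile

with zipfile.ZipFile("articles.zip") as archive:
    # seeks straight to the member via the zip index
    html = archive.read("article.php?id=12345.html")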

After everything was finished, I uploaded the newly created HTML archives to a new production server and shut down the old CMS. From there, a decade’s worth of archived articles can be maintained indefinitely for the future.

7 tips for writing better CSS
Fri, 26 Jul 2013

CSS stylesheets are a fundamental part of the web, but they are also one of the most neglected parts of modern web applications. Traditional programming languages give you a ton of organizational features: namespaces, classes, scope, blocks, etc. CSS has none of these constructs.

I recently released the WordPress theme that runs this blog on the free public WordPress Theme Repository. The theme repository has very high quality-assurance standards and a team of theme reviewers, who run test suites on each newly submitted theme as well as updates to approved themes. While I was preparing my theme for public release, I picked up on a lot of good CSS tips that helped me improve the maintainability of my theme’s CSS. I hope they will help you too:

0. Use a CSS Preprocessor

Before you continue reading, the first thing you should do is make sure you’re using a CSS preprocessor like SASS. CSS preprocessors are tools that let you write your CSS in a special stylesheet language, and then translate your code into actual CSS before it goes into production. They add programming language features to CSS like variables, nested blocks, mixins (CSS functions), inheritance, and single-line comments.

I choose SASS over other options like LESS or SASS+Compass because:

  • SASS is highly stable (it’s built in Ruby) and contains zero bullshit.
  • SASS supports an indentation-based syntax similar to python and yaml. LESS does not.
  • SASS requires zero configuration, as it should.

It’s easy to set up a loop that re-compiles your SASS when you save changes to it, using the tools I wrote about in my continuous integration post.

1. Stop nesting. Use classes.

When you write CSS, always remember to separate structure and presentation. You should not be adding extraneous elements to your HTML for presentation purposes. This also means that you should disentangle your CSS rules from the structure of your HTML. This is not okay:

<body>
  <header>
    <hgroup>
      <h1>Welcome to my site!</h1>
    </hgroup>

...

// Don't do this!!
body > header
  display: fixed
  top: 0
  hgroup
    h1
      color: white
      font: 700 2.25em/1.5 $serif

As a rule of thumb, you should only nest blocks like this when you mean it. A long selector like body > header div h1 is hardly ever what you want. This CSS will break if you ever move your HTML elements around or add/remove a wrapper anywhere.

Use classes liberally. Your CSS should not look like just an outline of your HTML.

In the long run, classes go a long way towards maintainable CSS. You can always do a recursive search through CSS files to look for class names, but you will have a hard time locating a CSS block like the one above if all you have is the resulting 4-part CSS selector. The above example could be written with classes like this:

<body>
  <header class="site-header">
    <hgroup>
      <h1 class="site-name">...</h1>
    </hgroup>
...

.site-header
  display: fixed
  top: 0
.site-title
  color: white
  font: 700 2.25em/1.5 $serif

On that note, you should also stop leaving <div class="clear-fix"></div> around in your HTML. I’ll show you how to work around this with CSS Inheritance:

2. Selector inheritance is awesome

Inheritance is a very powerful, underrated feature in SASS. (LESS doesn’t support it in the way that SASS does.) Selector inheritance allows you to make classes that “extend” other classes, like parent and child classes do in OOP. It doesn’t do this by stupidly copying their CSS rules, but by adding extra selectors to the parent class’s CSS blocks. Let me show you an example:

.content
  margin: 1em 0 1.5em
  p
    color: white
    margin-bottom: 1.5em
.summary
  @extend .content
  margin-bottom: 2em

...

// Would compile to...
.content, .summary {
  margin: 1em 0 1.5em;
}
.content p, .summary p {
  color: white;
  margin-bottom: 1.5em;
}
.summary {
  margin-bottom: 2em;
}

Any rule and child-rule applied to the parent selector will also be applied to the child. Child classes can also have their own properties that override the parent’s. Here’s a more realistic example of what you can do with selector inheritance:

You’ve probably run into the clearing float CSS problem before. If you’re not familiar with it, take a look at this sample code:

<div class="container">
  <div class="left">
    ...
  </div>
  <div class="right">
    ...
  </div>
</div>

...
.container
  border: 1px solid black

.left
  float: left
.right
  float: right

The intended effect is that the container’s 1px black border surrounds both .left and .right, but only the top border is displayed. The problem arises because floated elements are taken out of the normal flow, so .container has no height. There are several solutions to this problem, but the most common one involves adding an extra element after both floated divs and giving it clear: both. Instead of adding extraneous presentation elements to the HTML, you can use selector inheritance:

<div class="container">
  ...
</div>

...
.after-clear-fix
  // "&" refers to the current selector
  &:after
    clear: both
    content: '.'
    display: block
    height: 0
    visibility: hidden

.container
  @extend .after-clear-fix

If you apply @extend .after-clear-fix to several elements, it will compile to a single long CSS selector whose body contains the clearfix rules, thus reducing redundancy in the final stylesheet:

.after-clear-fix:after,
.container:after,
.content:after,
.search-area:after {
  ...
}

SASS also supports a custom syntax (placeholder selectors, written with a leading %) that prevents the original class from being printed in the output. Another powerful SASS feature is the @import directive, which works just like CSS’s import directive, but actually brings in the content of the target stylesheet if it is available locally.

3. Organizing your sass/ directory

Most WordPress themes have just 1 stylesheet that’s located at the theme root. When using SASS, my theme stylesheet is usually generated with SASS, while the SASS source files are located in a separate directory. SASS supports the idea of CSS partials. Partials are SASS files whose names begin with an underscore. They are meant to be imported into other stylesheets, and not compiled directly by themselves (although you can do that if you want).

The @import directive is also partial-aware and will match @import "reset" with a file named _reset.sass.

For larger projects, you should split your CSS into files that reflect the semantic role of the parts of the page they style. Additionally, you will usually have a few extra SASS source files for a CSS reset, JavaScript plugins, color/font variables, and mixins. Here’s my theme’s sass directory as an example:

  • style.sass — The target of the SASS compiler. WordPress theme metadata is stored in CSS comments, so I put my metadata at the top of this file. This source file also imports the rest of your source files.
  • _reset.sass — I use Eric Meyer’s CSS reset. It’s available in SASS form on Google.
  • _variables.sass — It’s useful to store all of your colors and fonts in one place and refer to them by variable names. In a monochrome theme like this one, I used color variables named $blue1 to $blue15 in order of increasing lightness.
  • _cobalt-global.sass — Buttons, input elements, and WordPress core classes
  • _cobalt-grid.sass — Grid for desktop and mobile layout
  • _cobalt-header.sass
  • _cobalt-primary.sass — The primary content area
  • _cobalt-secondary.sass  — Secondary content like sidebars
  • _cobalt-footer.sass

You can see that this naming scheme reflects traditional source code a lot more than CSS usually does, and because of the import directive, you don’t sacrifice any performance by splitting your CSS into files like this. One mistake that beginner CSS programmers sometimes make is separating the rules for a single selector into different files. They organize their stylesheets by the semantics of the CSS properties, rather than the semantics of their selectors.

// Don't do this!!
//_colors.sass
.header
  color: black
.content
  color: #CCCCCC
...

//_typography.sass
.header
  font: 700 2.25em/1.5 $serif
.content
  font: 400 1em/1.68 $sans-serif

When you split up your stylesheets like this, you end up having to edit 5 or 6 files even if you’re just working on one part of the site design. Usually there’s also a _mobile.sass thrown into the mix, which is also a bad idea. On the other hand, you don’t want to be writing @media only screen and (max-width: 767px) more than once. You can avoid both of these problems by using SASS mixins:

4. Essential mixins for any project

One of the reasons I don’t like SASS+Compass is that there is too much hand-holding and mixins that you’d never use. Most useful mixins are ones that you can program yourself in a few lines. Here are the essential ones that you might actually find useful:

Mobile Layout Mixin

$mobile1: 767px
$mobile2: 479px

= mobile1
  @media only screen and (max-width: $mobile1)
    @content

= mobile2
  @media only screen and (max-width: $mobile2)
    @content

This is the solution to the mobile layout problem I mentioned in the previous section. Mixins in SASS, as you can see, are defined with the equal sign. The SASS content directive is similar to the Ruby yield directive. Instead of typing out the mobile media query every time you need to use it (or even a lengthy variable interpolating selector), you can use the content directive in SASS 3.2+ to define your responsive mobile styles. Here’s an example of it in action:

.site-navigation
  float: right
  +mobile1
    float: none
  li
    display: inline-block
    +mobile2
      display: block

You should never make a separate SASS source file just for mobile CSS. Instead, mobile styles should be kept with the original non-mobile CSS so that you can see both the original and the overriding properties in a single place.

Vendor Prefix Mixin

= vendor-prefix($name, $argument)
  #{$name}: $argument
  -webkit-#{$name}: $argument
  -moz-#{$name}: $argument
  -ms-#{$name}: $argument
  -o-#{$name}: $argument

A vendor prefix mixin is always useful when using CSS3 properties like border radius and transitions. However, remember not to depend on CSS3 properties for critical parts of the page. Occasionally, you will need to specify the argument on a separate line if it contains commas:

.button
  $transition: background 0.3s, color 0.1s
  +vendor-prefix(transition, $transition)

That’s it. These two are really the only mixins that you may ever need, and you could have written them yourself in a minute. As a rule of thumb, you should only turn a rule into a mixin if you’re using it at least 3 times, and if a mixin only has 1 property, you might as well use a variable.

5. Prefixes for your classes

Anyone who has worked on a large project has come across an HTML element with a class whose purpose was completely unknown. You want to remove it, but you’re hesitant because it may have some purpose that you’re not aware of. As this happens again and again, over time, your HTML become filled with classes that serve no purpose just because team members are afraid to delete them. The problem is that classes are generally given too many responsibilities in front-end web development. They style HTML elements, they act as JavaScript hooks, they’re added to the HTML for feature detections, they’re used in automated tests, etc.

— engineering.appfolio.com

This quote is from an article about CSS architecture by Philip Walton. He recommends that you prefix classes used by JavaScript hooks with .js- and so on, so that you can change CSS class names without breaking anything. I think it’s a great idea, and I’ve been using these other class prefixes as well: (most of these are for WordPress)

  • .site- — For classes that apply site-wide like site headers and footers.
  • .entry- — For classes that apply to a single element, of which there are many.
  • .nav- — For classes on navigation elements that apply to many entry elements.
  • .var- — For variants of classes (e.g. var-blue or var-small)

Intent-based class prefixes like these are advantageous over colloquial class names like .next-button or position-based class prefixes like .bottom-navigation. They let you form a natural class hierarchy using dashes, so you can be sure that the class names you create are unique. Above all, you don’t want to use generic class names like .container or .header without prefixes. They don’t convey much information about their intent, and they are easy to reuse accidentally, especially if you plan on using messy nested selectors.

6. Know your CSS shorthands

CSS is a vertical language. Lines are rarely more than 75 characters long, but blocks can span 20 or 30 lines easily. To conserve space on your screen (and the screens of those who are reviewing/maintaining your code), use CSS shorthands liberally. Here are a few essential shorthands forms that you should know by heart:

Box Model Shorthand

Numeric box-model properties like margin, padding, and border offer a shorthand notation. The order is clockwise from top, as demonstrated below:

.button
  padding-top: 2px
  padding-right: 3px
  padding-bottom: 4px
  padding-left: 5px

// Equivalent to:
.button
  padding: 2px 3px 4px 5px

You can also specify just 2 or 3 values instead of all 4. In this case, the missing sides will have the same value as their opposite side:

.button
  margin: 4px 3px
  border-width: 1px 2px 0

  // Equivalent to:
  margin-top: 4px
  margin-bottom: 4px
  margin-right: 3px
  margin-left: 3px

  border-top-width: 1px
  border-right-width: 2px
  border-left-width: 2px
  border-bottom-width: 0

Border Shorthand

Borders have 3 properties that are easily distinguishable. They also offer a shorthand form to write all three in one property. Note that you can’t combine the box-model shorthand with this one:

.button
  border-width: 1px
  border-style: solid
  border-color: #00477D

// Equivalent to:
.button
  border: 1px solid #00477D

Font Shorthand

The font property shorthand is the least-commonly used of these shorthands. It’s a hard one to remember, but it will knock out 3 or 4 lines of CSS every time you use it:

$serif: Georgia, Times, serif

.site-header
  font: 2.25em $serif
  font: 2.25em/1.6 $serif
  font: bold 2.25em/1.6 $serif
  font: 700 italic 2.25em/1.6 $serif

  // Final form is equivalent to:
  font-weight: 700
  font-style: italic
  font-size: 2.25em
  line-height: 1.6
  font-family: $serif

You can omit properties from the font shorthand as shown above. However, it must have at least the font size and the font family declarations in order to work at all. If you’re not familiar with numeric font weights, 700 means bold.

Background

The background shorthand is also rarely used because it is hard to remember. The four most commonly used background properties are color, image, repeat, and position. Specify them in that order:

body
  background: blue url(...) repeat-x center center

  // Equivalent to:
  background-color: blue
  background-image: url(...)
  background-repeat: repeat-x
  background-position: center center

That being said, you should not use these shorthands if you only want to specify the top margin or the font size. CSS shorthands are most effective when most of their properties are utilized.

7. Alphabetize for easy comprehension

People have all sorts of different methods they use to order CSS properties. I tend to favor alphabetical order by property name, even if related properties like top and left are separated. You can take a look and pick which one you like yourself:

// Alphabetical order
.search-button
  @extend .button
  background-color: $blue
  border: none
  color: $white
  cursor: pointer
  font: 300 1em/1.25 $sans-serif
  left: 1em
  padding: 4px 9px
  position: relative
  text-align: center
  text-decoration: none
  top: 0.25em
  +vendor-prefix(border-radius, 0.25em)
  vertical-align: middle

Others prefer to order properties by theme:

// Thematic order
.search-button
  @extend .button

  // Box model
  border: none
  padding: 4px 9px
  +vendor-prefix(border-radius, 0.25em)

  // Positioning
  position: relative
  left: 1em
  top: 0.25em

  // Internal styles
  color: $white
  cursor: pointer
  font: 300 1em/1.25 $sans-serif
  text-align: center
  text-decoration: none
  vertical-align: middle

However you decide to order your CSS properties, the effect is the same. All of the tips listed here are purely for the programmer’s convenience. Presentation can always be achieved without CSS preprocessors or shorthands, but using these tools will make your CSS much more readable and maintainable.

Tumblr’s Phishing Protection Code
Sat, 20 Apr 2013

At the top of every Tumblr user’s blog, there’s a piece of JavaScript inserted by Tumblr itself. In general, Tumblr is very generous about the control they give you over your blog’s appearance. They don’t insert any advertisements or enforce any global content other than a Quantcast visitor analytics script, follow/dashboard controls on the upper right, and this script. I’ve posted the first few lines here:

(function(){var a=translated_warning_string;var b=function(d){d=d||window.event;var c=d.target||d.srcElement;if(c.type=="password"){if(confirm(a)){b=function(){}}else{c.value="";return false}}};setInterval(function(){if(typeof document.addEventListener!="undefined"){document.addEventListener("keypress",b,false)}},0)})();

This isn’t particularly clever or difficult to understand, but it does utilize several good ideas in JavaScript programming, so I’d like to go over it. First of all, you’ll notice that this extract is a single line of code that’s surrounded by (function() { ... })();. JavaScript treats functions as first-class citizens. They can be declared, reassigned, and called inline. There are several advantages of first-class functions in any dynamically-typed language:

  • The ability to create anonymous functions is particularly helpful when you don’t want to clog up your global namespace with function names that are only used in one part of your code.
  • They can be created dynamically using what’s known as a closure in many languages. Closures take their variables from the environment frame in which they were created, so you can generate them on-the-fly in a loop, in an event handler, or as part of a callback.
  • Even if you don’t need or want any of the fancy benefits listed so far, it’s useful that local variables declared in first-class functions are destroyed when the function quits. That way, you can use names like a or _ without polluting your global namespace.

I’ve reformatted the statements inside the function here:

var a = translated_warning_string;
var b = function(d){
  d = d||window.event;
  var c = d.target||d.srcElement;
  if (c.type == "password"){
    if(confirm(a)){
      b = function(){}
    } else {
      c.value = "";
      return false
    }
  }
};
setInterval(function(){
  if (typeof document.addEventListener != "undefined"){
    document.addEventListener("keypress",b,false)
  }
}, 0)

The first line references the global translated_warning_string variable that is declared in the blog’s HTML itself. Its contents should hint at the goal of this code. (Although it looks like not all languages are supported yet.) Assigning the variable to a means that it can be reused without making the code too long, but it seems that the variable is only actually referenced once here.

Variable b gets assigned to a function in the same way we manipulated functions earlier. The d = d||window.event; statement means “if d is falsy, use window.event instead; otherwise, leave it alone.” It is a handy way to specify a default value in variable assignment when you’re uncertain whether the preferred value will work or not. In this case, d defaults to window.event, which is a hack for older versions of Internet Explorer. (more on this later)

I was kidding when I said later. Here’s more on it now. Skip ahead a few lines and peek over at this part of the code:

if (typeof document.addEventListener != "undefined"){
  document.addEventListener("keypress",b,false)
}

The function we declared earlier, b, is used as the event listener for the document’s keypress event. It is triggered whenever someone presses a key while on the webpage (actually, it’s a little more complicated than this). Event listeners are functions that are called with a single parameter, an event object, which contains information about the event that triggered the call. This is not the case in older versions of Internet Explorer, hence the d = d||window.event; statement earlier.

When a keypress happens, the event actually travels down the tree to the target element first, from the root to the leaf (document → html → body → …). It gives each of these elements a chance to “capture” the event and handle it before it reaches its destination. This is known as event capturing, and is specified by the third argument to EventTarget.addEventListener. The other more commonly used and default behavior is to let the event bubble up if the destination doesn’t handle it. The event will bubble up until one of the destination’s ancestors catches it and stops the bubbling.

The code examined here chooses not to capture the event on the way down. (If it did, then all events would have to go through this handler.) The trade-off to not interfering with events that already have keypress handlers is that the behavior can be easily overridden by rogue sites (although this is easily detectable).

Back to where we were, we now see that variable d is an event, so c is its EventTarget, and also a DOM-tree node. The variable-assignment-with-default trick is used again here.

if (c.type == "password"){
  if(confirm(a)){
    b = function(){}
  } else {
    c.value = "";
    return false
  }
}

If the event’s source is a password input, raise a confirmation dialog with the text stored in a, the translated confirmation string. If the confirmation passes, then b is set to an empty anonymous function, function(){}. Recall that b was previously defined to be another anonymous function and was also used in our event listener. Clearing this variable after the first successful confirmation prevents the script from prompting the user more than once, which is a good idea.

If the user rejects the confirmation, then the password input is cleared and the keyevent is halted with return false. Note that JavaScript dictates that the last semicolon in a block is optional, although it is usually good style to include it anyway.

Finally, note that the event listener is wrapped inside a setInterval(function() { ... }, 0). setInterval is a built-in JavaScript function that runs a piece of code repeatedly. The second parameter is the delay between calls, specified in milliseconds. In this case, 0 milliseconds is used, but the web browser imposes a lower limit on this delay.

The function’s contents check the type of document.addEventListener. This function is part of the DOM that is set up near the beginning of every page load. Once the event infrastructure is available, the listener is attached. A more common way to achieve the same effect is to attach an event listener to the window object’s onload event, usually through one of many JavaScript libraries, although this solution is appropriate in the context of this script.

Anti-fraud Detection in Best of Berkeley
Thu, 11 Apr 2013

Near the end of Spring Break, I helped build the back-end for the Daily Cal’s Best of Berkeley voting website. The awards are given to restaurants and organizations chosen via public voting over the period of a week. Somewhere during development, we decided it’d be more effective to implement fraud detection rather than prevention: prevention would only encourage evasion and resistance, while with detection, we could sit idle and track fraud as it occurred. It made for some interesting statistics too.

[Graph: frequency of submissions in which only the suspicious candidate was selected, over time]

This first one’s simple. One of the candidates earned a highly suspicious number of submissions where they were the only choice selected in any of the 39 categories and 195 total candidates. Our fraud-recognition system aggregated sets of submission choices and raised alerts when abnormal numbers of identical choices appeared. The graph shows the frequency of submissions where only this candidate was selected and demonstrates the regularity and nonrandomness of these fraudulent entries.

Combined with data from tracking cookies and user agents, it’s safe to say that these submissions could be cast out.

[Graph: distribution of elapsed times between successive submissions]

The system also calculated and analyzed the elapsed time between successive submissions. It alerted both for abnormal volume, when a large number of submissions were received in a small time, and for abnormal regularity, when submissions came in at abnormally regular intervals. From the graph, it looks like it takes about 10.5 to 12 seconds for the whole process: reload, scroll, check vote, scroll, submit.

The calculations for this alert were a bit trickier than I expected. At first, I thought of using a queue where old entries would be discarded:

from collections import deque
from datetime import timedelta

window = deque()
for submission in sorted(submissions, key=lambda s: s.time):
    window.append(submission)
    # discard entries older than five minutes
    while window and window[0].time < submission.time - timedelta(minutes=5):
        window.popleft()
    if len(window) > THRESHOLD:
        send_alert()
        window.clear()

This doesn’t work very well. Each time the queue length exceeded the threshold, it would flush the queue and raise yet another alert, so a sustained burst of submissions produced a stream of duplicate notifications. So, I added a minimum time-out before another alert would be raised.

window = deque()
last_alert = None
for submission in sorted(submissions, key=lambda s: s.time):
    window.append(submission)
    while window and window[0].time < submission.time - timedelta(minutes=5):
        window.popleft()
    # ALERT_TIMEOUT is the minimum quiet period between alerts
    if len(window) > THRESHOLD and (
            last_alert is None or submission.time - last_alert > ALERT_TIMEOUT):
        send_alert()
        window.clear()
        last_alert = submission.time

The regularity detector performed a similar task, except it would store each of the time differences between successive submissions in a list, sort it, and then run the same check with a much smaller window (around 0.3 seconds). If the submission times were truly random, the bars in the graph above should be more or less level, but this is hardly the case. After these fraudulent entries were removed, this particular candidate was left with a paltry 70-some votes, about 5% of its pre-filtered count.
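
In Python, the regularity check looks something like this (a sketch; send_alert and the threshold are placeholders, as above):

GAP_WINDOW = 0.3  # seconds; gaps clustered this tightly look automated

def regularity_alerts(timestamps, threshold):
    # run the same sliding-window check, but over the sorted gaps
    # between successive submissions rather than the submissions themselves
    gaps = sorted(b - a for a, b in zip(timestamps, timestamps[1:]))
    window = []
    for gap in gaps:
        window.append(gap)
        window = [g for g in window if gap - g <= GAP_WINDOW]
        if len(window) > threshold:
            send_alert(window)
            window = []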

Generating on-the-fly filler text in PHP
Tue, 12 Mar 2013

Update

I’ve updated the code and text of this post to reflect the latest version of the code.

For one of the projects I’ve been working on recently, I needed huge amounts of filler text (we’re talking about a megabyte) for lorem ipsum placeholder copy. Copy is the journalistic term for plain ol’ text in an article or an advertisement, in contrast with pictures or design elements. When you design for type, it’s often helpful to have text that looks like it could be legitimate writing instead of a single word repeated or purely random characters. From this rose the art of lorem ipsum, which intelligently crafts words with pronounceable syllables and varying lengths.

It’s a rather complicated process to generate high-quality lorem ipsum, but the following will do an acceptable job in far fewer lines of code.

/**
 * Helper function that generates filler text
 *
 * @param $type either 'title' or 'post'
 */
protected function filler_text($type) {
  
  /**
   * Source text for lipsum
   */
  static $lipsum_source = array(
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit",
    "Aliquam <a href='/'>sodales blandit felis</a>, vitae imperdiet nisl",
    ...
    "Quisque ullamcorper aliquet ante, sit amet molestie magna auctor nec",
  );
  if ($type == 'title') {
    // Titles average 3 to 6 words
    $length = rand(3, 6);
    $ret = "";
    for ($i = 0; $i < $length; $i++) {
      if (!$i) {
        $ret = ucwords($this->array_random_value(explode(" ", strip_tags($this->array_random_value($lipsum_source))))) . ' ';
      } else {
        $ret .= strtolower($this->array_random_value(explode(" ", strip_tags($this->array_random_value($lipsum_source))))) . ' ';
      }
    }
    return trim($ret);
  } else if ($type == 'post') {
    $ret = "";
    $order = array('paragraph');
    $order_length = rand(12, 19);
    for ($n = 0; $n < $order_length; $n++) {
      $choice = rand(0, 8);
      switch ($choice) {
      case 0: $order[] = 'list'; break;
      case 1: $order[] = 'image'; break;
      case 2: $order[] = 'blockquote'; break;
      default: $order[] = 'paragraph'; break;
      }
    }
    for ($n = 0; $n < count($order); $n++) {
      switch ($order[$n]) {
      case 'paragraph':
        $length = rand(2,7);
        $ret .= '<p>';
        for ($i = 0; $i < $length; $i++) {
          if ($i) $ret .= ' ';
          $ret .= $this->array_random_value($lipsum_source) . '.';
        }
        $ret .= "</p>\n";
        break;
      case 'image':
        $ret .= "<p><img src='http://placehold.it/900x580' /></p>\n";
        break;
      case 'list':
        $tag = (rand(0, 1)) ? 'ul' : 'ol';
        $ret .= "<$tag>\n";
        $length = rand(2,5);
        for ($i = 0; $i < $length; $i++) {
          $ret .= "<li>" . $this->array_random_value($lipsum_source) . "</li>\n";
        }
        $ret .= "</$tag>\n";
        break;
      case 'blockquote':
        $length = rand(2,7);
        $ret .= '<blockquote><p>';
        for ($i = 0; $i < $length; $i++) {
          if ($i) $ret .= ' ';
          $ret .= $this->array_random_value($lipsum_source) . '.';
        }
        $ret .= "</p></blockquote>\n";
        break;
      }
    }
    
    return $ret;
  }
}

First of all, you’ll need some filler text to use as a seed for this function. Seed is a heavily used term in computer science: it usually refers to an initial value used in some sort of deterministic pseudo-random number generator (or PRNG). Most programming languages have built-in libraries that provide randomness generation. Many of these implementations are not actually random, but are deterministic algorithms that are pure functions of some environmental variable (usually the timestamp) and the number of calls to the algorithm preceding it: n − 1 for round n.

The seed in this case is just a bunch of pre-generated lorem ipsum that you can grab anywhere online. The heart of the code just breaks down this text into sentences and picks a number of them to fit into a new sentence.

Lorem ipsum is rarely useful as one enormous chunk of text. Most frequently, copy is broken into paragraphs of varying lengths, which is this next enhancement. The code alternates between paragraphs, images, lists, and blockquotes to keep things more interesting. You can get my post-generating plugin on WordPress.org.
