About 14 months ago, The Daily Californian switched from Disqus to Facebook Comments in order to clean up the comments section of the website. But recently, the decision was reversed, and it was decided that Disqus would make a come back. I worked with one of my Online Developers to facilitate the process.
One of the stipulations of the Facebook to Disqus transition was that the Online Team would transfer a year’s worth of Facebook comments over to Disqus so that they would still remain on the website. Facebook doesn’t natively support any kind of data export feature for their commenting platform at the time of writing. I don’t expect that they will in the future. However, they provide a reliable API to their comments database. With a bit of ingenuity and persistence, we were able to successfully import over 99% of existing Facebook comments into Disqus.
Overview of the process
Disqus supports imports in the form of custom WXR files. These are WordPress-compatible XML files that contain things about posts (titles, dates, preview, id, etc.) and comments (name, date, IP, content, etc.).
The Daily Cal uses WordPress and the official Disqus WordPress plugin. The plugin identifies threads with a combination of WordPress’s internal post ID and a short permalink. Thread identifiers look like this:
var disqus_identifier = '528 https://rogerhub.com/~r/code.rogerhub/?p=528';
This one is taken right from the source code of this post (you can see for yourself).
Facebook, on the other hand, identifies threads by the content page’s URL. After all, their system was created for arbitrary content, not just blogs. The Facebook Open Graph API provides a good amount of information about comments. There’s enough information to identify multiple comments posted by a single user. There’s accurate timestamp and reply relationships. There isn’t any personal information like IP addresses, but names are provided.
The overall process looked like this:
- On dailycal.org, we needed an API endpoint to grab page URLs along with other information about threads on the site.
- For each of these URLs, we check Facebook’s Open Graph API for comments that were posted on that URL. If there are any, then we put them into our local database.
- After we are done processing comments for all of the articles ever published on dailycal.org, we can export them to Disqus-compatible WXR and upload them to Disqus.
This seems like a pretty straight-forward data hacking project, but there were a couple of issues that we ran into.
The primary API endpoint for Facebook’s Comment API is graph.facebook.com/comments/. This takes a couple of GET parameters:
- limit — A maximum number of comments to return
- ids — A comma-delimited list of article URLs
- fields — For getting comment replies
The API supports pagination with cursors (next/prev URLs), but to make things more simple, we just hardcoded a limit parameter of 9999. By default, the endpoint will return only top-level comments. To get comment replies, you need to add a fields=comments parameter to the request.
You can make a few sample requests to get a feel for the structure of the JSON data returned.
Disqus supports a well-documented XML-based import format. In our case, we decided that names were sufficient identification, although Disqus will support some social login tagging in the import format as well. The format specifies a place for the article content, which is an odd request, since article content is usually quite massive. We decided to supply just an excerpt of the article content rather than the entire page.
There were a few more precautions we took before we started development. In order to lower suspicion around our scraping activity on Facebook as well as our own web application firewall (WAF), we grabbed the user agent of a typical Google Chrome client running on Windows 7 x86-64, and used that for all of our requests. We also created a couple of test import “sites” on Disqus, since the final destination of our generated WXR was the Disqus site that we used a year ago before the switch to Facebook. There isn’t any way to copy comments or clone a Disqus site, so we didn’t want to make any mistakes.
Unicode support and escape sequences
The first version of our program had terrible support for any kind of non-ASCII character. It’s not that our commenters were all typing in Arabic or something (actually, we had a couple of comments that really were in Arabic). Smart quotes are used in plain English, and they ruin the process as well.
Facebook’s API spits out JSON data, and uses JSON’s encoding. For example, the double left quotation mark, otherwise known as lrquo in HTML/XML, is encoded as \u201c using JSON’s unicode standard.
However, the JSON data also contains HTML-encoded entities like &. (Update: It appears that Facebook has corrected this issue.)
Python’s JSON library will take care of JSON escape sequences as it decodes the string into a native dictionary. However, the script applies HTML entity decoding on that result, in case there are any straggling escape sequences left. Since Disqus’s WXR format suggests that you throw the comment content into a CDATA block, all you need to escape is the CDATA ending sequence, ]]>. You can do this by splitting it up into 2 CDATA sections (e.g. ]]]><![CDATA>)
Our API endpoint would timeout or throw an error every once in a while. To make our scraper more robust, we set up handlers for HTTP error responses. The scraper would retry an API request at least 5 times before giving up. If none of the attempts are successful, the URL is logged to the database for further debugging.
Commenter email addresses
Every commenter listed in the WXR needs an email address. Comments with the same email address will be tied together, and if somebody ever registers a Disqus account with that email address, they can claim the comments as their own (and edit/delete them). Facebook provides a unique ID that will associate multiple comments by the same person. But since Facebook comments also allows AOL and Yahoo! logins, not every comment has such an ID. Our script used the Facebook-provided ID when it was present, and generated a random one outside of Facebook’s typical range when it wasn’t. All of the emails ended with @dailycal.org, which meant that we would retain control over the registration verification, in case we needed it.
Disqus requires that comments be at least 2 characters long. There were a couple of Facebook comments that consisted of just one word: “B” or “G” or the like. These had to be filtered out before the XML export process.
We also ran into a case where a visitor commented “B<newline>” on the Facebook comments. For Disqus, this still counts as one character, since the CDATA body is stripped of leading and trailing whitespace before processing. The first version of our script didn’t strip whitespace before checking the length, so it failed to filter out this erroneous comment.
After a couple of successful trials, we created a test site in Disqus and imported a sample of the generated WXR. Everything looked good until we went and cross-referenced Disqus’s data with the Facebook comments that were displayed on the site. The comment times appeared to be around 5 hours off!
Here in California, we’re GMT -0800, so there wasn’t a clear explanation why the comment times were delayed. WXR specified GMT times, which we verified were correct. The bug seemed to be on Disqus’s end. We contacted Disqus support and posted on their developer’s mailing list, but after around a week, we decided that it would be easiest to just counteract the delay with an offset in the opposite direction.
import datetime time_offset = datetime.timedelta(0, -5*60*60) comment_date = (datetime.datetime.strptime( comment['created_time'], "%Y-%m-%dT%H:%M:%S+0000" ) + time_offset).strftime("%Y-%m-%d %H:%M:%S")
After a dozen successful test runs, we pulled the trigger and unloaded the WXR onto the live Disqus site. The import process finished within 5 minutes, and everything worked without a hitch.