During the Watson-related madness earlier this year I wrote a script to put all of the Jeopardy! game data from J! Archive into a SQL database (they have no API so I parsed the very-broken HTML for each game). I never ended up doing anything with the data, but there’s been renewed interest around the house in this data so I took some time to make a word cloud of the Jeopardy! categories.
My friends and I frequently get lost in Wikipedia. I’ll start out searching for something innocuous, like neutrino, and then suddenly I’m learning all about tanning addiction. This happens so often that my girlfriend suggested it would be fascinating to plot these trips through Wikipedia by data mining the Firefox history database. Since she is busy with her thesis, I stole the idea and spent a few hours writing a Python script to visually display my Wikipedia wanderings.
Firefox 3 stores its history in a SQLite 3 database file in your profile directory. On OS X, profiles live under ~/Library/Application Support/Firefox/Profiles/ (mine is cn3x93q2.default; the random prefix will be different on your machine), and the database file we’re interested in is places.sqlite.
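To make the file location concrete, here is a minimal sketch of finding and opening places.sqlite from Python. The function names find_places_databases and open_history_copy are made up for this illustration, and copying the file first is my own precaution (Firefox may hold a lock on the live database while it is running), not something the original script is known to do.

```python
import glob
import os
import shutil
import sqlite3
import tempfile


def find_places_databases(profiles_dir):
    """Return every places.sqlite found one level below profiles_dir."""
    pattern = os.path.join(os.path.expanduser(profiles_dir), "*", "places.sqlite")
    return sorted(glob.glob(pattern))


def open_history_copy(db_path):
    """Copy the database to a temporary directory and open the copy.

    Working on a copy avoids "database is locked" errors if Firefox
    is running, and keeps the real history file untouched.
    """
    tmp_dir = tempfile.mkdtemp()
    copy_path = os.path.join(tmp_dir, "places.sqlite")
    shutil.copy(db_path, copy_path)
    return sqlite3.connect(copy_path)


if __name__ == "__main__":
    # OS X location from the post; other platforms keep profiles elsewhere.
    profiles = "~/Library/Application Support/Firefox/Profiles"
    for path in find_places_databases(profiles):
        print(path)
```

The glob over the profile directory sidesteps the random profile prefix, so nothing here depends on a particular profile name.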
The history database schema is described here, but the two tables we’re interested in are moz_places and moz_historyvisits. The first, moz_places, has the URL, title and other data related to the links we’ve visited. What it doesn’t have is information on the paths we have traversed to get to the URLs in moz_places; that information is in moz_historyvisits. Each row in moz_historyvisits has an internal reference to the visit we came from (the column from_visit) and a reference to the moz_places table via the place_id column.
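As a rough illustration of how those two references fit together, the sketch below walks from_visit backwards from one visit to reconstruct the trail that led to it. The helper name trail_to is hypothetical, and it only relies on the columns mentioned above (id, from_visit and place_id in moz_historyvisits; id and url in moz_places).

```python
import sqlite3

# Look up one visit: the URL it landed on and the visit it came from.
VISIT_QUERY = """
SELECT p.url, v.from_visit
FROM moz_historyvisits v
JOIN moz_places p ON p.id = v.place_id
WHERE v.id = ?
"""


def trail_to(conn, visit_id):
    """Follow from_visit links backwards; return the URLs oldest first."""
    urls = []
    seen = set()  # guards against a cycle in damaged data
    while visit_id and visit_id not in seen:
        seen.add(visit_id)
        row = conn.execute(VISIT_QUERY, (visit_id,)).fetchone()
        if row is None:
            break  # the referring visit is no longer in the history
        url, visit_id = row
        urls.append(url)
    urls.reverse()
    return urls
```

The loop stops when from_visit is empty or zero, which is how a visit with no recorded referrer shows up, or when the referring visit has been removed from the history.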
A very talented data architect I know helped write this query (entirely wrote is maybe more accurate):
SELECT
curr.id, curr.url, curr.title,
prev.id, prev.url, prev.title
FROM
moz_places curr, moz_places prev,
-- t is the visit itself; frm is the visit it was reached from
moz_historyvisits t, moz_historyvisits frm
WHERE
t.place_id = curr.id AND
frm.place_id = prev.id AND
frm.id = t.from_visit AND
curr.url LIKE 'http://en.wikipedia.org/%' AND
prev.url NOT LIKE 'http://en.wikipedia.org/%'
This query returns all Wikipedia URLs that are the starting points of my journeys through Wikipedia by finding all of the Wikipedia links I’ve visited whose referrer is not Wikipedia itself. With a few changes to the last two clauses we can find all the URLs whose referrers are Wikipedia links (i.e., the waypoints in my travels through Wikipedia). Finally, by asking for a curr.url which is not part of Wikipedia but which has a prev.url that is, we know when we’ve left Wikipedia.
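The three variations differ only in whether each of the last two clauses uses LIKE or NOT LIKE, so they can be generated from one template. This is a sketch rather than the script’s actual code: the names BASE_QUERY, VARIANTS and wikipedia_edges are invented here, and it assumes t and frm are both aliases of moz_historyvisits, which is what the columns they reference suggest.

```python
import sqlite3

WIKI = "http://en.wikipedia.org/%"

# Only the two operators change between the three queries.
BASE_QUERY = """
SELECT curr.id, curr.url, curr.title, prev.id, prev.url, prev.title
FROM moz_places curr, moz_places prev,
     moz_historyvisits t, moz_historyvisits frm
WHERE t.place_id = curr.id
  AND frm.place_id = prev.id
  AND frm.id = t.from_visit
  AND curr.url {curr_op} ?
  AND prev.url {prev_op} ?
"""

# kind -> (operator for curr.url, operator for prev.url)
VARIANTS = {
    "starts": ("LIKE", "NOT LIKE"),   # arrived at Wikipedia from elsewhere
    "waypoints": ("LIKE", "LIKE"),    # moved from one article to another
    "exits": ("NOT LIKE", "LIKE"),    # left Wikipedia
}


def wikipedia_edges(conn, kind):
    """Run one of the three query variants and return all matching rows."""
    curr_op, prev_op = VARIANTS[kind]
    sql = BASE_QUERY.format(curr_op=curr_op, prev_op=prev_op)
    return conn.execute(sql, (WIKI, WIKI)).fetchall()
```

Calling wikipedia_edges(conn, "waypoints") returns one row per article-to-article hop, with the destination in the first three columns and the referrer in the last three. Because the join needs a matching frm row, visits with no recorded referrer will not appear in any of the three results.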
My script outputs graphs in both Dot and JSON formats. The JSON output uses a representation compatible with JIT (the JavaScript InfoVis Toolkit), a web 2.0 AJAXy graphing library, the output of which you can see in the title graphic of this post.
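For a sense of what the two outputs might look like, here is a small sketch that turns a list of (from, to) page pairs into Dot text and into a JSON adjacency list. The function names are invented for this example, and the id/name/data/adjacencies field names are my recollection of JIT’s graph format, so treat them as an assumption to check against the JIT documentation.

```python
import json


def _quote(label):
    """Quote a label for Dot, escaping backslashes and double quotes."""
    return '"%s"' % label.replace("\\", "\\\\").replace('"', '\\"')


def to_dot(edges):
    """Render (source, target) pairs as a Dot directed graph."""
    lines = ["digraph wikipedia {"]
    for src, dst in edges:
        lines.append("  %s -> %s;" % (_quote(src), _quote(dst)))
    lines.append("}")
    return "\n".join(lines)


def to_jit_json(edges):
    """Render (source, target) pairs as a JSON list of nodes.

    Each node lists the ids of the nodes it links to (assumed JIT layout).
    """
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, [])
        adjacency.setdefault(dst, [])  # make sure leaf pages get a node too
        if dst not in adjacency[src]:
            adjacency[src].append(dst)
    nodes = [
        {"id": name, "name": name, "data": {}, "adjacencies": targets}
        for name, targets in adjacency.items()
    ]
    return json.dumps(nodes, indent=2)


if __name__ == "__main__":
    trip = [("Neutrino", "Tanning addiction")]
    print(to_dot(trip))
    print(to_jit_json(trip))
```

The Dot text can be fed to Graphviz for a static image, while the JSON is meant to be loaded by the browser-side library.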