Wandering Wikipedia: Datamining My Firefox History
My friends and I frequently get lost in Wikipedia. I’ll start out searching for something innocuous, like neutrino, and then suddenly I’m learning all about tanning addiction. This happens so often that my girlfriend suggested it would be fascinating to plot the various trips through Wikipedia by datamining the Firefox history database. Since she is busy with her thesis, I stole the idea and spent a few hours writing a Python script to visually display my Wikipedia wanderings.
Firefox 3 stores its history in a SQLite 3 database file in your profile directory; on OS X that database lives in
~/Library/Application Support/Firefox/Profiles/cn3x93q2.default, and the database file we’re interested in is
The history database schema is described here, but the two tables we’re interested in are moz_places and moz_historyvisits. The first, moz_places, has the URL, title and other data related to the links we’ve visited. What it doesn’t have is information on the paths we have traversed to get to the URLs in moz_places; that information is in moz_historyvisits. moz_historyvisits has internal references which let us find out where we’ve been (the column from_visit) and a reference to the moz_places table via the place_id column.
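To make the relationship between the two tables concrete, here is a minimal sketch in Python using the standard sqlite3 module. The schema below is a simplified stand-in containing only the columns discussed in this post (the real Firefox tables have more), and table_columns is a hypothetical helper name, not part of any Firefox tooling.

```python
import sqlite3

# Simplified stand-in for the two tables; the real Firefox schema has more columns.
SCHEMA = """
CREATE TABLE moz_places (id INTEGER PRIMARY KEY, url TEXT, title TEXT);
CREATE TABLE moz_historyvisits (
    id INTEGER PRIMARY KEY,
    from_visit INTEGER,   -- id of the visit we came from
    place_id INTEGER      -- id of the matching row in moz_places
);
"""

def table_columns(conn, table):
    """Return the column names of a table via SQLite's PRAGMA table_info."""
    # Each row is (cid, name, type, notnull, dflt_value, pk); we keep the name.
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return [row[1] for row in rows]

# An in-memory database keeps the sketch self-contained. To look at real data,
# connect to a copy of the history file from your profile directory instead
# (assumed here to be named places.sqlite).
conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

for table in ("moz_places", "moz_historyvisits"):
    print(table, table_columns(conn, table))
```

Working on a copy of the database rather than the live file is probably the safer choice, since Firefox may hold a lock on it while running.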
A very talented data architect I know helped write (entirely wrote is maybe more accurate) this query:
SELECT
    curr.id, curr.url, curr.title,
    prev.id, prev.url, prev.title
FROM
    moz_places curr, moz_places prev,
    moz_historyvisits t, moz_historyvisits frm
WHERE
    t.place_id = curr.id AND
    frm.place_id = prev.id AND
    frm.id = t.from_visit AND
    curr.url LIKE 'http://en.wikipedia.org/%' AND
    prev.url NOT LIKE 'http://en.wikipedia.org/%'
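For anyone who wants to try the query without touching their own history, the following is a rough sketch that runs it from Python against a tiny in-memory database. The sample rows, the simplified schema, and the names START_QUERY and make_demo_db are made up for illustration; the aliases t and frm on moz_historyvisits are inferred from the join conditions.

```python
import sqlite3

# The query from the post: Wikipedia pages whose referrer is not Wikipedia.
START_QUERY = """
SELECT curr.id, curr.url, curr.title, prev.id, prev.url, prev.title
FROM moz_places curr, moz_places prev,
     moz_historyvisits t, moz_historyvisits frm
WHERE t.place_id = curr.id
  AND frm.place_id = prev.id
  AND frm.id = t.from_visit
  AND curr.url LIKE 'http://en.wikipedia.org/%'
  AND prev.url NOT LIKE 'http://en.wikipedia.org/%'
"""

def make_demo_db():
    """Build an in-memory database with a simplified schema and sample visits."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE moz_places (id INTEGER PRIMARY KEY, url TEXT, title TEXT);
        CREATE TABLE moz_historyvisits (
            id INTEGER PRIMARY KEY, from_visit INTEGER, place_id INTEGER);
    """)
    conn.executemany("INSERT INTO moz_places VALUES (?, ?, ?)", [
        (1, "http://www.google.com/search?q=neutrino", "neutrino - Google Search"),
        (2, "http://en.wikipedia.org/wiki/Neutrino", "Neutrino"),
        (3, "http://en.wikipedia.org/wiki/Sun", "Sun"),
    ])
    conn.executemany("INSERT INTO moz_historyvisits VALUES (?, ?, ?)", [
        (10, 0, 1),    # the search page; no earlier visit with id 0 exists
        (11, 10, 2),   # search page -> Neutrino
        (12, 11, 3),   # Neutrino -> Sun
    ])
    return conn

conn = make_demo_db()
starts = conn.execute(START_QUERY).fetchall()
for curr_id, curr_url, curr_title, prev_id, prev_url, prev_title in starts:
    print(f"{prev_url} -> {curr_url}")
```

With this sample data only the hop from the search page to the Neutrino article qualifies, because the hop from Neutrino to Sun has a Wikipedia referrer.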
This query returns all Wikipedia URLs that are the starting points of my journeys through Wikipedia by finding all of the Wikipedia links I’ve visited whose referrer is not Wikipedia itself. With a few changes to the last clauses we can find all the URLs whose referrers are Wikipedia links (i.e., the waypoints in my travels through Wikipedia). Finally, by asking for a curr.url which is not part of Wikipedia but which has a prev.url that is Wikipedia, we know when we’ve left Wikipedia.
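The three variations differ only in whether each of the last two clauses uses LIKE or NOT LIKE, so one way to express them is a single query template. This is a sketch of that idea, not necessarily how FoxyGraph does it; the names KINDS, hops, and BASE_QUERY, the simplified schema, and the sample rows are all invented for illustration.

```python
import sqlite3

WIKI = "http://en.wikipedia.org/%"

# Same joins as the query in the post; only the two LIKE operators vary.
BASE_QUERY = """
SELECT prev.url, curr.url
FROM moz_places curr, moz_places prev,
     moz_historyvisits t, moz_historyvisits frm
WHERE t.place_id = curr.id
  AND frm.place_id = prev.id
  AND frm.id = t.from_visit
  AND curr.url {curr_op} :wiki
  AND prev.url {prev_op} :wiki
"""

# (operator applied to curr.url, operator applied to prev.url) per kind of hop.
KINDS = {
    "start":    ("LIKE", "NOT LIKE"),   # arrived at Wikipedia from elsewhere
    "waypoint": ("LIKE", "LIKE"),       # moved from one article to another
    "exit":     ("NOT LIKE", "LIKE"),   # left Wikipedia
}

def hops(conn, kind):
    """Return (previous URL, current URL) pairs for the given kind of hop."""
    curr_op, prev_op = KINDS[kind]
    sql = BASE_QUERY.format(curr_op=curr_op, prev_op=prev_op)
    return conn.execute(sql, {"wiki": WIKI}).fetchall()

# Tiny in-memory sample: search -> Neutrino -> Tanning addiction -> elsewhere.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE moz_places (id INTEGER PRIMARY KEY, url TEXT, title TEXT);
    CREATE TABLE moz_historyvisits (
        id INTEGER PRIMARY KEY, from_visit INTEGER, place_id INTEGER);
""")
conn.executemany("INSERT INTO moz_places VALUES (?, ?, ?)", [
    (1, "http://www.google.com/search?q=neutrino", "Search"),
    (2, "http://en.wikipedia.org/wiki/Neutrino", "Neutrino"),
    (3, "http://en.wikipedia.org/wiki/Tanning_addiction", "Tanning addiction"),
    (4, "http://example.com/", "Example"),
])
conn.executemany("INSERT INTO moz_historyvisits VALUES (?, ?, ?)", [
    (10, 0, 1), (11, 10, 2), (12, 11, 3), (13, 12, 4),
])

for kind in KINDS:
    print(kind, hops(conn, kind))
```

Only the operators are substituted into the SQL text, and they come from a fixed dictionary; the URL pattern itself is passed as a bound parameter.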
My script outputs graphs in Dot format and JSON. The JSON output is in a representation that is compatible with JIT, a web 2.0 AJAXy graphing library, the output of which you can see in the title graphic of this post.
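As a rough illustration of the two output formats, the sketch below turns a list of (from, to) pairs into Dot text and into a JSON adjacency list. It is not FoxyGraph's actual code. The function names are hypothetical, and the JSON field names (id, name, adjacencies) are an assumption modeled on the kind of node list JIT consumes, so check them against the JIT documentation before relying on them.

```python
import json

def dot_quote(text):
    """Quote a string for use as a Dot node ID, escaping backslashes and quotes."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'

def to_dot(edges, name="wikipedia"):
    """Render (source, target) pairs as a directed graph in Dot format."""
    lines = [f"digraph {name} {{"]
    for src, dst in edges:
        lines.append(f"  {dot_quote(src)} -> {dot_quote(dst)};")
    lines.append("}")
    return "\n".join(lines)

def to_adjacency_json(edges):
    """Render the same pairs as a JSON list of nodes with outgoing adjacencies."""
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, []).append(dst)
        adjacency.setdefault(dst, [])   # make sure leaf nodes appear too
    nodes = [
        {"id": node, "name": node, "adjacencies": targets}
        for node, targets in adjacency.items()
    ]
    return json.dumps(nodes, indent=2)

# Hypothetical sample edges using page titles as node names.
edges = [("Neutrino", "Sun"), ("Sun", "Tanning addiction")]
print(to_dot(edges))
print(to_adjacency_json(edges))
```

The Dot text can be fed to Graphviz for rendering, while the JSON is meant to be loaded by a browser-side library.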
I’ve put the script up on GitHub and called it FoxyGraph (be kind; it was written in a few hours for a specific purpose and is probably full of bugs). I’ll be updating FoxyGraph later with more interesting visualizations of my Firefox history, but for now you can see the immense clickable web 2.0 hypertree of my Wikipedia wanderings.