That's kind of what happened yesterday, when I suddenly had to know not just how many words are in my latest book (I already knew that: 128,331) but what the total vocabulary of the book is: how many unique words. OpenOffice doesn't tell you that.
Thus began what should have been a five-minute coding adventure. But if you're a coder, you know I'm talking code-talk here, because five-minute coding adventure actually means hour-long bumbling misadventure leading to multiple head-slap Aha! moments, a few of which I just have to share with you in terms of graphs. Prepare to be amazed. (Fast-forward to the graphs, if you don't speak code.)
Getting the vocab count on my book Of Two Minds wasn't that hard. First, I had OpenOffice output the book as HTML, then I opened it in Firefox and (in a Scratchpad window) typed the following code:
var text = document.body.textContent;
var wordMatch = /\b\w+('|-)?\w*\b/gi; // magic
var theWords = text.match( wordMatch );
var totalMatches = theWords.length;
var vocab = new Object; // get counts here
for ( var i = 0; i != totalMatches; i++ )
    if ( theWords[i].toLowerCase() in vocab )
        vocab[ theWords[i].toLowerCase() ]++; // bump count
    else
        vocab[ theWords[i].toLowerCase() ] = 1; // start counting
var results = []; // collect results here
for ( var k in vocab )
    results.push( [ k, vocab[k] ] );
The magic is in the regular expression /\b\w+('|-)?\w*\b/gi, which is geek for "match a word boundary, \b, followed by one or more word characters (\w+), optionally followed by an apostrophe or hyphen, followed by zero or more (*) word characters \w, ending at another word boundary \b; and do this globally, in a case-insensitive manner (gi)."
Thus, this regex will match can't as well as cant, but it won't correctly match wouldn't've, nor overly-hyphenated-expressions (it treats each of those as two words). Close enough, though, right?
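You can sanity-check what the regex does and doesn't match right in the console:
var wordMatch = /\b\w+('|-)?\w*\b/gi;
console.log( "can't and cant".match( wordMatch ) );
// -> [ "can't", "and", "cant" ]
console.log( "wouldn't've".match( wordMatch ) );
// -> [ "wouldn't", "ve" ]  (two "words," as noted above)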
The other lines of code show how to tally unique words in an Object, which in JS is just a hash table (a table that will essentially let you use 'anyArbitraryText' as a key to store and look up values).
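If you've never used an Object that way, here's the idea in miniature:
var counts = {}; // an empty Object, ready to act as a hash table
counts[ 'anyArbitraryText' ] = 1; // store a value under a string key
counts[ 'anyArbitraryText' ]++;   // look the value up and bump it
console.log( 'anyArbitraryText' in counts ); // -> true
console.log( counts[ 'anyArbitraryText' ] ); // -> 2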
Once I had the results (all unique words) stored in a results array, I decided: why not sort the array by word frequency? Bear in mind each array member is, itself, a mini-array of [ word, count ].
function comparator(a,b) {
var count1 = a[1];
var count2 = b[1];
return count2 - count1;
}
return results.sort( comparator );
All of this gets wrapped in a function, returning the sorted results array.
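Put together, the wrapped-up version might look something like this (a minimal sketch, not necessarily the exact code from my Scratchpad session):
function getVocab() {
    var text = document.body.textContent;
    var wordMatch = /\b\w+('|-)?\w*\b/gi; // magic
    var theWords = text.match( wordMatch ) || []; // null-safe if nothing matches
    var vocab = new Object; // get counts here
    for ( var i = 0; i != theWords.length; i++ ) {
        var word = theWords[i].toLowerCase();
        // hasOwnProperty is safer than the in operator, which also sees
        // inherited keys like 'constructor'
        vocab[ word ] = vocab.hasOwnProperty( word ) ? vocab[ word ] + 1 : 1;
    }
    var results = []; // collect results here
    for ( var k in vocab )
        results.push( [ k, vocab[k] ] );
    function comparator( a, b ) {
        return b[1] - a[1]; // highest count first
    }
    return results.sort( comparator );
}
var results = getVocab();
console.log( results.length + " unique words" );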
Okay, so now the fun begins, because not only can I look at results.length to get the vocabulary count (which turns out to be huge: over 14,000), but I can now make a graph of word count by rank order. First I should tell you that around 1,000 "junk words" showed up in my initial attempt at a vocab count, due to things like single-letter matches of people's initials in footnotes ("R. L. Smith" gives three "words") and other mischief. So I later modified the magic regex (inserting a \w) to force recognition of just two-letters-or-longer words, which gave a vocab count of 12,886.
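If you're wondering where the \w goes, the modified pattern would look something like this:
var wordMatch = /\b\w\w+('|-)?\w*\b/gi; // now demands at least two word characters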
The top 20 words, by the way (minus one-letter words), are:
Word | Count
the | 4990
of | 3430
to | 3237
and | 2516
in | 2344
that | 1654
it | 1487
you | 1333
for | 1325
is | 1163
was | 1048
with | 866
on | 850
or | 836
at | 725
as | 659
be | 638
but | 635
not | 623
my | 577
Geeks already know what's coming next. When you make a graph of word counts versus rank index, you find a curious phenomenon. Harvard linguistics professor George Kingsley Zipf was the first to become fixated on the weird fact that the most frequent word in English (or any other language) occurs approximately twice as often as the second most frequent word, three times as often as the third most frequent word, and so on. That implies a power law: if a word's count is proportional to 1/rank, then log(count) equals a constant minus log(rank), which means that if you plot the logarithm of a word's rank against the log of its frequency (count), you should get a straight line. Here's the plot for Of Two Minds:
Zipf plot of word frequency (count) vs. rank order for words in Of Two Minds.
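Getting the plot data out of the sorted results array takes only a few more lines, something like:
// Turn the sorted [ word, count ] array into ( log rank, log count )
// points, ready to paste into a spreadsheet or plotting tool.
var points = results.map( function ( entry, i ) {
    return [ Math.log( i + 1 ), Math.log( entry[1] ) ]; // rank is i + 1
} );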
But here's the interesting thing. Since the code was already written (and trivial to modify), I decided to extract the vocabulary of 3-letter words and see if Zipf's Law applied to just those words. The graph I got looked like this:
Zipf plot of 3-letter words in Of Two Minds.
Then I did the same thing for 5-letter words:
Zipf plot of 5-letter words in Of Two Minds.
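The modification can be as simple as filtering the existing results array by word length, along these lines:
// Restrict the tally to words of exactly n letters.
function wordsOfLength( results, n ) {
    return results.filter( function ( entry ) {
        return entry[0].length == n; // entry is [ word, count ]
    } );
}
var threeLetterWords = wordsOfLength( results, 3 );
var fiveLetterWords = wordsOfLength( results, 5 );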
At this point, drunk with power, I decided to wield my world-destroying code-wrangling abilities in a new way: I looked at the stats for my top 50 Tweets (you can download a spreadsheet of your Twitter statistics from Twitter), sorted the "impressions" numbers (to rank them), then did a plot of log impressions versus log rank, and guess what?
My top 50 Tweets of all time, plotted as log impressions vs. log rank.
A classic Zipf Law situation (again)! If anything, the relationship of Tweet traffic to rank order is more Zipf-like than the relationship of word frequency to word rank.
So at this point, I'm thinking: "Okay, I know that if I do the same trick using blog traffic, I'm apt to see a power-law relationship again." But I've also long suspected that a couple of high-traffic posts (with way higher than normal traffic) are acting as honeypots for bots and spiders. Can Zipf's Law be used forensically, to detect out-of-whack numbers in a set of traffic stats?
I did a plot of log traffic vs. log rank for the last 270 posts on this blog:
Log traffic vs. log rank for 270 posts on this blog.
Zipf (a.k.a. power-law, or Pareto) distributions are worth taking time to understand, because they underlie so many natural phenomena. Zipf-Pareto distributions accurately describe the size of craters on the moon, word frequencies in languages, sales curves for bestselling books, the intensity of solar flares, citations of scientific papers, wealth distribution, city-size distribution, and many other natural distributions. (See this excellent paper for details.) This is a fundamental phenomenon of nature, with huge forensic implications, because (think about it) if Zipf-Pareto describes bestselling-book stats, you should be able to detect whether an author is "gaming the system" by looking for books whose sales stand out to an incredible degree. In any system that follows a Zipf-Pareto law, outliers (points that don't fall on the curve) are suspect; they have potential forensic importance.
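If you wanted to automate that kind of forensic check, one crude approach is to fit a straight line to the log-log points by least squares and flag anything that strays too far from it. A sketch, reusing the points array from earlier (the 0.5 threshold is arbitrary):
// Fit y = slope * x + intercept to [ x, y ] points by least squares.
function fitLine( points ) {
    var n = points.length, sx = 0, sy = 0, sxx = 0, sxy = 0;
    for ( var i = 0; i < n; i++ ) {
        var x = points[i][0], y = points[i][1];
        sx += x; sy += y; sxx += x * x; sxy += x * y;
    }
    var slope = ( n * sxy - sx * sy ) / ( n * sxx - sx * sx );
    return { slope: slope, intercept: ( sy - slope * sx ) / n };
}
// Return the points whose vertical distance from the fitted line
// exceeds the threshold (in log units); those are the suspects.
function flagOutliers( points, threshold ) {
    var line = fitLine( points );
    return points.filter( function ( p ) {
        return Math.abs( p[1] - ( line.slope * p[0] + line.intercept ) ) > threshold;
    } );
}
var suspects = flagOutliers( points, 0.5 );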
Aren't you glad you took this Detour to Codeville with me? You can join the main highway again now. Coffee break's over. Back to standing on our heads.
Have you joined the mailing list? What are you waiting for?
☙ ❧