Google’s N-Gram cache takes the company’s near-omniscience, and in particular its knowledge of how language shapes the way people use search engines, to a new level. Human language and human behavior (consumer behavior in particular) intersect in interesting ways on the Internet, and Google has long been the industry leader in mapping and manipulating the site of that interaction. Cultural theorists have long written ‘prophetic’ essays about the Internet as a kind of incarnation of collective memory or a representation of collective consciousness. Google’s new N-Gram cache and viewer consummates that kind of pipe dream in some interesting new ways. At present the N-Gram cache is mostly interesting on a scholarly level; it will not immediately influence the way that businesses compete for search engine rankings. But it gives us some insight into the scope of Google’s long-term ambitions, and for that reason I think it’s worth a blog post.
The N-Gram viewer lets users chart the rising and falling frequency of words as they appear in print over the last five hundred years. A search can be narrowed to any span of years within that range, so you can chart word usage from 1500 to the present or within a much shorter period. For example, how often did the word ‘Reagan’ appear in print between 1980 and 1988?
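To make the idea of "frequency within a span of years" concrete, here is a minimal Python sketch. It is only an illustration, not Google’s method: the toy corpus, the function name `yearly_frequency`, and the whitespace tokenization are all assumptions made for this example.

```python
from collections import Counter

# Hypothetical toy corpus of (publication year, text) pairs. Not real data.
TOY_CORPUS = [
    (1979, "the governor spoke about taxes"),
    (1981, "Reagan spoke and Reagan smiled"),
    (1985, "Reagan met the press"),
    (1990, "the press moved on"),
]

def yearly_frequency(corpus, word, start_year, end_year):
    """Return {year: share of that year's words that equal `word`}."""
    hits = Counter()    # occurrences of `word` per year
    totals = Counter()  # all words per year
    for year, text in corpus:
        if not start_year <= year <= end_year:
            continue  # outside the requested span of years
        tokens = text.split()  # naive whitespace tokenization (an assumption)
        totals[year] += len(tokens)
        hits[year] += sum(1 for token in tokens if token == word)
    # Skip years with no words at all to avoid dividing by zero.
    return {y: hits[y] / totals[y] for y in sorted(totals) if totals[y]}

result = yearly_frequency(TOY_CORPUS, "Reagan", 1980, 1988)
for year, share in result.items():
    print(year, share)
```

The point of dividing by the year’s total word count is that a raw count would mostly show how much more was published in later years; a share of all words printed that year is what makes different decades comparable.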
Well, certainly more frequently than it had in the preceding 500 years. No great surprise there. Use of the word ‘Reagan’ begins to pick up in the mid-1960s and spikes steeply in the 1980s. (In fact, ‘Reagan’ appeared in print more frequently than ‘Jesus Christ’ from 1980 until mid-2000. Go ahead, take a look.) The word ‘Bush’ fared better than ‘Reagan’ in the early centuries of the Early Modern era, with occasional spikes in usage. However, that probably has more to do with the word for a shrub appearing at the beginning of sentences than with certain members of the Texas oil dynasty, some of whom have been appointed or elected to various high positions in the United States government in the past 30 years.
Below I’ve called up a comparative graph of two n-grams (each a single word, or ‘1-gram’), ‘God’ and ‘money,’ spanning the past five hundred years.
As we can see, usage of the word ‘God’ in print peaked from the late 1600s through the early 1700s. At the end of the 18th century it began a precipitous decline, its frequency gradually converging almost perfectly with that of the word ‘money’ not long after the Industrial Revolution. In our present decade, ‘God’ still appears in print slightly more often than ‘money.’
The phrase ‘Angelina Jolie’ surpassed the phrase ‘War in Afghanistan’ in print in early 2002, by a margin that has grown consistently since then.
To assemble its N-Gram cache, Google scanned 10% of all books ever published. That’s one out of every ten books, dating back to the invention of the printing press. It’s an impressive sample, and it will allow Google to map the evolution of language in print in amazing ways. Presumably this will ultimately inform the ‘discernment’ of its algorithm in ways that I am not qualified even to hypothesize about.
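For readers wondering what an ‘n-gram’ actually is, here is a small Python sketch of how n-gram counts could be pulled out of scanned text. It is a simplified illustration under my own assumptions (whitespace tokenization, a made-up sample sentence, a helper I’ve named `ngrams`), not a description of Google’s actual pipeline.

```python
from collections import Counter

def ngrams(text, n):
    """Return every run of n consecutive words in `text`, in order."""
    tokens = text.split()  # naive whitespace tokenization (an assumption)
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Made-up sample sentence, for illustration only.
sample = "the war in Afghanistan and the war at home"

unigram_counts = Counter(ngrams(sample, 1))  # single words, or 1-grams
bigram_counts = Counter(ngrams(sample, 2))   # two-word phrases, or 2-grams

print(unigram_counts.most_common(2))
print(bigram_counts.most_common(1))
```

A single word like ‘God’ is a 1-gram, a name like ‘Angelina Jolie’ is a 2-gram, and ‘War in Afghanistan’ is a 3-gram. Counting each of these per year of publication is, roughly speaking, what makes the graphs above possible.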
It’s interesting that the N-Gram viewer and Google’s reading-level filter came out in the same week. At this point, the reading-level filter is not informed by data from the N-Gram cache (it is informed by a group of teachers who graded sites against specific criteria), but we can imagine that as the tool becomes more nuanced, some of the N-Gram data may begin to come into play, framing the way that Google reads websites and how we, in turn, encounter the written word.
Fun fact: Did you know that in order to harvest the parchment (sheepskin) for one copy of the first print run of the Gutenberg Bible (the first major book printed in Europe with movable type), 300 sheep had to be slaughtered?! In the intervening years, with the invention of blogs and so forth, disseminating text to an audience has become much less costly!











