Friday, October 9, 2015

Fun with Google Ngrams

Google N-Grams is a dataset released by Google a few years back, based on Google Books. The dataset is available for multiple languages. The Google Books volumes in each language were scanned, and the frequency of each n-gram, for n = 1 to 5, was counted by publication year. For example, how many times did the word "no" occur in books from 1800 to 1900? How many times did the trigram "freedom of speech" occur in the year 2000? These counts were normalized, and the result is an approximate probability of each n-gram over time. The data is available for download, and can also be viewed via the viewer.
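To make the counting and normalization concrete, here is a minimal sketch in Python. It is not Google's actual pipeline: the function names, the whitespace tokenization, and the toy corpus are all made up for illustration, and I'm assuming that "normalized" means dividing each count by the total number of n-grams of the same length in that year.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams (as space-joined strings) in a list of tokens."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def yearly_frequencies(books, n):
    """Count n-grams per publication year and normalize the counts.

    books: iterable of (year, text) pairs (a hypothetical toy corpus).
    Returns {year: {ngram: relative frequency within that year}}.
    """
    counts = {}
    for year, text in books:
        # Naive tokenization: lowercase and split on whitespace (an assumption).
        tokens = text.lower().split()
        counts.setdefault(year, Counter()).update(ngrams(tokens, n))

    freqs = {}
    for year, counter in counts.items():
        total = sum(counter.values())
        # Assumed normalization: divide by all n-grams of this length that year.
        freqs[year] = {g: c / total for g, c in counter.items()} if total else {}
    return freqs


# Toy corpus, invented for this example.
books = [
    (1999, "freedom of speech is freedom"),
    (2000, "no no yes"),
]
unigrams = yearly_frequencies(books, 1)
trigrams = yearly_frequencies(books, 3)
print(unigrams[2000]["no"])
print(trigrams[1999]["freedom of speech"])
```

Real data would need proper tokenization and far more text per year, but the idea is the same: one count per n-gram per year, divided by that year's total.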

While this data is very helpful for training NLP models, it can also provide some cultural, historical and sociological insights... and hours of fun!

Let's warm up with a simple example, exploring how the English language has changed over time. I took a few synonyms of the word happy, and compared their usage between 1800 and 2008 (the last year available in the viewer). While the curves of merry, cheerful and delighted look pretty similar, the gay curve departs from the others in the 1980s. There's a reason for that, which the following graph helps explain:

The word gay, in the sense of homosexual, has been in use since the 1950s, and this sense has boosted its usage ever since.

The frequency of a certain term sometimes correlates with historical events. For example, while the word war is constantly in use, it was most prominent in books during and after World War I and World War II. See the peaks in the graph:



Another interesting thing to notice is that people (at least authors) are actually peaceful. Whenever they talk about wars, they also talk about peace. Look how similar the war and peace curves are:

The curve similarity may suggest that the same books that discuss war also mention peace, but since the war curve dominates the peace curve, I can only assume that war is the books' main topic and peace is only mentioned a couple of times. I hope that they say good things about peace.
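If you want to put a number on "how similar" two curves are instead of eyeballing them, one rough option is the Pearson correlation between the two yearly series. This is only a sketch: the `pearson` helper is my own, and the values below are made up for illustration, not taken from the viewer.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of the same length (at least 2 points)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if dev_x == 0 or dev_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (dev_x * dev_y)


# Hypothetical yearly frequencies (NOT real Ngram data).
war = [0.020, 0.035, 0.025, 0.045, 0.030]
peace = [0.008, 0.012, 0.009, 0.015, 0.011]
print(round(pearson(war, peace), 3))
```

A value close to 1 means the curves rise and fall together, even when one of them is much higher than the other, as is the case for war and peace.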

Searching for World Trade Center shows that it was first mentioned around its construction in 1973, then hardly discussed for a few years, and then 9/11 came along and made it a very common topic.

In some cases, the correlation with historical events comes through new words that describe concepts or products. They start appearing in books around the time of their invention or founding. For example:

Facebook was founded in 2004.
Google was founded in 1998.
Twitter was founded in 2006. However, twitter is an English word that was already in use before 2006 (and, as it seems, sometimes appeared capitalized, probably at the beginning of a sentence).
What about some older inventions?
The invention of the telephone, which is attributed to Alexander Graham Bell, in fact involved other inventors such as Antonio Meucci and Thomas Watson. Work on it started in 1844, but Bell was granted the patent for the telephone in 1876. The television was invented in 1926. Which of them had a greater influence on the world? If there is any correlation between being mentioned in books and having influence on the world, it seems like the television did. Having said that, the telephone is commonly referred to as the phone, and in recent years the term also covers the cellphone and the smartphone. So putting all these together changes the picture:
Some words were mentioned for a period of time and then just disappeared. Take for example this list of diseases, each relevant in a different era:

Apart from historical events, you can try to use the data to search for correlations between events or phenomena. Judge for yourself:

Bear in mind that correlation doesn't imply a cause-and-effect relation, nor even that a third factor affects both phenomena. Sometimes they just happen at the same time.

Just for the fun of it, can you guess which is the most important day of the week? It's Sunday! I expected that from English books, but I thought that Saturday would be more prominent in Hebrew books. That wasn't the case: the Hebrew graph was similar, with Sunday way ahead of the other days. Maybe this happens because of translated books. Happy weekend everyone!