Given a piece of text (document), can software recognize the topic(s) it discusses? If you're not convinced that such a thing could be helpful, let's just start with two facts:
- 90% of the data in the world today has been created in the last two years [1].
- Our attention span is now less than that of a goldfish [2], and we almost never read through an article [3].
Together, these mean there is far more data that might be of interest to you than you will ever read. If only there were someone who could read articles for you and decide whether they touch on your topics of interest. Well, automatic text classification can help with that.
Representation
As in the last post about word representation, we must first decide how to represent the documents. Intuitively, we want the algorithm to classify a document to a certain topic based on the document's content, as in figure 1. We need the document representation to reflect that.
The simplest and most common approach is the "bag-of-words" approach, in which each document is represented by all its words. Some words may be indicative of one topic or another, e.g. soccer, player, and match might indicate that a document is about sports, while government, prime minister, and war suggest that this is a news article. If you remember the spam filtering example from the supervised learning post, this is exactly the same: some words in the mail message may indicate that it is spam. Spam filtering can be regarded as a text classification task in which each document is a mail message, classified to one of the two topics spam / not spam.
Figure 1: Two example documents, one (Doc1) about computer science and the other (Doc2) about news.
The bag-of-words approach is simple, yet may yield nice results. However, it ignores multi-word expressions (e.g. document classification), some of which are non-compositional, i.e. the phrase's meaning differs from the meanings of its individual words (e.g. rock and roll). It also ignores word order and syntax. Other methods can take these features into account as well. In this post we will stick to the simple bag-of-words approach, which is often good enough for this task.
Methods
Choosing the method by which documents will be classified into topics depends, first of all, on the available data. If we have a sample of labeled documents (documents with known topics), we will prefer supervised classification. In supervised learning, the algorithm is given a training set of labeled instances (e.g. a document and its topic), and it tries to learn a model that predicts those labels correctly. This model is later used to predict the label (topic) of unseen instances (documents).
Otherwise, if we only have a bunch of documents without their topics (which is the more common case), we will apply unsupervised classification (clustering). Instead of trying to attach a label (topic) to each instance (document), the algorithm groups together similar documents, which seem to be about the same topic. The output of the clustering process is clusters of documents, where each cluster represents a topic.
Supervised Document Classification
Different methods to classify documents differ in the instance representation and the learning algorithms. The general scheme is to represent each document as a feature vector, use a multi-class classifier and feed it the training instances and labels (documents and their topics) to learn a model. Then, given a new unlabeled document, the classifier can predict the document's topic, with some level of success.
Following the bag-of-words approach, we may represent each document as a |V|-dimensional vector, where V is our vocabulary. Each cell represents a word, which may or may not appear in the document. There are a few variants of the cell contents; here are some common ones:
- Binary - 1 if the word occurred in the document, 0 otherwise. It is a set representation of the document's words.
- Frequency - the number of times the word occurred in the document. We can expect words that are prominent in a topic to appear frequently in that topic's documents, while other words occur only occasionally.
- TF-IDF - I might be skipping a few simpler metrics, but this one is very useful. When we count the frequency of each word, we might end up with some words that are frequent in all documents, regardless of the topic. For instance, stop words and function words (the, it, and, what...) are never indicative of a certain topic. The TF-IDF metric handles this problem by measuring the importance of a term in a certain document, given all the other documents. The TF (term frequency) component is proportional to the word's frequency in the document. The IDF (inverse document frequency) component decreases if the word is frequent across all documents. This way, a word gets a high score if it is relatively uncommon in general, but common in the specific document (see the short sketch after Figure 2).
Figure 2: The corresponding (partial) bag-of-words vectors for the example documents, with the binary (top) and frequency (bottom) variants.
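For the coders among you, here is a minimal sketch of how these three variants can be computed with scikit-learn. The two toy documents are made-up stand-ins, not the actual Doc1 and Doc2 from Figure 1:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Two made-up toy documents
docs = ["the algorithm classifies documents by their words",
        "the prime minister spoke about the war"]

# Binary: 1 if the word occurs in the document, 0 otherwise
binary_vec = CountVectorizer(binary=True)
print(binary_vec.fit_transform(docs).toarray())

# Frequency: the number of times each word occurs in the document
freq_vec = CountVectorizer()
print(freq_vec.fit_transform(docs).toarray())

# TF-IDF: high for words frequent in this document but rare in the others
tfidf_vec = TfidfVectorizer()
print(tfidf_vec.fit_transform(docs).toarray())

# The columns correspond to the vocabulary V
print(freq_vec.get_feature_names_out())
```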
It is no coincidence that this post follows the one about word representations: in word representation terminology, the frequency document vector is simply the sum of the one-hot vectors of all the words in the document. Can you see it?
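A tiny numpy sketch, with a made-up five-word vocabulary, to convince yourself:

```python
import numpy as np

# Made-up vocabulary of size |V| = 5
vocab = ["the", "player", "scored", "a", "goal"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """|V|-dimensional vector with a single 1 at the word's index."""
    vec = np.zeros(len(vocab))
    vec[word_to_idx[word]] = 1
    return vec

doc = ["the", "player", "scored", "a", "goal", "the", "player"]

# Summing the one-hot vectors of all words yields the frequency vector
freq_vector = sum(one_hot(w) for w in doc)
print(freq_vector)  # [2. 2. 1. 1. 1.]
```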
Correspondingly, when working with word embeddings (continuous dense vectors), there is a variant of bag-of-words called CBOW (continuous bag-of-words): the sum or average of the word vectors of all the words in the document. This results in a D-dimensional vector representing the entire document, where D is the dimension of the word vectors.
Once we have represented each document as a feature vector, we let the classifier train on the feature vectors and their corresponding labels. It may notice, for example, that feature vectors with high values in the cells of soccer, player, and match tend to occur with the label sports.
There is a broad choice of algorithms, some of which may perform better than others. Roughly speaking, they all try to do the same thing: the multi-class classifier learns a weight for each feature and each label. In our text classification task, it learns the salience of each word in the vocabulary for each topic. For example, soccer, player, and match will be assigned high weights and government and prime minister will be assigned low (maybe negative) weights for the sports label, and the opposite will happen for the news label. During training, the objective is to learn the weights that maximize the accuracy on the training set, i.e., that classify as many training documents as possible to the correct topic.
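To make this concrete, here is a minimal sketch (with made-up toy documents and labels) that trains a logistic regression classifier on frequency bag-of-words vectors and prints the highest-weighted words for each topic:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up miniature training set
docs = ["the player scored a goal in the match",
        "the soccer match ended in a draw",
        "the government held a press conference",
        "the prime minister discussed the war",
        "the band released a new album",
        "the concert featured a famous singer"]
labels = ["sports", "sports", "news", "news", "music", "music"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

classifier = LogisticRegression()
classifier.fit(X, labels)

# Each row of coef_ holds the per-word weights learned for one topic
vocab = vectorizer.get_feature_names_out()
for topic, weights in zip(classifier.classes_, classifier.coef_):
    top = np.argsort(weights)[-3:][::-1]
    print(topic, [vocab[i] for i in top])
```

On such a tiny toy set the exact weights are not meaningful, but the words with the highest weight per topic already look like the "salient" words described above.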
For you coders, here is the simplest proof of concept (other people may skip the code). It is a Python script that uses scikit-learn, a machine learning package for Python. I used a subset of the topics in the 20 Newsgroups dataset, and trained a simple logistic regression classifier, removing stop words and punctuation, and representing each document using CBOW with GloVe word embeddings (my personal favorite). This yields an 83% F1 score, which leaves plenty of room for improvement, but is still nice for such little effort.
```python
# document_classification.py
import nltk
import string
import codecs

from sklearn.datasets import fetch_20newsgroups
from sklearn.metrics import precision_recall_fscore_support
from sklearn.linear_model import LogisticRegression

import numpy as np


def main():

    # Load the word vectors
    words, wv = load_embeddings('glove.6B.50d.txt')
    word_to_num = {word: i for i, word in enumerate(words)}

    # Load the stop words
    with codecs.open('English_stop_words.txt', 'r', 'utf-8') as f_in:
        stop_words = set([line.strip() for line in f_in])

    # Load the datasets
    topics = ['talk.politics.guns', 'soc.religion.christian',
              'comp.windows.x', 'rec.sport.baseball', 'sci.med']
    newsgroups_train = fetch_20newsgroups(subset='train',
                                          remove=('headers',
                                                  'footers',
                                                  'quotes'),
                                          categories=topics)
    y_train = list(newsgroups_train.target)
    newsgroups_test = fetch_20newsgroups(subset='test',
                                         remove=('headers',
                                                 'footers',
                                                 'quotes'),
                                         categories=topics)
    y_test = list(newsgroups_test.target)

    # Create the feature vectors
    X_train = create_doc_vectors(newsgroups_train.data,
                                 word_to_num, wv, stop_words)
    X_test = create_doc_vectors(newsgroups_test.data,
                                word_to_num, wv, stop_words)

    # Create the classifier
    classifier = LogisticRegression()

    # Train the classifier
    classifier.fit(X_train, y_train)

    # Predict the topics of the test set and compute
    # the evaluation metrics
    y_pred = classifier.predict(X_test)
    precision, recall, f1, support = \
        precision_recall_fscore_support(y_test, y_pred,
                                        average='weighted')

    print('Precision: %.02f%%, Recall: %.02f%%, F1: %.02f%%'
          % (precision * 100, recall * 100, f1 * 100))


def create_doc_vectors(data, word_to_num, wv, stop_words):
    """
    Create a matrix in which each row is a document,
    and each document is represented as CBOW
    (the average of its word vectors)
    """
    doc_vecs = []

    for doc in data:
        tokens = nltk.word_tokenize(doc.lower())
        tokens = [w for w in tokens
                  if w not in string.punctuation
                  and w not in stop_words]
        tokens = [word_to_num.get(w, -1) for w in tokens]
        doc_vector = [wv[w] for w in tokens if w > -1]
        doc_vector = np.mean(np.vstack(doc_vector), axis=0) \
            if len(doc_vector) > 0 else np.zeros((1, 50))
        doc_vecs.append(doc_vector)

    instances = np.vstack(doc_vecs)
    return instances


def load_embeddings(embedding_file):
    """
    Load the pre-trained embeddings from a file
    """
    with codecs.open(embedding_file, 'r', 'utf-8') as f_in:
        words, vectors = \
            zip(*[line.strip().split(' ', 1) for line in f_in])
        vectors = np.loadtxt(vectors)

    # Normalize each row (word vector) to unit L2 norm
    row_norm = np.sum(np.abs(vectors)**2, axis=-1)**(1./2)
    vectors /= row_norm[:, np.newaxis]

    return words, vectors


if __name__ == '__main__':
    main()
```
As an example, I printed one of the documents and the topic that the classifier assigned to it:
rec.sport.baseball
I'd say it's an easy one, since it specifically mentions "baseball". However, nothing should be taken for granted in NLP! Anything that works is a miracle :)
Unsupervised Document Classification
In some cases, we don't have labeled data, so we can't learn the characteristics of instances with specific labels, e.g., words that tend to occur in documents about politics. Instead, we can find common characteristics among documents about the same (unknown) topic, and group documents from the (seemingly) same topic together in one cluster.
Then, when a new document is presented, we can assign it to the most suitable cluster, based, for example, on how similar it is to the documents in each cluster. This has several uses. First of all, we can let someone look at a few documents from each cluster and infer the topic represented by the cluster. We can also automatically take the most common words in the cluster and use them as "tags" that describe its topic. Another use is recommending a document to someone who has read other documents in the cluster (assuming that they are interested in the cluster's topic).
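As a minimal sketch of this idea, here is how documents could be clustered with k-means (mentioned again at the end of the post; it is not the generative model described next) and the most prominent terms of each cluster printed as rough "tags". The documents are made up for illustration:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Made-up unlabeled documents
docs = ["the player scored a goal in the final match",
        "the team won the soccer championship",
        "the government announced a new policy on the war",
        "the prime minister met foreign leaders"]

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(docs)

# Cluster into 2 topics (we must choose the number of clusters)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)

# The highest-weighted terms in each cluster centroid serve as "tags"
vocab = vectorizer.get_feature_names_out()
for c in range(2):
    top = np.argsort(kmeans.cluster_centers_[c])[-3:][::-1]
    print('cluster %d:' % c, [vocab[i] for i in top])

# A new document is assigned to the most suitable (nearest) cluster
new_doc = ["the match ended with a late goal"]
print(kmeans.predict(vectorizer.transform(new_doc)))
```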
While we don't have the true labels (topics) of the training instances (documents), we assume that each document has a certain topic, which is a hidden variable in our model (as opposed to the documents themselves, which are observed). One clustering approach is through generative models: we assume that some model generated our existing documents, and we use an algorithm that tries to reconstruct the parameters of that model, in a way that best explains our observed data.
The assumption of the generative model is that our training data was generated from probability distributions that we would like to reconstruct. The simplest such model, called a mixture of histograms, assumes that each document has a single topic. A document was generated as follows (a small sampling sketch after Figure 3 illustrates this process):
- The document's topic c was drawn from the topic distribution (probability function) P(C). For example, if we have 3 topics, news, sports, and music, with probabilities 0.5, 0.3, and 0.2 respectively, then in half of the cases we expect to "generate" a document about news.
- Given the topic c, the words in the document were sampled from a distribution of words given the topic, P(w|c). For example, if the topic is news, the probability of each word w in the vocabulary is P(w|news). Since there are many words in the vocabulary, the probability of each of them is quite small (the probabilities of all words sum up to 1). Still, words that are likely to appear in news documents, such as war and report, will have higher probabilities, e.g. P(war|news) > P(football|news). When we sample words for the generated news document, we will therefore get mostly words discussing news.
Figure 3: An illustration of the probability distribution of words given a topic.
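Here is a tiny sketch of this generative story, with made-up probabilities and a made-up vocabulary for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up topic distribution P(C)
topics = ['news', 'sports', 'music']
p_c = [0.5, 0.3, 0.2]

# Made-up word distributions P(w|c) over a tiny vocabulary
vocab = ['war', 'report', 'goal', 'match', 'album', 'concert']
p_w_given_c = {
    'news':   [0.4, 0.3, 0.1, 0.1, 0.05, 0.05],
    'sports': [0.05, 0.1, 0.4, 0.35, 0.05, 0.05],
    'music':  [0.05, 0.05, 0.05, 0.05, 0.4, 0.4],
}

# Generate a document: draw a topic from P(C), then draw its words from P(w|c)
topic = rng.choice(topics, p=p_c)
words = rng.choice(vocab, size=8, p=p_w_given_c[topic])
print(topic, list(words))
```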
The goal of the algorithm is to learn the probability functions P(c) and P(w|c) for each topic, given solely the documents themselves. These probabilities are estimated using an iterative algorithm called EM (expectation maximization). Since the topic of each document is unknown (hidden), you should first decide on the number of topics (how fine-grained would you like the clustering to be? should you distinguish between football and basketball, or is sports enough?). Then, start with a random guess of the probabilities. The algorithm works in iterations, improving the probabilities at each iteration. Each iteration is made of two steps (sketched in code after the list):
- Assign a topic to each document based on the current probabilities.
- Given the documents' topics computed in the previous step, re-estimate the probabilities using relative frequency (e.g., the probability of news is the ratio between the number of documents assigned to this topic and the total number of documents).
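Here is a minimal sketch of these two steps, using hard topic assignments as described above (a simplification of full EM, which uses soft assignments), over made-up word-count documents:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up corpus: each row holds a document's word counts over a tiny vocabulary
vocab = ['war', 'report', 'goal', 'match']
docs = np.array([[5, 3, 0, 1],   # looks like news
                 [4, 4, 1, 0],
                 [0, 1, 6, 4],   # looks like sports
                 [1, 0, 3, 5]])
n_topics = 2

# Random initial guess of P(c) and P(w|c)
p_c = np.full(n_topics, 1.0 / n_topics)
p_w_given_c = rng.random((n_topics, len(vocab)))
p_w_given_c /= p_w_given_c.sum(axis=1, keepdims=True)

for iteration in range(10):
    # Step 1: assign each document its most probable topic under the
    # current probabilities: log P(c) + sum over words of count * log P(w|c)
    log_scores = np.log(p_c) + docs @ np.log(p_w_given_c).T
    assignments = log_scores.argmax(axis=1)

    # Step 2: re-estimate the probabilities by relative frequency
    for c in range(n_topics):
        members = docs[assignments == c]
        p_c[c] = (len(members) + 1) / (len(docs) + n_topics)  # smoothed
        counts = members.sum(axis=0) + 1                      # add-one smoothing
        p_w_given_c[c] = counts / counts.sum()

print(assignments)  # the documents grouped into two (unnamed) topics
```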
This algorithm works nicely, and it can be used to solve other problems with hidden variables. In the end, each document is assigned to a certain (unnamed) topic. As I mentioned before, the meaning of such a cluster of documents can be understood by looking at the common features of several documents in it (e.g. do they all discuss music?).
That's it about text classification for now. Can you code something that automatically infers the topic of this blog post? :)
There is so much more that I didn't cover in this post, because I don't want to exhaust anyone. If you're interested in reading more, I recommend:
- The difference between generative and discriminative models - it's a general machine learning topic, not specific to document classification. In this post I gave an example of a supervised discriminative model and an unsupervised generative model, but:
- The k-means algorithm is an unsupervised discriminative model that can be used for text classification.
- Naïve Bayes is a supervised generative classifier that can be used for text classification.
- Document classification can also be used for language identification.
- And you can't write a post about text classification without mentioning LDA. It was a bit too complex for this post, but here, I mentioned it.