Topic: Document Clustering
In document clustering, similar documents are grouped together in smaller subsets called clusters. Ideally, documents in a cluster are very similar to each other and quite dissimilar to documents in other clusters. Prior to clustering, the labels are unknown. Only a similarity measure is used to group the documents. After clustering, the clusters are examined - one could automatically generate labels and also automatically label the clusters. In this way, you completely by pass the expensive process of having humans come up with labels and generate a labeled training set. This approach however is less than perfect (why?). In spite of this (supposed) flaw, document clustering can be a useful task as it gives additional structure to the data. The clusters formed can be informative. Document clustering can provide a means for further analysis and prediction. For e.g., if we know little about the structure of documents - clustering can be used to get a better understanding of the types of documents - features they relate by, discriminating and descriptive features, etc.
Clustering mechanisms
- k-means
- Hierarchical Clustering
- EM Algorithm
It is widely used for document clustering and is relatively efficient.
Hierarchical Clustering
Agglomerative clustering organizes documents hierarchically. Similarity is decided by averages and also by maximum and minimum distances between documents within a cluster. The major issue of these techniques are timing and computational complexity.
EM Algorithm
This is a "soft clustering" algorithm where documents belong to each cluster with a certain probability. The sum of these probabilities is 1. This method uses sophisticated statistical modeling techniques and there is "very widely used " (in my experience, it is incredibly slow and is not scalable).
Understanding the Clusters
To enable further analysis of the data, one could, for each cluster, generate a set of discriminative and descriptive keywords. There are several ways to do this - these could be the most frequently occuring words in the cluster, or words with high tf-idf measures. The keywords can be presented to a human who can then generate a label for the documents in the cluster. We could also pick exemplars from each cluster. These are the prototypes or most representative documents from each cluster. For e.g. they could be the documents that are the most similar to the mean vector of the cluster. This can make exploratory data analysis more efficient since the human expert can now view the summarized properties of a document collection.
Determining the quality of clusters
We can compute the average similarity between the documents in a cluster - and the variance in similarity. If the variance is high, then the cluster is of poor quality. The number of clusters can make a big difference in the quality of clusters. Larger the number of clusters - lower the variance - but also this will affect the usability of these clusters. Manual evaluation of cluster results is the best solution. Since we can read and comprehend documents, we can examine the clustering results and evaluate the results on our own.
References