Saturday, May 28, 2005

Excellent document clustering paper

Paper on document clustering

Interesting points/concepts learnt:
  1. Vector Space Model - Map unstructured documents to a structured vector format based on the text they contain. The dimensions of the space are the complete set of terms (high-dimensional and sparse), and each document is mapped to a "concept" vector. The feature values could be raw term frequencies or other frequency variants (a small sketch follows this list).
  2. Data Pre-processing - Normalize the text before creating the concept vectors: stemming, stop-word removal, case normalization, etc. Remove words that provide little discriminating value, i.e. words with very high or very low frequency. Domain knowledge can be of great help here.
  3. Cosine Similarity measure - Good for sparse text data. It is the cosine of the angle between two document vectors (second sketch below).
  4. Clustering options - K-means and Hierarchical Clustering.
  5. Hierarchical Clustering - Group-average linkage was found to be the best of the HAC techniques.
  6. K-Means
    • Sensitive to parameter choices - k and initial start points
    • k determined using Cluster ensembles.
    • Found to perform much better than Hierarchical Clustering
  7. Cluster ensembles
    • Vary k and initial start points
    • Based on experimentation - arrive at a consensus k and start points (a rough sketch follows this list)
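
To make points 1 and 2 concrete, here is a minimal sketch (mine, not the paper's code) of mapping raw text to sparse term-frequency "concept" vectors. The stop-word list is a tiny placeholder and there is no stemming, so treat it as an illustration only.

import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "on"}  # tiny placeholder list

def tokenize(text):
    # Lowercase, keep alphabetic tokens only, drop stop words (no stemming here).
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def term_frequency_vectors(docs):
    # Map each document to a sparse term-frequency vector (term -> count).
    return [Counter(tokenize(doc)) for doc in docs]

docs = ["The cat sat on the mat.", "A dog chased the cat."]
print(term_frequency_vectors(docs))
# [Counter({'cat': 1, 'sat': 1, 'mat': 1}), Counter({'dog': 1, 'chased': 1, 'cat': 1})]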
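
And the cosine similarity of point 3 over those sparse vectors; terms with zero weight are simply absent from the dicts, which is what keeps the computation cheap on sparse text.

import math

def cosine_similarity(u, v):
    # cos(theta) = (u . v) / (|u| * |v|) for sparse term -> weight dicts.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

u = {"cat": 1, "sat": 1, "mat": 1}
v = {"dog": 1, "chased": 1, "cat": 1}
print(cosine_similarity(u, v))  # ~0.33 - the two documents share only "cat"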
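
The paper derives its consensus k and start points from cluster ensembles; I have not reproduced that procedure here. As a rough stand-in (using scikit-learn and a cosine silhouette score as the selection criterion, neither of which comes from the paper), the sketch below just varies k and the random start points and keeps the best-scoring combination.

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Toy corpus, made up for illustration.
docs = [
    "network intrusion detection from traffic anomalies",
    "anomaly detection for network intrusion alerts",
    "clustering text documents with k-means",
    "hierarchical agglomerative clustering of documents",
    "stock market price prediction models",
    "predicting stock prices with time series models",
]

# Vector space model with tf-idf weights (one of the frequency variants).
X = TfidfVectorizer(stop_words="english").fit_transform(docs)

best = {}  # k -> (best silhouette over the seeds tried, labels of that run)
for k in range(2, 5):
    for seed in range(5):  # vary the initial start points
        km = KMeans(n_clusters=k, n_init=1, random_state=seed).fit(X)
        score = silhouette_score(X, km.labels_, metric="cosine")
        if k not in best or score > best[k][0]:
            best[k] = (score, km.labels_)

chosen_k = max(best, key=lambda k: best[k][0])
print("chosen k:", chosen_k)
print("labels:", best[chosen_k][1])

A real consensus function would combine the labelings themselves (e.g. via a co-association matrix) rather than just pick the best single run, but the loop shows the "vary k and the start points" idea.
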
Challenges:
  1. High dimensional and sparse dataset
  2. Domain knowledge for data pre-processing
  3. (Related to above) Replacing words by their synonyms
Comments:
  1. Finding the natural number of clusters using cluster ensembles. I could use the technique proposed in this paper to fix the hole in my Anomaly Detection IDS approach.
  2. K-Means scalability. K-Means is fast, but it requires all the data to be in memory. Could Scalable K-Means or other online clustering techniques such as BIRCH be a "better" approach?
  3. Maintaining synonyms seems like an onerous task. Is the "Latent Semantic Indexing" approach a solution?
