Saturday, May 28, 2005

Excellent document clustering paper

Paper on document clustering

Interesting points/concepts learnt:
  1. Vector Space Model - Map unstructured documents to a structured vector format based on the text they contain. The dimensions of the space are the complete set of terms (high-dimensional and sparse), and each document is mapped to a "concept" vector. The feature values could be raw term frequencies or other frequency variants (a small sketch follows this list).
  2. Data Pre-processing - Normalize the text before creating the concept vectors: stemming, stop-word removal, case normalization, etc. Remove words that provide little discriminating value, i.e. words with very high or very low frequency. Domain knowledge can be of great help here.
  3. Cosine Similarity measure - Good for sparse text data. It is the cosine of the angle between two document vectors (second sketch below).
  4. Clustering options - K-means and Hierarchical Clustering.
  5. Hierarchical Clustering - Group-average linkage was found to be the best of the HAC techniques.
  6. K-Means
    • Sensitive to parameter choices - k and initial start points
    • k determined using Cluster ensembles.
    • Found to perform much better than Hierarchical Clustering
  7. Cluster ensembles
    • Vary k and initial start points
    • Based on experimentation - arrive at a consensus k and start points (a rough sketch follows this list)
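
To make points 1 and 2 concrete, here is a minimal sketch (mine, not the paper's code) of mapping raw text to sparse term-frequency "concept" vectors. The stop-word list is a tiny placeholder and there is no stemming, so treat it as an illustration only.

import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "on"}  # tiny placeholder list

def tokenize(text):
    # Lowercase, keep alphabetic tokens only, drop stop words (no stemming here).
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def term_frequency_vectors(docs):
    # Map each document to a sparse term-frequency vector (term -> count).
    return [Counter(tokenize(doc)) for doc in docs]

docs = ["The cat sat on the mat.", "A dog chased the cat."]
print(term_frequency_vectors(docs))
# [Counter({'cat': 1, 'sat': 1, 'mat': 1}), Counter({'dog': 1, 'chased': 1, 'cat': 1})]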
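
And the cosine similarity of point 3 over those sparse vectors; terms with zero weight are simply absent from the dicts, which is what keeps the computation cheap on sparse text.

import math

def cosine_similarity(u, v):
    # cos(theta) = (u . v) / (|u| * |v|) for sparse term -> weight dicts.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

u = {"cat": 1, "sat": 1, "mat": 1}
v = {"dog": 1, "chased": 1, "cat": 1}
print(cosine_similarity(u, v))  # ~0.33 - the two documents share only "cat"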
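
The paper derives its consensus k and start points from cluster ensembles; I have not reproduced that procedure here. As a rough stand-in (using scikit-learn and a cosine silhouette score as the selection criterion, neither of which comes from the paper), the sketch below just varies k and the random start points and keeps the best-scoring combination.

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Toy corpus, made up for illustration.
docs = [
    "network intrusion detection from traffic anomalies",
    "anomaly detection for network intrusion alerts",
    "clustering text documents with k-means",
    "hierarchical agglomerative clustering of documents",
    "stock market price prediction models",
    "predicting stock prices with time series models",
]

# Vector space model with tf-idf weights (one of the frequency variants).
X = TfidfVectorizer(stop_words="english").fit_transform(docs)

best = {}  # k -> (best silhouette over the seeds tried, labels of that run)
for k in range(2, 5):
    for seed in range(5):  # vary the initial start points
        km = KMeans(n_clusters=k, n_init=1, random_state=seed).fit(X)
        score = silhouette_score(X, km.labels_, metric="cosine")
        if k not in best or score > best[k][0]:
            best[k] = (score, km.labels_)

chosen_k = max(best, key=lambda k: best[k][0])
print("chosen k:", chosen_k)
print("labels:", best[chosen_k][1])

A real consensus function would combine the labelings themselves (e.g. via a co-association matrix) rather than just pick the best single run, but the loop shows the "vary k and the start points" idea.
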
Challenges:
  1. High dimensional and sparse dataset
  2. Domain knowledge for data pre-processing
  3. (Related to above) Replacing words by their synonyms
Comments:
  1. Finding the natural number of clusters using cluster ensembles. I could use the technique proposed in this paper to fix the hole in my Anomaly Detection IDS approach.
  2. K-Means scalability. K-Means is fast, but it requires all the data to be in memory. Could Scalable K-Means or other online clustering techniques such as BIRCH be a "better" approach?
  3. Maintaining synonyms seems like an onerous task. Is the "Latent Semantic Indexing" approach a solution?
