Excellent document clustering paper
Paper on document clustering
Interesting points/concepts learnt:
- Vector Space Model - Maps unstructured docs to a structured vector format based on the text contained in the document. The dimensions of the space are the complete set of terms (high-dimensional and sparse). Each document is mapped to a "concept" vector; the feature values can be raw term frequencies or other frequency variants.
- Data Pre-processing - Normalize the text prior to creating the concept vectors - stemming, stop-word removal, capitalization, etc. Remove words that provide little discriminating value - words with very high or very low frequency. Domain knowledge can be of great help.
- Cosine Similarity measure - Good for sparse text data. It is the cosine of the angle between 2 vectors. (A small sketch covering the vector space model, pre-processing, and cosine similarity is at the end of this post.)
- Clustering options - K-means and Hierarchical Clustering.
- Hierarchical Clustering - Group Average was found to be the best of the HAC (hierarchical agglomerative clustering) techniques.
- K-Means
  - Sensitive to parameter choices - k and the initial start points (this sensitivity is sketched at the end of this post)
  - k determined using Cluster ensembles
  - Found to be much better than Hierarchical Clustering
- Cluster ensembles
  - Vary k and the initial start points
  - Based on experimentation, come up with a consensus k and start points (one common way to build such an ensemble is sketched at the end of this post)
- High-dimensional and sparse dataset
- Domain knowledge for data pre-processing
- (Related to above) Replacing words by their synonyms
- Finding the natural number of Clusters using Cluster ensembles. I could use the proposed technique in this paper to fix the hole in my Anomaly Detection IDS Approach.
- K-Means scalability. K-Means is fast, but it requires all the data to be in memory. Could Scalable K-Means or other online Clustering techniques such as BIRCH be a "better" approach? (A small sketch trying two incremental options is at the end of this post.)
- Maintaining synonyms seems like an onerous task. Is the "Latent Semantic Indexing" Approach a solution? (A small LSI sketch is at the end of this post.)
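
A few quick sketches of the ideas above, to make them concrete. First, the vector space model, basic pre-processing, and cosine similarity in plain Python - a minimal sketch with a toy stop-word list and made-up documents, not the paper's actual pipeline.

import math
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "for", "with"}  # toy list

def tokenize(doc):
    # Lowercase, split on non-letters, drop stop words (crude normalization; no stemming here).
    return [t for t in re.findall(r"[a-z]+", doc.lower()) if t not in STOP_WORDS]

def build_vocabulary(docs, min_df=1, max_df_ratio=0.9):
    # Keep terms that are neither too rare nor too frequent across the collection.
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    n = len(docs)
    return sorted(t for t, c in df.items() if c >= min_df and c / n <= max_df_ratio)

def to_tf_vector(doc, vocab):
    # Map a document to its term-frequency "concept" vector over the shared vocabulary.
    counts = Counter(tokenize(doc))
    return [counts[t] for t in vocab]

def cosine_similarity(u, v):
    # Cosine of the angle between two vectors; well suited to sparse term vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

docs = ["clustering of text documents", "document clustering with k-means",
        "stock market prices fall"]
vocab = build_vocabulary(docs)
vectors = [to_tf_vector(d, vocab) for d in docs]
print(cosine_similarity(vectors[0], vectors[1]))  # related docs -> non-zero similarity
print(cosine_similarity(vectors[0], vectors[2]))  # unrelated docs -> 0.0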
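Second, the K-Means sensitivity point. Running with a single initialization (n_init=1) and different seeds and k values can land in different local optima, visible as different inertia values and partitions. Assumes scikit-learn is installed; the documents are made up.

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "stock market prices fall", "shares drop on market news",
    "team wins the football match", "goalkeeper saves penalty in final",
    "new phone released with faster chip", "chip maker unveils new processor",
]
X = TfidfVectorizer(stop_words="english").fit_transform(docs)

for k in (2, 3, 4):
    for seed in (0, 1, 2):
        km = KMeans(n_clusters=k, n_init=1, random_state=seed).fit(X)
        # Different seeds can give different partitions and different inertia (local optima).
        print(f"k={k} seed={seed} inertia={km.inertia_:.3f} labels={km.labels_.tolist()}")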
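Third, one common way to realize the cluster-ensemble idea (and to estimate a "natural" k): run K-Means many times while varying k and the seed, count how often each pair of documents lands in the same cluster, then cluster that co-association matrix. The paper's exact consensus procedure may differ - this is a generic evidence-accumulation sketch assuming a recent scikit-learn and numpy.

import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "stock market prices fall", "shares drop on market news",
    "team wins the football match", "goalkeeper saves penalty in final",
    "new phone released with faster chip", "chip maker unveils new processor",
]
X = TfidfVectorizer(stop_words="english").fit_transform(docs)
n = X.shape[0]

# Co-association matrix: fraction of runs in which documents i and j share a cluster.
co = np.zeros((n, n))
runs = 0
for k in (2, 3, 4):
    for seed in range(5):
        labels = KMeans(n_clusters=k, n_init=1, random_state=seed).fit_predict(X)
        co += (labels[:, None] == labels[None, :]).astype(float)
        runs += 1
co /= runs

# Cut the consensus: merge documents whose pairwise agreement is high.
consensus = AgglomerativeClustering(
    n_clusters=None, distance_threshold=0.5, metric="precomputed", linkage="average"
).fit(1.0 - co)
print("consensus labels:", consensus.labels_)
print("estimated natural k:", consensus.n_clusters_)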
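Fourth, the scalability question. Both MiniBatchKMeans and Birch in scikit-learn expose partial_fit, so documents can be vectorized and fed in chunks instead of holding the whole matrix in memory. A rough sketch with a made-up, chunked corpus - not a claim about which approach works better.

from sklearn.cluster import Birch, MiniBatchKMeans
from sklearn.feature_extraction.text import HashingVectorizer

# HashingVectorizer is stateless, so each chunk can be vectorized independently.
vectorizer = HashingVectorizer(stop_words="english", n_features=2**12)

chunks = [
    ["stock market prices fall", "shares drop on market news", "bank raises interest rates"],
    ["team wins the football match", "goalkeeper saves penalty in final", "coach praises the defence"],
]

mbk = MiniBatchKMeans(n_clusters=2, random_state=0)
birch = Birch(n_clusters=2)
for chunk in chunks:
    X = vectorizer.transform(chunk)
    mbk.partial_fit(X)    # online k-means style centroid updates
    birch.partial_fit(X)  # incrementally grows Birch's CF-tree

X_all = vectorizer.transform([d for chunk in chunks for d in chunk])
print("MiniBatchKMeans labels:", mbk.predict(X_all))
print("Birch labels:          ", birch.predict(X_all))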
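Finally, the LSI question. The usual recipe is to project TF-IDF vectors onto a small number of latent dimensions with a truncated SVD (LSA/LSI), so documents using related words can end up close together without maintaining a synonym list by hand. A tiny sketch assuming scikit-learn; on a corpus this small the effect is weak, it only shows the mechanics.

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

docs = [
    "the car engine needs repair",
    "the automobile engine was repaired at the garage",
    "my car is parked in the garage",
    "fresh bread recipe with flour and yeast",
    "bake the bread in a hot oven",
]
lsi = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    TruncatedSVD(n_components=2, random_state=0),  # latent "semantic" dimensions
    Normalizer(copy=False),                        # unit length so dot products act as cosines
)
X_lsi = lsi.fit_transform(docs)
# Pairwise similarities in the reduced space; co-occurring terms pull related docs together.
print(cosine_similarity(X_lsi))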