Paper review: A Comparison of Document Clustering Techniques
The paper compares standard K-means, bisecting K-means, and hierarchical agglomerative clustering (HAC) for document clustering, and finds bisecting K-means to be the best of the three.
Interesting points/concepts learnt:
- Vector Space Model. Each doc is a vector d in the "term-space".
- Term Frequency representation: d = (tf1, tf2, ..., tfn), where tfi is the frequency of term i in doc d (a TF-IDF sketch appears after these notes)
- Inverse Document Frequency (IDF): discounts terms that occur in many documents, since they have little discriminating power
- Calculation of centroids (and inter-cluster similarity) for clusters built using the cosine similarity measure
- Bisecting K-means is far more efficient than standard K-means and, for document clustering, produces better-quality clusters (see the sketch after these notes)
- Several interesting references:
- "Fast and Intuitive Clustering of Web
Documents" - describes the use of Document Clustering to organize the results returned by a search engine in response to a user's query - "Hierarchically classifying documents using very few words" - generating hierarchical clusters of documents
- "On the merits of building categorization
systems by supervised clustering" - finds natural clusters in an already existing document taxonomy and then uses these clusters to produce an effective document classifier for new documents - Calculation of centroid and inter-cluster similarity - can use this in my ADS paper. Can use this as another metric to understand the nature of hacks and attempt to detect them based on any observed patterns.
- CLUTO is the perfect tool for the job - cosine similarity measure, sparse and dense matrices, high-dimensional and large datasets, and bisecting K-means
- Can I use knowledge of the centroids to make my technique more scalable? Instead of maintaining the entire dataset in memory, maintain only the centroid of each cluster (as BIRCH does)?
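Since the vector space model and TF/IDF weighting come up repeatedly above, here is a minimal sketch of turning documents into TF-IDF vectors. The tokenization (simple whitespace split) and the exact weighting formula (idf = log(N / df)) are my own assumptions for illustration, not necessarily the variant used in the paper.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Turn a list of token lists into TF-IDF dicts (term -> weight).

    Assumes idf = log(N / df); terms that appear in every document
    get weight 0, i.e. high document frequency is discounted.
    """
    n = len(docs)
    df = Counter()                     # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)              # raw term frequency within this doc
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs".split(),
]
print(tfidf_vectors(docs)[0])
```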
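And a rough sketch of bisecting K-means over unit-length TF-IDF vectors, where cosine similarity reduces to a dot product and a cluster centroid is the normalised mean of its members. This is my own illustration of the idea, not the authors' code; the split-selection rule (always split the largest cluster) and the fixed number of 2-means iterations are simplifying assumptions.

```python
import numpy as np

def bisecting_kmeans(X, k, trials=5, seed=0):
    """Bisecting K-means on rows of X (assumed L2-normalised TF-IDF vectors).

    Repeatedly splits the largest cluster with 2-means, keeping the trial
    split whose documents are most similar to their own centroids.
    """
    rng = np.random.default_rng(seed)
    clusters = [np.arange(len(X))]           # start with everything in one cluster
    while len(clusters) < k:
        # Pick the largest cluster to split (one possible strategy).
        idx = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        members = clusters.pop(idx)
        best_quality, best_split = -np.inf, None
        for _ in range(trials):
            seeds = rng.choice(members, size=2, replace=False)
            centroids = X[seeds].copy()       # seed 2-means with two random docs
            for _ in range(10):
                sims = X[members] @ centroids.T   # cosine similarity = dot product
                assign = sims.argmax(axis=1)
                for c in (0, 1):
                    pts = X[members[assign == c]]
                    if len(pts):
                        centroids[c] = pts.mean(axis=0)
                        centroids[c] /= np.linalg.norm(centroids[c])
            quality = sims.max(axis=1).mean()     # avg similarity to own centroid
            if quality > best_quality:
                best_quality = quality
                best_split = [members[assign == c] for c in (0, 1)]
        clusters.extend(best_split)
    return clusters
```

Here X would be the TF-IDF vectors from the previous sketch, stacked as rows of a matrix and normalised to unit length, so that a dot product with a centroid is exactly the cosine similarity.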