Saturday, May 28, 2005

Paper review: A Comparison of Document Clustering Techniques

A Comparison of Document Clustering Techniques

The paper compares K-means (both standard and bisecting) with hierarchical agglomerative clustering (HAC) techniques. The authors found Bisecting K-means to be the best solution.

Interesting points/concepts learnt:
  1. Vector Space Model. Each doc is a vector d in the "term-space".
    • Term Frequency representation: Vector d = {tf1, tf2, ..., tfn} [tfi is the frequency of term i in doc d]
    • Inverse Doc Frequency (IDF): discounts words that appear in many docs, since they have little discriminating power
  2. Calculation of Centroids (and inter-cluster similarity) for Clusters created using the Cosine Similarity measure
  3. Bisecting K-Means is far more efficient than K-Means and, for document clustering, creates better-quality clusters
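
To make the three points above concrete for myself, here is a minimal sketch in plain Python (standard library only). The function names (tfidf_vectors, cosine, centroid, two_means, bisecting_kmeans) are my own and not from the paper. The bisecting step is a simplification: it always splits the largest cluster and seeds each split deterministically, whereas, as I understand it, the paper repeats each bisection several times and keeps the best split.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map tokenized docs to unit-length TF-IDF vectors (dict: term -> weight)."""
    n = len(docs)
    # Document frequency: in how many docs does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        vec = {t: f * math.log(n / df[t]) for t, f in Counter(doc).items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm > 0 else vec)
    return vectors

def dot(a, b):
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter sparse vector
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def cosine(a, b):
    denom = math.sqrt(dot(a, a) * dot(b, b))
    return dot(a, b) / denom if denom > 0 else 0.0

def centroid(vectors):
    """Component-wise mean of the vectors in a cluster."""
    total = Counter()
    for vec in vectors:
        for t, w in vec.items():
            total[t] += w
    return {t: w / len(vectors) for t, w in total.items()}

def two_means(vectors, members, iters=10):
    """Split the doc indices in `members` into two groups with basic K-means (K=2)."""
    seed_a = members[0]
    # Assumption: deterministic seeding, second seed = doc least similar to the first.
    seed_b = min(members, key=lambda i: cosine(vectors[seed_a], vectors[i]))
    centers = [vectors[seed_a], vectors[seed_b]]
    halves = [members[:1], members[1:]]  # fallback if no real split is found
    for _ in range(iters):
        groups = [[], []]
        for i in members:
            closer_to_b = cosine(vectors[i], centers[1]) > cosine(vectors[i], centers[0])
            groups[1 if closer_to_b else 0].append(i)
        if not groups[0] or not groups[1]:
            break
        halves = groups
        centers = [centroid([vectors[i] for i in g]) for g in groups]
    return halves

def bisecting_kmeans(vectors, k):
    """Repeatedly bisect the largest cluster until there are k clusters."""
    clusters = [list(range(len(vectors)))]
    while len(clusters) < k:
        clusters.sort(key=len)
        biggest = clusters.pop()
        if len(biggest) < 2:
            clusters.append(biggest)  # nothing left to split
            break
        clusters.extend(two_means(vectors, biggest))
    return clusters
```

Calling bisecting_kmeans(tfidf_vectors(docs), k) on a list of tokenized docs returns k lists of doc indices. Normalizing every doc vector to unit length keeps the cosine measure independent of document length.
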
Comments:
  1. Several interesting references:
    • "Fast and Intuitive Clustering of Web Documents" - describes the use of Document Clustering to organize the results returned by a search engine in response to a user's query
    • "Hierarchically classifying documents using very few words" - generating hierarchical clusters of documents
    • "On the merits of building categorization systems by supervised clustering" - finds natural clusters in an already existing document taxonomy and then uses these clusters to produce an effective document classifier for new documents
  2. Calculation of centroid and inter-cluster similarity - I can use this in my ADS paper, as another metric to understand the nature of hacks and to try to detect them based on any observed patterns.
  3. CLUTO is the perfect tool for the job - Cosine similarity measure, Sparse and Dense matrices, high-dimensional and large datasets, and Bisecting K-Means
  4. Can I use the knowledge of the centroid to make my technique more scalable? Instead of maintaining the entire dataset in memory, maintain only the centroid of each cluster (like BIRCH)?
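
A rough sketch of what point 4 could look like, in plain Python. Everything here is hypothetical (the ClusterSummary class, the assign_stream function, and the 0.5 threshold are my own names and values, not anything from the paper or from BIRCH). The idea is to keep only a count and a running sum per cluster, so the centroid can be recomputed at any time without holding the documents in memory. BIRCH's clustering feature also stores a sum of squares, which I leave out here.

```python
import math
from collections import Counter

def dot(a, b):
    return sum(w * b.get(t, 0.0) for t, w in a.items())

class ClusterSummary:
    """Count and running sum of a cluster's vectors; the raw vectors are not kept."""

    def __init__(self):
        self.count = 0
        self.linear_sum = Counter()

    def add(self, vec):
        self.count += 1
        for t, w in vec.items():
            self.linear_sum[t] += w

    def centroid(self):
        return {t: w / self.count for t, w in self.linear_sum.items()}

    def similarity(self, vec):
        """Cosine similarity between vec and the current centroid."""
        c = self.centroid()
        denom = math.sqrt(dot(vec, vec) * dot(c, c))
        return dot(vec, c) / denom if denom > 0 else 0.0

def assign_stream(vectors, threshold=0.5):
    """Assign each vector to the most similar summary, or open a new cluster.

    Assumption: `threshold` is an arbitrary illustrative cutoff, not a tuned value.
    """
    summaries = []
    for vec in vectors:
        best = max(summaries, key=lambda s: s.similarity(vec), default=None)
        if best is None or best.similarity(vec) < threshold:
            best = ClusterSummary()
            summaries.append(best)
        best.add(vec)
    return summaries
```

Memory then grows with the number of clusters and the vocabulary size rather than with the number of documents, at the cost of never being able to revisit an earlier assignment.
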
