Wednesday, March 15, 2006

CLUTO 2.1.1 experience

See () regarding objectives of clustering...

I clustered the unlabeled and labeled data sets using CLUTO 2.1.1. I used the sparse matrix format, in which the rows are the documents and the columns are the features. In this case, the rows are the queries (approximately 800,173) and the columns are the ngram counts (approximately 83K).
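
For reference, here is a minimal sketch of writing such a matrix in CLUTO's sparse text format as I understand it: a header line with the number of rows, columns and non-zero entries, then one line per row holding (column index, value) pairs with 1-based column indices. The function name `format_cluto_sparse` and the dict-per-row input are my own choices for illustration, not part of CLUTO.

```python
def format_cluto_sparse(rows, n_cols):
    """Render rows as CLUTO sparse matrix text.

    rows   -- list of dicts mapping a 0-based feature index to its count
              (hypothetical in-memory layout, one dict per query)
    n_cols -- total number of features (ngrams)
    """
    nnz = sum(len(r) for r in rows)
    # Header: number of rows, number of columns, number of non-zero entries.
    lines = [f"{len(rows)} {n_cols} {nnz}"]
    for r in rows:
        # CLUTO column indices start at 1, so shift the 0-based indices.
        pairs = (f"{col + 1} {val}" for col, val in sorted(r.items()))
        lines.append(" ".join(pairs))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    demo = [{0: 2, 2: 1}, {1: 3}]
    print(format_cluto_sparse(demo, n_cols=3), end="")
```

With 800K rows it would make more sense to stream the lines to a file than to build one big string, but the layout is the same.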

I was surprised: CLUTO was extremely fast. The rbr method (repeated bisections followed by a global refinement step, similar to bisecting k-means) took about 615 seconds.
I wrote some code to generate an HTML report for each cluster - so I could look at the queries that were placed in each cluster.
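
A rough sketch of what such report code can look like, not the code I actually used. It assumes the clustering output has one cluster id per line, in the same order as the matrix rows, and that the queries are available as a list in that same order. The function names are made up for this example.

```python
import html
from collections import defaultdict


def group_by_cluster(queries, assignments):
    """Map each cluster id to the list of queries assigned to it."""
    clusters = defaultdict(list)
    for query, cluster_id in zip(queries, assignments):
        clusters[cluster_id].append(query)
    return dict(clusters)


def cluster_report_html(cluster_id, members):
    """Build a small HTML page listing the queries in one cluster."""
    # Escape the queries so characters like < and & display correctly.
    items = "\n".join(f"<li>{html.escape(q)}</li>" for q in members)
    return (
        f"<html><head><title>Cluster {cluster_id}</title></head>\n"
        f"<body><h1>Cluster {cluster_id} ({len(members)} queries)</h1>\n"
        f"<ul>\n{items}\n</ul></body></html>\n"
    )


if __name__ == "__main__":
    groups = group_by_cluster(["cheap flights", "flight deals", "pizza"], [0, 0, 1])
    for cid, members in sorted(groups.items()):
        with open(f"cluster_{cid}.html", "w", encoding="utf-8") as f:
            f.write(cluster_report_html(cid, members))
```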

I need to perform several experiments, varying the following parameters:
- number of clusters
- similarity measure (cosine or correlation)
- tf-idf versus feature count
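
One way to organize these runs is to generate the full grid of settings up front. The sketch below builds one `vcluster` command line per combination. The option names (`-clmethod`, `-sim`, `-colmodel`) and their values are as I remember them from the CLUTO manual, so treat them as assumptions and check them against the manual before running anything. The matrix file name is a placeholder.

```python
from itertools import product


def experiment_commands(matrix_file, cluster_counts):
    """Return one vcluster argument list per parameter combination."""
    commands = []
    for k, sim, colmodel in product(
        cluster_counts,
        ("cos", "corr"),   # similarity measure: cosine or correlation
        ("idf", "none"),   # idf column scaling versus raw feature counts
    ):
        commands.append([
            "vcluster",              # assumed to be on the PATH
            "-clmethod=rbr",
            f"-sim={sim}",
            f"-colmodel={colmodel}",
            matrix_file,
            str(k),
        ])
    return commands


if __name__ == "__main__":
    for cmd in experiment_commands("queries.mat", [50, 100]):
        print(" ".join(cmd))
```

Each list can be handed to `subprocess.run` as is, which also makes it easy to log exactly which settings produced which output.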

CLUTO provides the following information regarding each cluster:
- internal sim
- internal stdev
- external sim
- external stdev
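
To make these four numbers concrete, here is a small sketch of how I read them: for every object in a cluster, take its average cosine similarity to the objects inside the cluster (internal) and to the objects outside it (external), then report the mean and standard deviation of those per-object averages. This is my interpretation, not CLUTO's code. In particular, the internal average below includes each object's similarity to itself, and CLUTO's exact convention may differ.

```python
import math
from statistics import mean, pstdev


def cosine(u, v):
    """Cosine similarity of two dense vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cluster_similarity_stats(vectors, assignments, cluster_id):
    """Internal/external similarity mean and stdev for one cluster."""
    inside = [v for v, c in zip(vectors, assignments) if c == cluster_id]
    outside = [v for v, c in zip(vectors, assignments) if c != cluster_id]
    if not inside:
        raise ValueError(f"cluster {cluster_id} has no members")
    # Per-object average similarity to the cluster (self-similarity included).
    internal = [mean(cosine(v, w) for w in inside) for v in inside]
    # Per-object average similarity to everything outside the cluster.
    external = [
        mean(cosine(v, w) for w in outside) if outside else 0.0
        for v in inside
    ]
    return {
        "internal_sim": mean(internal),
        "internal_stdev": pstdev(internal),
        "external_sim": mean(external),
        "external_stdev": pstdev(external),
    }
```

This brute-force version is quadratic in the cluster size, so it is only meant for sanity-checking small samples against CLUTO's output.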

In addition, I need to add:
- class purity* (see)
- # of distinct labels per cluster
- # of labeled data set elements per cluster
- ratio
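
A sketch of how these per-cluster numbers could be computed from the cluster assignments plus the known labels. Two assumptions here are mine: purity is taken as the share of a cluster's labeled elements that carry its most common label, and "ratio" is read as labeled elements divided by cluster size. Unlabeled elements are marked with `None`.

```python
from collections import Counter


def label_stats(assignments, labels):
    """Per-cluster label statistics.

    assignments -- cluster id for each element
    labels      -- class label for each element, or None if unlabeled
    """
    sizes = Counter(assignments)
    label_counts = {}
    for cluster_id, label in zip(assignments, labels):
        if label is not None:
            label_counts.setdefault(cluster_id, Counter())[label] += 1

    stats = {}
    for cluster_id, size in sizes.items():
        counts = label_counts.get(cluster_id, Counter())
        n_labeled = sum(counts.values())
        stats[cluster_id] = {
            "size": size,
            "labeled": n_labeled,
            "distinct_labels": len(counts),
            # Purity is undefined when the cluster has no labeled elements.
            "purity": max(counts.values()) / n_labeled if n_labeled else None,
            # Assumed meaning of "ratio": labeled elements / cluster size.
            "labeled_ratio": n_labeled / size,
        }
    return stats


if __name__ == "__main__":
    result = label_stats([0, 0, 0, 1, 1], ["a", "a", "b", None, "c"])
    for cluster_id in sorted(result):
        print(cluster_id, result[cluster_id])
```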
