Thursday, April 06, 2006

Clustering Results - for Option1

I clustered the 799373 items with 83595 dimensions and the results were not that great.

I did not experiment with different clustering algorithms or vary their parameters; I just used values that I've used in the past. That might explain why the results were not that great.

Parameters used:
- rbr (repeated bisections followed by k-way refinement)
- cosine similarity measure
- criterion function of I2 (see CLUTO docs for more information)
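To make the I2 criterion a little more concrete, here is a minimal sketch in Python. It is not CLUTO's code; it assumes the usual definition of I2 (for each cluster, the square root of the sum of all pairwise cosine similarities, added up over the clusters), which for unit-length vectors is just the length of each cluster's summed vector. The toy vectors and the names `normalize`, `cosine` and `i2` are made up for illustration.

```python
import math

def normalize(vec):
    """Scale a sparse vector (dict of dimension -> weight) to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {d: w / norm for d, w in vec.items()} if norm else dict(vec)

def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors (a dot product)."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    return sum(w * b.get(d, 0.0) for d, w in a.items())

def i2(clusters):
    """Sum over clusters of the length of the cluster's composite vector.

    For unit vectors, that length equals sqrt(sum of all pairwise cosines),
    which is the assumed definition of I2. Higher is better.
    """
    total = 0.0
    for members in clusters:
        composite = {}
        for vec in members:
            for d, w in vec.items():
                composite[d] = composite.get(d, 0.0) + w
        total += math.sqrt(sum(w * w for w in composite.values()))
    return total

# Toy data: three items over dimensions "a", "b", "c" (hypothetical).
x0 = normalize({"a": 1.0})
x1 = normalize({"a": 1.0, "b": 1.0})
x2 = normalize({"c": 2.0})

tight = [[x0, x1], [x2]]   # similar items grouped together
mixed = [[x0, x2], [x1]]   # unrelated items grouped together
print(i2(tight) > i2(mixed))
```

Grouping the two items that share dimension "a" scores higher than grouping two unrelated items, which is the behaviour the clustering run tries to maximize.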

I ran the clustering with different numbers of clusters (k): 60, 180, 720 and 2500.
2500 clusters produced the best results (not great, but much better than 60). For example, the highest internal similarity for k=60 was 0.013, which is really poor (the min is 0 and the max is 1). With 720, the top 10 ranged from 0.347 down to 0.256. With 2500, the top 10 ranged from 0.466 down to 0.398. These clusters are not that "tight", as the internal similarity is quite low. I would have liked clusters with similarity > 0.5. I think that by playing around some more with the different criterion functions and clustering methods, I can improve the results.
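For a feel of what an internal similarity figure means, here is a small Python sketch that averages the cosine similarity over all distinct pairs of items in one cluster. This is my assumption of the general idea only; CLUTO's own figure may be computed differently (for instance, it may also count each item's similarity with itself). The function names and toy vectors are hypothetical.

```python
import math
from itertools import combinations

def unit(vec):
    """Scale a sparse vector (dict of dimension -> weight) to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {d: w / norm for d, w in vec.items()} if norm else dict(vec)

def dot(a, b):
    """Dot product of two sparse vectors; cosine similarity if both are unit length."""
    return sum(w * b.get(d, 0.0) for d, w in a.items())

def internal_similarity(members):
    """Average cosine similarity over all distinct pairs in one cluster."""
    vecs = [unit(v) for v in members]
    pairs = list(combinations(vecs, 2))
    if not pairs:
        return 1.0  # a single item is trivially "tight" (a convention chosen here)
    return sum(dot(a, b) for a, b in pairs) / len(pairs)

# A tight cluster: items share most of their weight on the same dimensions.
tight = [{"x": 3.0, "y": 1.0}, {"x": 3.0, "y": 2.0}, {"x": 2.0, "y": 1.0}]
# A loose cluster: items barely overlap.
loose = [{"x": 1.0}, {"y": 1.0}, {"z": 1.0, "x": 0.1}]

print(round(internal_similarity(tight), 3))
print(round(internal_similarity(loose), 3))
```

With non-negative weights the value stays between 0 and 1, so a score around 0.4 means the average pair of items in the cluster is still fairly far apart.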
But for now, this is good enough and I need to move on to the next step.

From a performance perspective, on my 3GHz, 1.2G Dell:
60 clusters => 614 seconds
180 clusters => 1500 seconds
720 clusters => 6800 seconds
2500 clusters => 24000 seconds
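A quick bit of arithmetic on these timings, as a Python sketch: dividing each run time by its number of clusters gives roughly 8 to 10 seconds per cluster in every run, so the cost grows about linearly with k on this data set. The 2500-cluster run works out to a little under 7 hours. The numbers are the ones listed above; the variable names are just for illustration.

```python
# Run times from the post: number of clusters -> seconds.
timings = {60: 614, 180: 1500, 720: 6800, 2500: 24000}

def seconds_per_cluster(runs):
    """Return run time divided by number of clusters for each run."""
    return {k: secs / k for k, secs in runs.items()}

per_cluster = seconds_per_cluster(timings)
for k in sorted(timings):
    hours = timings[k] / 3600.0
    print(f"k={k:5d}  {per_cluster[k]:5.1f} s per cluster  {hours:5.2f} hours total")
```

If the roughly linear trend holds, doubling the number of clusters should about double the run time, which is worth keeping in mind before trying a much larger k.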
