Friday, April 14, 2006

5000 Clusters results

I purchased another gig of memory ($100) from Fry's and clustered the 799373 rows with 83595 dimensions into 5000 clusters using rbr (the k-means-style repeated-bisection method). I saw it use a max of 1.5G of RAM (I have 2.25G).
It took 50555.371 seconds (approximately 14 hours). The results were still not that great.

The top 10 clusters range from 0.627 to 0.481 in internal similarity (compared to 0.466 to 0.398 for 2500 clusters). So the results are definitely better - but still not that great.
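As a point of reference, here is a minimal sketch of how an internal-similarity number of this kind can be computed: the average pairwise cosine similarity among a cluster's members. The matrix, row indices, and libraries below are my own illustrative choices, and CLUTO's exact ISim definition may differ in small details (e.g. whether self-similarities are counted).

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import normalize

def internal_similarity(X: csr_matrix, member_rows) -> float:
    """Average pairwise cosine similarity of the rows listed in member_rows."""
    members = normalize(X[member_rows])              # unit-length rows
    n = members.shape[0]
    if n < 2:
        return 1.0
    composite = np.asarray(members.sum(axis=0)).ravel()
    # for unit vectors, the sum of cos(v_i, v_j) over all ordered pairs i != j
    # equals ||composite||^2 - n (subtracting the n self-similarities)
    pair_sum = composite.dot(composite) - n
    return pair_sum / (n * (n - 1))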

Wednesday, April 12, 2006

Picking Vectors from the Clusters (more)

Since the quality of the clusters was poor (even with 2500 clusters), I decided to try 5000 Clusters. CLUTO ran for 3 days - and did not produce any results (it crashed).
My options for getting better quality clusters are:
  1. Increase the amount of RAM on my machine by a GIG
  2. Split the dataset in "half" and cluster each half individually
For splitting, I could use the 2500 clusters to divide the data into two datasets of roughly equal size, keeping each cluster whole, and then try clustering each dataset separately. That way I reduce the likelihood of splitting a natural cluster across the two datasets (see the sketch below).
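Here is a minimal sketch of that split, assuming a hypothetical cluster_sizes dict mapping each cluster id to its number of vectors (CLUTO itself provides no such helper): whole clusters are assigned greedily to whichever half currently holds fewer vectors.

def split_clusters_in_half(cluster_sizes):
    """Greedily partition whole clusters into two roughly equal-sized datasets."""
    halves = ([], [])
    counts = [0, 0]
    # assign the largest clusters first; this keeps the two halves balanced
    for cid, size in sorted(cluster_sizes.items(), key=lambda kv: -kv[1]):
        target = 0 if counts[0] <= counts[1] else 1
        halves[target].append(cid)
        counts[target] += size
    return halves, counts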

I prefer Option 1 - since Option 2 might have unintended consequences.
Time to call Fry's.

Thursday, April 06, 2006

Picking Vectors from the Clusters

I need to pick 360 vectors using the 2500 Clusters.
I'll pick the "best" vector from the top 360 clusters.
To pick the best vector from a cluster, I will:
- calculate the centroid of the cluster
- pick the unlabeled vector that is most similar to this centroid
- in case of a draw, I'll pick one arbitrarily.
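A minimal sketch of that selection rule, assuming the data sits in a sparse matrix with unit-normalized rows, a list of row indices per cluster, and a boolean mask of labeled rows (all names here are illustrative, not from CLUTO):

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import normalize

def best_unlabeled_vector(X: csr_matrix, cluster_rows, labeled_mask) -> int:
    """Row index of the unlabeled vector most similar to the cluster centroid."""
    members = normalize(X[cluster_rows])
    centroid = np.asarray(members.mean(axis=0)).ravel()
    # dot with the centroid; proportional to cosine similarity since rows are unit length
    sims = members @ centroid
    candidates = [(s, r) for s, r in zip(sims, cluster_rows) if not labeled_mask[r]]
    # max() breaks exact ties by row index, which is as good as picking arbitrarily
    return max(candidates)[1]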

The clusters vary in size from a low of 51 to a high of about 1200. On average, there are approximately 400 vectors per cluster, so the top 360 clusters comprise approximately 144,000 vectors. Picking representatives from each of these clusters should help the transductive learner.

Clustering Results - for Option1

I clustered the 799373 items with 83595 dimensions and the results were not that great.

I did not experiment with different clustering algorithms or vary their parameters; I used values that I've used in the past. That might explain why the results were not that great.

Parameters used:
- rbr (repeated bisections followed by k-way refinement)
- cosine similarity measure
- criterion function of I2 (see CLUTO docs for more information)
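For concreteness, here is a sketch of how these parameters map onto a vcluster run, driven from Python. The flag names follow my reading of the CLUTO manual, and the matrix file name and output handling are placeholder assumptions - worth double-checking against your CLUTO version.

import subprocess

def run_vcluster(matrix_file: str, num_clusters: int) -> str:
    """Run CLUTO's vcluster and return its stdout report (ISim etc.)."""
    cmd = [
        "vcluster",
        "-clmethod=rbr",   # repeated bisections followed by k-way refinement
        "-sim=cos",        # cosine similarity
        "-crfun=i2",       # I2 criterion function
        matrix_file,
        str(num_clusters),
    ]
    # by default the assignments should land in <matrix_file>.clustering.<num_clusters>
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout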

I created clusters of different sizes - 60, 180, 720 and 2500.
2500 clusters produced the best results (not great - but much better than 60). For example, the highest internal similarity for k=60 was 0.013, which is really poor (the minimum is 0 and the maximum is 1). With 720, the top 10 ranged from 0.347 to 0.256; with 2500, the top 10 range from 0.466 to 0.398. These clusters are not that "tight", as the internal similarity is quite low - I would have liked clusters with similarity > 0.5. I think that by playing around some more with the different criterion functions and clustering methods, I can improve the results.
But for now this is good enough, and I need to move on to the next step. From a performance perspective, on my 3GHz Dell with 1.2G of RAM:
60 clusters => 614 seconds
180 clusters => 1500 seconds
720 clusters => 6800 seconds
2500 clusters => 24,000 seconds

Monday, April 03, 2006

Generating the "Option1" dataset

Category Name: Computers/Hardware
#of labeled vectors: 6
Therefore, the number of unlabeled vectors to be systematically selected = 60 * 6 = 360 (out of 799373)

Question: I need to pick 360 unlabeled vectors out of ~800K vectors - what should k (the number of clusters) be?
I'll experiment briefly with the number of clusters and based on the quality of the clusters determine what k should be.
For starters I'll try:
(a) k = 60
(b) k = 100
(c) k = 120
I believe that the number of clusters will have an adverse (possibly exponential) effect on the time it takes to cluster. So let's see how it goes.
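A small sketch of that experiment - timing a run at each candidate k to see how the clustering time actually scales (it assumes a vcluster binary on the PATH and a matrix file named matrix.mat; both are placeholders):

import subprocess
import time

for k in (60, 100, 120):
    start = time.perf_counter()
    subprocess.run(["vcluster", "-clmethod=rbr", "-sim=cos", "-crfun=i2",
                    "matrix.mat", str(k)], check=True)
    print(f"k={k}: {time.perf_counter() - start:.0f} seconds")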

More details on this week's goal

See link

The objective is:
  • to train the transductive learner on labeled and carefully chosen unlabeled vectors
  • the hypothesis is that if you choose the vectors systematically, the results will be better
The two approaches I identified for systematically selecting the vectors are Option 1 and Option 2. I will contrast them with each other and with a non-systematic random approach.
An additional parameter in this exercise is the number of unlabeled vectors to be chosen. I've identified two options again - for starters, I'll keep it simple and make it (60 * number of labeled vectors of that category).
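As a sketch of what "building the dataset" amounts to, assuming the transductive learner takes SVMlight-style input in which unlabeled examples carry a target of 0 (the learner, file format, and helper names here are my assumptions, not something specified above):

def to_line(target: int, features: dict) -> str:
    """Format one sparse example as 'target idx:val idx:val ...'."""
    return f"{target} " + " ".join(f"{i}:{v}" for i, v in sorted(features.items()))

def write_transductive_dataset(path, labeled, selected_unlabeled):
    """labeled: (label, features) pairs; selected_unlabeled: feature dicts."""
    with open(path, "w") as out:
        for label, features in labeled:              # +1 / -1 labeled vectors
            out.write(to_line(label, features) + "\n")
        for features in selected_unlabeled:          # 0 marks an unlabeled vector
            out.write(to_line(0, features) + "\n")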

I'm basically playing around with the technique and not systematically exploring it. As a result, my results might be misleading. A systematic approach would:
  1. Vary the size of the dataset
  2. Vary the categories (I'm going to perform this only for one of the 67 categories)

This week's goals

Goal:
  • Explore the transductive learner (non-systematically)
Steps:
  • Build the transductive learning datasets using Option1 and Option2 (see link)
  • Train and validate the transductive learner on both datasets
  • Compare results
The objective is to get a feel for the capabilities of the technique without spending too much time on it.