Wednesday, March 15, 2006

Clustering Objective

The fundamental problem with the datasets for the KDD competition is that the labeled data set is very small (178 rows, compared to 800K rows).
I need a mechanism for intelligently labeling a larger percentage (TBD) of the 800K rows to build a better-quality classifier. Since labeling is a manual task, I need to get the best bang for my buck.
What I propose to do is the following:
- Cluster the combined labeled + unlabeled data set
- For each cluster, determine a small set of prototypical vectors (queries) that best represent that cluster
- Label these vectors
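
A minimal sketch of the three steps above, under assumptions I have not settled on yet: k-means as the clustering algorithm, and "closest to the cluster centroid" as the definition of a prototypical vector. The function names (`kmeans`, `prototypes`) and the toy 2-D data are made up for illustration and are not the KDD data.

```python
import random
from math import dist  # Euclidean distance, Python 3.8+


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means. Returns (centroids, assignments)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points

    def nearest(p):
        return min(range(k), key=lambda c: dist(p, centroids[c]))

    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        assign = [nearest(p) for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # an empty cluster keeps its old centroid
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    # Final assignment against the final centroids.
    assign = [nearest(p) for p in points]
    return centroids, assign


def prototypes(points, centroids, assign, per_cluster=3):
    """For each cluster, return the indices of the points closest to its centroid."""
    chosen = {}
    for c, centroid in enumerate(centroids):
        members = [i for i, a in enumerate(assign) if a == c]
        members.sort(key=lambda i: dist(points[i], centroid))
        chosen[c] = members[:per_cluster]
    return chosen


# Toy data: two well-separated blobs standing in for labeled + unlabeled rows.
rng = random.Random(1)
blob_a = [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(50)]
blob_b = [(rng.gauss(10, 0.5), rng.gauss(10, 0.5)) for _ in range(50)]
points = blob_a + blob_b

centroids, assign = kmeans(points, k=2)
to_label = prototypes(points, centroids, assign, per_cluster=3)
print(to_label)  # cluster id -> row indices to hand-label
```

Only the rows listed in `to_label` would go to manual labeling. How many clusters to use and how many prototypes to take per cluster are open questions, just like the TBD percentage above.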

This is better than randomly selecting queries because:
- Each labeled prototype stands in for the large number of vectors in its cluster that are similar to it
- So a small amount of manual labeling covers, in effect, a large number of vectors (not literally - have to leave something for the classifier to do)
