Clustering Objective
The fundamental problem with the datasets for the KDD competition is that the labeled data set is very small (178 rows compared to 800K rows).
I need a mechanism for intelligently labeling a larger percentage (TBD) of the 800K rows to build a better-quality classifier. Since labeling is a manual task, I need to get the best bang for my buck.
What I propose to do is the following:
- Cluster the combined labeled and unlabeled data set
- For each cluster, determine a small set of prototypical vectors (queries) that best represent that cluster
- Label these vectors
This is better than randomly selecting queries because:
- Labeling a prototypical vector has the effect of labeling the large number of vectors that are similar to it
- So a small number of manual labels in effect covers a large number of vectors (not literally - something has to be left for the classifier to do)
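The three steps proposed above can be sketched roughly as follows. This is only an illustration under my own assumptions: the post does not say which clustering algorithm to use, so plain k-means stands in here, and "prototypical" is taken to mean "closest to the cluster centroid". The function names (kmeans, pick_prototypes) and the toy two-blob data are hypothetical. For the real 800K rows a library implementation would be the sensible choice; this pure-Python version is just to show the shape of the idea.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def nearest(row, centroids):
    """Index of the centroid closest to row."""
    return min(range(len(centroids)), key=lambda c: sq_dist(row, centroids[c]))


def kmeans(rows, k, iters=20, seed=0):
    """Plain k-means (an assumed stand-in for whatever clustering is used)."""
    rng = random.Random(seed)
    centroids = [list(r) for r in rng.sample(rows, k)]
    for _ in range(iters):
        assign = [nearest(r, centroids) for r in rows]
        for c in range(k):
            members = [r for r, a in zip(rows, assign) if a == c]
            if members:  # keep the old centroid if a cluster goes empty
                centroids[c] = mean(members)
    # Final assignment so that assign matches the returned centroids.
    assign = [nearest(r, centroids) for r in rows]
    return centroids, assign


def pick_prototypes(rows, centroids, assign, per_cluster=2):
    """For each cluster, return the indices of the rows nearest its centroid.

    These are the rows that would be handed to a human for labeling.
    """
    prototypes = {}
    for c, centre in enumerate(centroids):
        members = [i for i, a in enumerate(assign) if a == c]
        members.sort(key=lambda i: sq_dist(rows[i], centre))
        prototypes[c] = members[:per_cluster]
    return prototypes


# Toy data: two well-separated blobs standing in for labeled+unlabeled rows.
rng = random.Random(42)
rows = [[rng.gauss(0, 0.5), rng.gauss(0, 0.5)] for _ in range(50)] + \
       [[rng.gauss(10, 0.5), rng.gauss(10, 0.5)] for _ in range(50)]

centroids, assign = kmeans(rows, k=2)
prototypes = pick_prototypes(rows, centroids, assign, per_cluster=3)

for cluster, indices in prototypes.items():
    print(f"cluster {cluster}: label rows {indices}")
```

Only the handful of row indices printed at the end would go to manual labeling; every other row in the same cluster is "covered" only in the loose sense that it sits near a labeled prototype, and the classifier still has to do the actual work.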