Hamotzi's Data Mining Log: Project approach

My report could measure the effects of dataset augmentation using my clustering/summarization technique.

My hypothesis is that the labeled dataset is too small in relation to the unlabeled dataset. Therefore an SVM classifier trained on the small labeled dataset will perform poorly on the unlabeled dataset.
A secondary point is that my approach to "intelligent" labelling is a good one and beats a random approach.

Before we dive into the details:

I plan to use the SVM as a binary classifier, so for each query, each category is treated separately. Given a query Q, what is the probability that Q is in Category A, and what is the probability that it is not? The higher probability is the prediction. When more than 3 categories are predicted, I'll use a ranking scheme (break ties by probability or by hand) to pick the categories. When no category is predicted, well, that would suck.
There are 3 datasets in question: a training dataset of 170-odd labeled queries, a validation dataset of 800 queries, and an unlabeled dataset of (800K - 800) queries.
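
To make the per-category decision and the ranking step concrete, here is a minimal sketch. It assumes each binary classifier has already produced a probability that the query belongs to its category; the function name `pick_categories`, the 0.5 cutoff (which is just "is A" beating "is not A"), the alphabetical tie break, and the category names are all made up for illustration.

```python
def pick_categories(probs, max_categories=3):
    """Choose up to max_categories labels for one query.

    probs maps a category name to P(query is in that category),
    as reported by that category's binary classifier (assumed input).
    """
    # A category is predicted when "is category" beats "is not category",
    # which for two complementary probabilities means p > 0.5.
    predicted = [(cat, p) for cat, p in probs.items() if p > 0.5]
    # Rank by probability, highest first. Exact ties fall back to the
    # category name so the result is deterministic (a stand-in for
    # manual tie breaking).
    predicted.sort(key=lambda pair: (-pair[1], pair[0]))
    return [cat for cat, _ in predicted[:max_categories]]


# Hypothetical category names and probabilities.
example = {"Sports": 0.9, "News": 0.7, "Shopping": 0.6, "Travel": 0.55, "Games": 0.2}
print(pick_categories(example))
```

An empty list coming back is the "no category predicted" case.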

To test the main hypothesis, I will compare the results of training the classifier on the unaugmented labeled dataset, a slightly augmented dataset (1%), a somewhat more augmented dataset, and so on. What I hope to show is that the augmentation helps. I plan to augment using two different techniques: a random process and my "intelligent" process of cluster summarization. For each classification run, the metrics I'll capture are:

% of queries successfully classified in 3 categories
% of queries successfully classified in 1 or more categories
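
A rough sketch of how these two percentages might be computed. Since the unlabeled queries have no gold labels, it assumes "successfully classified" means the classifier assigned that many categories to the query; that reading, and the name `coverage_metrics`, are my assumptions here.

```python
def coverage_metrics(predictions):
    """Compute the two percentages over a list of per-query predictions.

    predictions is a list with one entry per query; each entry is the
    list of categories assigned to that query (possibly empty).
    """
    total = len(predictions)
    if total == 0:
        # Avoid dividing by zero when there is nothing to score.
        return {"pct_three": 0.0, "pct_one_or_more": 0.0}
    three = sum(1 for cats in predictions if len(cats) == 3)
    one_or_more = sum(1 for cats in predictions if len(cats) >= 1)
    return {
        "pct_three": 100.0 * three / total,
        "pct_one_or_more": 100.0 * one_or_more / total,
    }


# Four hypothetical queries: one with 3 categories, one with 1,
# one with none, one with 2.
runs = [["a", "b", "c"], ["a"], [], ["a", "b"]]
print(coverage_metrics(runs))
```

Running this once per augmentation level (and once per augmentation technique) would give the numbers to compare.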

My assumption is that without my boosting of the dataset, the classifier will perform poorly. Hopefully, the boost to the classifier is noticeable enough for a given amount of augmentation.

Then as a validation step, I'll classify the 800 queries and compare the quality of my results.

The main threat to this approach is that I'll have to augment with a very large % of the dataset to show even a noticeable improvement in the classification.

Also, the augmentation will be performed in a subjective manner - by me.

Thursday, March 16, 2006