Hamotzi's Data Mining Log: Transductive learner

SVMlight has a transductive learner: the classifier is trained on a dataset containing both labeled and unlabeled vectors. This looks interesting, and it's something I want to try.
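
To make the mixed dataset concrete, here is a minimal sketch of how such a training file could be written. It relies on the SVMlight input convention of one example per line ("<target> <index>:<value> ..."), with target 0 marking an unlabeled example to be handled by transduction. The helper names and the toy vectors are hypothetical and only for illustration.

```python
def to_svmlight_line(target, features):
    """Format one example as '<target> <index>:<value> ...'.

    target: +1 or -1 for labeled examples, 0 for unlabeled ones.
    features: dict mapping a 1-based feature index to a numeric value.
    """
    # SVMlight expects feature indices in increasing order.
    pairs = " ".join(f"{i}:{v}" for i, v in sorted(features.items()))
    return f"{target} {pairs}"


def build_training_file(labeled, unlabeled):
    """labeled: list of (target, features); unlabeled: list of features."""
    lines = [to_svmlight_line(t, f) for t, f in labeled]
    # Unlabeled vectors go into the same file with target 0.
    lines += [to_svmlight_line(0, f) for f in unlabeled]
    return "\n".join(lines) + "\n"


# Hypothetical toy data: two labeled vectors and one unlabeled vector.
labeled = [(1, {1: 0.5, 3: 1.0}), (-1, {2: 0.25})]
unlabeled = [{1: 0.1, 2: 0.9}]
text = build_training_file(labeled, unlabeled)
```

The resulting text can be saved to a file and passed to the learner as an ordinary training file; no separate file is needed for the unlabeled vectors.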

I need to think about how to pick the unlabeled vectors, and how many of them to pick. I'm thinking of using my (still to be clearly defined) clustering technique to pick the vectors. The question is: which vectors should be chosen?
Option 1: Create "several" small clusters. Pick n% of the candidates from the top-k clusters. This should give a representative sample that includes vectors from all categories.
Option 2: Create a biased dataset - make sure to include every sample that could possibly belong to the target category.
Option 3: Random selection.
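
The three options could be sketched roughly as follows. Since the clustering technique is not defined yet, the sketch simply takes cluster assignments as input. Everything here is an assumption made for illustration: "top-k" is read as "the k largest clusters", and the per-vector score and threshold used for the biased option are hypothetical.

```python
import random


def pick_from_top_clusters(clusters, k, fraction, rng):
    """Option 1: sample a fraction of each of the k largest clusters.

    clusters: dict mapping cluster id -> list of vector ids.
    Reading 'top' as 'largest' is an assumption.
    """
    top = sorted(clusters.values(), key=len, reverse=True)[:k]
    picked = []
    for members in top:
        # Take at least one vector per non-empty cluster.
        n = min(len(members), max(1, round(len(members) * fraction)))
        picked.extend(rng.sample(members, n))
    return picked


def pick_biased(scores, threshold):
    """Option 2: keep every vector whose (hypothetical) target-category
    score reaches the threshold."""
    return [vid for vid, s in scores.items() if s >= threshold]


def pick_random(vector_ids, n, rng):
    """Option 3: uniform random selection without replacement."""
    return rng.sample(vector_ids, min(n, len(vector_ids)))


# Hypothetical toy inputs.
rng = random.Random(0)
clusters = {"a": [1, 2, 3, 4], "b": [5, 6], "c": [7]}
from_clusters = pick_from_top_clusters(clusters, k=2, fraction=0.5, rng=rng)
biased = pick_biased({1: 0.9, 2: 0.2, 3: 0.6}, threshold=0.5)
at_random = pick_random(list(range(10)), 4, rng)
```

Passing the random generator in explicitly keeps the runs repeatable, which should make it easier to compare the three options on the same data.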

I think I'll try all three options.

The size of the dataset:
Option 1: make it a function of the number of vectors that are known to belong to the category (based on the labeled dataset), the number of labeled vectors belonging to other categories, and the total number of unlabeled vectors.
e.g. ( (# in category) / (total labeled) ) * (total unlabeled) * (some constant)

Option 2: make it a function of the number of vectors that belong to the category only.
e.g. (# in category) * 60
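
As a quick sanity check on the two sizing rules, here is a small sketch. The function names, the default constant, and the cap at the number of available unlabeled vectors are my own assumptions; "total labeled" is taken to be the category count plus the count for the other categories.

```python
def size_proportional(n_category, n_other, n_unlabeled, constant=1.0):
    """Option 1: (# in category / total labeled) * total unlabeled * constant."""
    total_labeled = n_category + n_other
    if total_labeled == 0:
        return 0
    size = n_category / total_labeled * n_unlabeled * constant
    # Assumption: never ask for more vectors than are available.
    return min(n_unlabeled, round(size))


def size_multiple(n_category, n_unlabeled, multiplier=60):
    """Option 2: a fixed multiple of the category size (60 in the example)."""
    return min(n_unlabeled, n_category * multiplier)


# Hypothetical counts: 50 labeled vectors in the category, 450 in other
# categories, and 20000 unlabeled vectors.
opt1 = size_proportional(50, 450, 20000, constant=0.5)
opt2 = size_multiple(50, 20000)
```

With these made-up counts, the first rule asks for 1000 unlabeled vectors and the second for 3000, so the two rules can differ by a fair margin depending on the constant.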

I will have to try different dataset sizes to see if there's any improvement. Note: the vectors I'm selecting will not be manually labeled, but they will still be used for training. This is just an experiment that I'm conducting.

Thursday, March 23, 2006
