Thursday, March 23, 2006

Transductive learner

SVM Light has a transductive learner: the classifier is trained on a dataset containing both labeled and unlabeled vectors. This seems interesting and is something I want to try.
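Per the SVM Light documentation (as I read it), the transductive mode is driven entirely by the training file: labeled examples carry a target of +1 or -1, unlabeled examples carry a target of 0, all in the usual sparse feature:value format, and svm_learn infers labels for the 0-target vectors during training. A tiny made-up example (the feature ids and values are placeholders):

+1 12:0.30 857:0.11 4021:0.08 # labeled: belongs to the category
-1 7:0.25 911:0.41 # labeled: does not belong
0 12:0.28 4021:0.10 # unlabeled, used transductively
0 33:0.19 857:0.09 # unlabeled, used transductively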

I need to think about how to pick the unlabeled vectors and how many of them to pick. I'm thinking of using my (still to be clearly defined) clustering technique to pick them. The question is: which vectors should be chosen?
Option 1: Create "several" small clusters. Pick n% of candidates from the top-k clusters. This will be a representative sample that (should) include samples from all categories.
Option 2: Create a biased dataset - make sure to include every candidate that could plausibly belong to the target category.
Option 3: Random selection

I think I'll try all three options.

The size of the dataset:
Option 1: make it a function of:
  • the number of vectors known to belong to that category (based on the labeled dataset)
  • the number of labeled vectors belonging to other categories
  • the total number of unlabeled vectors
  • e.g., ((# in category) / (total labeled)) * (total unlabeled) * (some constant)
Option 2: make it a function of:
  • the number of vectors that belong to the category
  • e.g., (# in category) * 60
I will have to try different dataset sizes to see if there's any improvement. Note: the vectors I'm selecting will not be manually labeled, but they will be used for training. This is just an experiment I'm conducting.
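To make the two sizing heuristics above concrete, a rough Python sketch (the scale constants are placeholders I'd tune):

# Rough sketch of the two dataset-sizing heuristics above.
# The constants (scale, factor) are placeholders to be tuned experimentally.

def option1_size(num_labeled_in_cat, total_labeled, total_unlabeled, scale=0.5):
    # size proportional to the category's share of the labeled data
    return int((num_labeled_in_cat / float(total_labeled)) * total_unlabeled * scale)

def option2_size(num_labeled_in_cat, factor=60):
    # size as a fixed multiple of the category's labeled examples
    return num_labeled_in_cat * factor

# With the Computers/Hardware numbers (6 labeled positives, 178 labeled, 800K unlabeled):
print(option1_size(6, 178, 800000))   # 13483 with scale=0.5
print(option2_size(6))                # 360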

SVM Light - Preliminary results

I experimented briefly (and unsystematically) with SVM Light to get a feel for the tool.
The results were better than libsvm's. The parameters of the two tools are similar, though SVM Light has more knobs.
The best result I could achieve was a little under 5% precision with around 55% recall. I'm surprised the recall was that high; the precision was disappointing.

Wednesday, March 22, 2006

SVM Light - First look

At first glance, SVM Light has a bazillion more parameters than libsvm (a good and a bad thing).
I've downloaded the binaries. My objective is to play around with it and see if I can improve on libsvm's poor results. Also, performance and scalability may be issues.

Update: On Goal for this week

I generated the Berkeley DB datasets. They are definitely much faster to work with than MySQL. So I have the following datasets:
  1. Validation dataset
  2. Training dataset
Each of these datasets has tuples of the form:
<query, (category)*, (ngram, frequency, tf-idf, normalized tf-idf)*>
i.e., for each query, I have the categories the query belongs to and its n-grams. For each n-gram, I have its frequency in the query, its tf-idf (term frequency, inverse document frequency) value, and its normalized tf-idf value.

I picked category 1 (Computers/Hardware) and generated training and validation SVM datasets from the Berkeley DB datasets above. The format of the training and validation datasets is:
<class label> (<feature index>:<value>)*
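Concretely, each line is in the usual libsvm/SVM Light sparse format - a class label followed by feature:value pairs. The indices and weights below are made up for illustration:

+1 412:0.0831 9127:0.0610 18773:0.0224
-1 96:0.1102 412:0.0415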

Data Statistics:
  • the training dataset had only 6+/172- (i.e. 6 instances of this category and 172 instances that did not belong to this category).
  • the validation dataset had 23+/777-
Results:
  • Since the data is unbalanced, the results were, as expected, horrible.
  • I tried the RBF kernel with several different parameter settings, including the ones suggested by easy.py and grid.py - no cigar. All 23 positive instances were classified as negative, i.e., all 800 validation queries were predicted as not Computers/Hardware.
Possible Next steps:
  • Validate that the tf-idf calculations are correct (I need to double-check the code)
  • Check whether my datasets are corrupt (or simply incorrect)
  • Try other SVM software - SVM Light
  • Try another category that has more examples

Monday, March 20, 2006

Issue: Parameter tuning

libsvm has several parameters - kernel type, cost, ...
Is there a systematic approach that I can use to find an ideal (or good enough) set of parameters?
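For the record, the standard starting point is a brute-force grid search over (C, gamma) with a held-out validation set - essentially what libsvm's grid.py does. A rough sketch, where train_and_validate() is a placeholder for however I end up invoking the trainer and scoring the result:

# Brute-force grid search over (C, gamma) for an RBF-kernel SVM.
# train_and_validate(c, gamma) is a hypothetical callable that trains a model
# and returns a validation score (e.g. F1 on the held-out set).

def grid_search(train_and_validate):
    best = (None, None, -1.0)
    for log2c in range(-5, 16, 2):        # C = 2^-5 ... 2^15
        for log2g in range(-15, 4, 2):    # gamma = 2^-15 ... 2^3
            c, gamma = 2.0 ** log2c, 2.0 ** log2g
            score = train_and_validate(c, gamma)
            if score > best[2]:
                best = (c, gamma, score)
    return best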

Sunday, March 19, 2006

Goal for this week

My goal for this week is:
(1) Generate Berkeley DB datasets
Stretch goals:
(2) Train classifier for one category
(3) Predict - on validation dataset

Discussion
(1) Generate Berkeley DB datasets
MySQL is a bottleneck. Converting the datasets to Berkeley DB should help performance. The datasets I'm going to create will have tf-idf and normalized tf-idf values.
(2) Train classifier on one category
I might face an issue here. The number of rows is small (about 178), but the number of features is 83K, so libsvm may not scale. I might have to re-write parts of the SVM software, reduce the number of features, or use SVM Light.
(3) Predict on the Validation set
Again, this is the same problem as (2). The advantage of predicting on the validation set is that it is considerably smaller (I don't know the size yet).

Thursday, March 16, 2006

Project approach

My report could be on measuring the effects of the unlabeled dataset augmentation using my Clustering/summarization technique.

My hypothesis is that the labeled dataset is too small in relation to the unlabeled dataset; therefore, an SVM classifier trained on the small labeled dataset will suck on the unlabeled dataset.
A secondary point is that my approach for "intelligent" labelling is a good one and is better than a random approach.

Before we dive into the details:
  1. I plan to use the SVM classifier as a binary classifier, so each category is treated separately: given a query Q, I compute the probability that Q is in Category A and the probability that it is not, and the higher probability wins. In cases where more than 3 categories are predicted, I'll use a ranking scheme (break ties by probability, or manually) to determine the categories - see the sketch after this list. In cases where no category is predicted - well - that would suck.
  2. There are 3 datasets in question: a training dataset of 170-odd labeled queries, a validation dataset of 800 queries, and an unlabeled dataset of (800K - 800) queries.
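To make item 1 concrete, here is a rough Python sketch of the per-category scheme. The models dict and its probability functions are placeholders - I haven't decided how the per-category probabilities will actually be produced (libsvm's -b probability estimates, SVM Light outputs, etc.).

# Sketch of the per-category binary scheme from item 1 above.
# 'models' maps each category to a hypothetical function that returns
# P(query belongs to that category).

def predict_categories(query_vector, models, max_categories=3):
    scores = {cat: prob_fn(query_vector) for cat, prob_fn in models.items()}
    # Keep the categories whose binary classifier voted positive (p > 0.5),
    # break ties by probability, and cap the list at max_categories.
    positives = sorted(((p, cat) for cat, p in scores.items() if p > 0.5), reverse=True)
    return [cat for p, cat in positives[:max_categories]]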

To prove the main hypothesis, I will compare the results of training the classifier on the unaugmented labeled dataset, a slightly augmented dataset (1%), a more heavily augmented one, and so on. What I hope to show is that the augmentation helps. I plan to augment using two different techniques: a random process and my "intelligent" process of cluster summarization. For each of these classifiers, the metrics I'll capture are:
  1. % of queries successfully classified in 3 categories
  2. % of queries successfully classified in 1 or more categories
My assumption is that without this boosting of the dataset, the classifier will suck. Hopefully the improvement from the augmentation is noticeable (a rough sketch of how I'd compute these metrics follows).
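Aside: here is one possible reading of those two metrics in code. This is just my interpretation of what "successfully classified" should mean, not a settled definition.

# Possible interpretation of the two metrics above: a query counts toward
# metric 1 if at least 3 of its predicted categories are correct, and toward
# metric 2 if at least 1 is correct.

def coverage_metrics(predictions, gold):
    # predictions, gold: dicts mapping query id -> set of categories
    n = len(gold)
    hit3 = sum(1 for q in gold if len(predictions.get(q, set()) & gold[q]) >= 3)
    hit1 = sum(1 for q in gold if len(predictions.get(q, set()) & gold[q]) >= 1)
    return 100.0 * hit3 / n, 100.0 * hit1 / n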

Then as a validation step, I'll classify the 800 queries and compare the quality of my results.

The main threat to this approach is that I may have to augment a very large percentage of the dataset to show even a noticeable improvement in classification.

Also, the augmentation will be performed in a subjective manner - by me.

Issue: Understanding CLUTO metrics

So there might be clusters which suck: too sparse, not enough internal similarity, too much external similarity (i.e., not tight and not well separated from the other clusters).

CLUTO provides metrics such as ISim and ESim. How do I know what counts as a good number?
Maybe I can use the visualization features to judge cluster quality?

Determining prototypical vectors

Given a cluster, one way of determining the representatives or prototypes of vectors in that cluster could be:
- generate a mean vector with normalized tf-idf features
- the prototypes are the vectors that are most similar (based on the cosine sim measure) to this mean vector

Calculation of the mean vector (a code sketch follows these steps):
- Take the values of the features for all the vectors in a cluster
- Calculate the mean for each feature
- Normalize the mean
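A quick Python sketch of the above (sparse vectors as dicts of feature -> normalized tf-idf weight; the choice of k is arbitrary here):

# Prototype selection: build the normalized mean vector of a cluster, then
# rank the cluster's members by cosine similarity to it.
import math
from collections import defaultdict

def mean_vector(vectors):
    # vectors: list of dicts mapping feature -> normalized tf-idf value
    sums = defaultdict(float)
    for v in vectors:
        for feat, val in v.items():
            sums[feat] += val
    n = float(len(vectors))
    mean = {feat: total / n for feat, total in sums.items()}
    norm = math.sqrt(sum(val * val for val in mean.values()))
    return {feat: val / norm for feat, val in mean.items()} if norm else mean

def cosine(a, b):
    # both vectors are assumed to be L2-normalized already
    return sum(val * b.get(feat, 0.0) for feat, val in a.items())

def prototypes(vectors, k=5):
    center = mean_vector(vectors)
    return sorted(vectors, key=lambda v: cosine(v, center), reverse=True)[:k]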

Issue: Databases

My bottleneck in processing is the I/O.
The advantage of using a SQL database is that I can run ad-hoc queries and understand my data better. The disadvantage is that it is a remote process and is slow.
As a compromise, I might use the Berkeley DB embedded database to store information that I will not really run queries against but need fast access to AND that is going to change often. So the normalized tf-idf metrics could be stored in a Berkeley DB.
The "perceived" advantage of Berkeley DB is that it's embedded and should be faster than MySQL - though still far slower than a flat file and much slower than in-memory data structures.
Hmmm, since I have one gig of RAM and can expand it to 4 GB, maybe in-memory data structures are not a bad idea for certain metrics.
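To illustrate the access pattern I have in mind (using Python's built-in shelve module as a stand-in for Berkeley DB - the point is the embedded key-value lookup, not the specific library):

# Embedded key-value store sketch: normalized tf-idf weights keyed by query id.
# shelve is used here only as a stand-in for Berkeley DB.
import shelve

def store_weights(path, weights_by_query):
    with shelve.open(path) as db:
        for query_id, weights in weights_by_query.items():
            db[str(query_id)] = weights

def load_weights(path, query_id):
    with shelve.open(path) as db:
        return db.get(str(query_id), {})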

Issue: tf-idf calculation issue

The problem with the tf-idf metric is that if I modify a single query - say, fix a spelling error - I need to regenerate the tf-idf values for the entire dataset, after which everything has to be re-normalized!
This is a very expensive operation. I wonder: if I used Berkeley DB instead of the MySQL database, would it be faster?

Document representation for Clustering

Currently, I'm using n-gram counts in a sparse matrix format.
I believe that I would get better performance if I changed it to a normalized tf-idf format and used the cosine similarity measure.
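For reference, the weighting I have in mind is the standard tf * log(N/df) form with L2 normalization, so that cosine similarity between two queries reduces to a dot product over shared n-grams. A sketch (the exact tf-idf variant is still up for grabs):

# Standard tf-idf weighting with L2 normalization.
import math

def tfidf_vector(term_counts, doc_freq, num_docs):
    # term_counts: n-gram -> count in this query
    # doc_freq: n-gram -> number of queries containing that n-gram
    weights = {t: tf * math.log(num_docs / float(doc_freq[t]))
               for t, tf in term_counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else weights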

Wednesday, March 15, 2006

Sample results from Google API

I downloaded the Google API and ran their demo client (see results below).

Only one of the returned results had a directory category. I wonder if there is a way to limit results only to the ones that have been categorized - need to look at the WSDL.


$ java -cp googleapi.jar com.google.soap.search.GoogleAPIDemo `cat lic` search pictures
Parameters:
Client key = ***********************
Directive = search
Args = pictures
Google Search Results:
======================
{
TM = 0.390328
Q = "pictures"
CT = ""
TT = ""
CATs =
{
{SE="", FVN="Top/Computers/Graphics/Clip_Art/Pictures"}
}
Start Index = 1
End Index = 10
Estimated Total Results Number = 64400000
Document Filtering = false
Estimate Correct = false
Rs =
{

[
URL = "http://www.freefoto.com/"
Title = "FreeFoto.com - Free Pictures - FreeFoto.Com"
Snippet = "FreeFoto.com is the largest collections of free photographs for non-commercial
use on the Internet."
Directory Category = {SE="", FVN=""}
Directory Title = ""
Summary = ""
Cached Size = "19k"
Related information present = true
Host Name = ""
],

[
URL = "http://www.paramount.com/"
Title = "Paramount Pictures"
Snippet = "Feature film production and distribution, video and DVD worldwide distribution,
and production of programs for television broadcast and syndication."
Directory Category = {SE="", FVN=""}
Directory Title = ""
Summary = ""
Cached Size = "2k"
Related information present = true
Host Name = ""
],

[
URL = "http://www.insanepictures.com/"
Title = "Insane Pictures - Funny Pictures, Photos, Movies, Games, Comics ..."
Snippet = "A collection of funny pictures, photos, cartoons, comics, jokes, flash and games,
updated daily!"
Directory Category = {SE="", FVN=""}
Directory Title = ""
Summary = ""
Cached Size = "34k"
Related information present = true
Host Name = ""
],

[
URL = "http://www.sonypictures.com/"
Title = "Sony Pictures"
Snippet = "Information about the large number of movies and television programs produced by
Sony Pictures."
Directory Category = {SE="", FVN=""}
Directory Title = ""
Summary = ""
Cached Size = "46k"
Related information present = true
Host Name = ""
],

[
URL = "http://www.corbis.com/"
Title = "Corbis: stock photography and digital pictures"
Snippet = "pictures, stock photography, digital stock photography, stock photos, stock image,
stock pictures, stock agency, digital images, advertising stock, ..."
Directory Category = {SE="", FVN=""}
Directory Title = ""
Summary = ""
Cached Size = "60k"
Related information present = true
Host Name = ""
],

[
URL = "http://gallery.yahoo.com/"
Title = ""
Snippet = ""
Directory Category = {SE="", FVN=""}
Directory Title = ""
Summary = ""
Cached Size = "1k"
Related information present = true
Host Name = ""
],

[
URL = "http://www.universalpictures.com/"
Title = "Universal Pictures"
Snippet = "Details of current and future releases, including cast lists and synopses, with
links to trailers."
Directory Category = {SE="", FVN=""}
Directory Title = ""
Summary = ""
Cached Size = "5k"
Related information present = true
Host Name = ""
],

[
URL = "http://hubblesite.org/newscenter/"
Title = "STScI/HST Pictures"
Snippet = "Hubble Space Telescope Public Pictures."
Directory Category = {SE="", FVN=""}
Directory Title = ""
Summary = ""
Cached Size = "14k"
Related information present = true
Host Name = ""
],

[
URL = "http://antwrp.gsfc.nasa.gov/apod/archivepix.html"
Title = "Astronomy Picture of the Day Archive"
Snippet = "Daily archive of interesting astronomy photos by NASA."
Directory Category = {SE="", FVN=""}
Directory Title = ""
Summary = ""
Cached Size = "101k"
Related information present = true
Host Name = ""
],

[
URL = "http://www.universalstudios.com/"
Title = "Welcome to Universal Studios.com"
Snippet = "We do this through our motion pictures and home videos, theme parks and attractions,
television networks and programming, and much more."
Directory Category = {SE="", FVN=""}
Directory Title = ""
Summary = ""
Cached Size = "14k"
Related information present = true
Host Name = ""
]
}
}

Issue: Manual enhancement of the keywords for the categories

I could manually associate each of the categories with new keywords.
Other teams have taken this approach - (Eric).

It's a pain in the butt to regenerate the n-grams and the tf-idf values, so this would be an extremely expensive task.
Another approach might be to create fictional queries made up of the keywords. That might be an easier task.

Issue: Training data set size

So my current focus is on increasing the size of the data set.
How large should this labeled dataset be?

* 0.1% of 800K is 800
* 0.5% -> 4,000

If I can increase it to about 4,000, I might (since the task is manual and subjective) be able to improve the classification results.

Use of Google API

Google provides a mechanism (a web services API) to search programmatically.
I could use this API to get "hints" for building up my labeled dataset.
I don't like it - but others have done it for the KDD Cup.

Issue: Calculating Class purity

How do you calculate class purity in a multi-category scenario?
If a query can have 3 classes, then how does the purity get calculated?
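For the single-label case, the usual definition is purity = (1/N) * sum over clusters of the size of the cluster's majority class. One naive way to extend it to multi-label queries - and this is just my assumption, not a standard definition - is to let each query contribute all of its labels and credit the cluster for its single most frequent label:

# Naive multi-label purity: each query contributes all of its labels; a cluster
# is credited for its most frequent label. One possible convention, not a
# standard one.
from collections import Counter

def purity(clusters, labels):
    # clusters: list of lists of query ids; labels: query id -> set of categories
    total = sum(len(c) for c in clusters)
    credit = 0
    for cluster in clusters:
        counts = Counter(lab for q in cluster for lab in labels.get(q, ()))
        credit += counts.most_common(1)[0][1] if counts else 0
    return credit / float(total) if total else 0.0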

Manually labeling prototypes

I have a two-pronged strategy for labeling cluster elements:
(1) A cluster might contain one or more labeled vectors. I could use these as suggestions for manually labeling the cluster elements.
(2) I could use the Google API to build a list of suggested categories and use this for manually labeling the Cluster elements.

(1) depends on the cluster purity and the number of labeled instances in the cluster.
(2) I don't like this approach - but it was used by other competitors.

Issue: Determining prototypical vectors using CLUTO

I need to figure out a way of finding the prototypes or representatives for a cluster in CLUTO.

Clustering Objective

The fundamental problem with the datasets for the KDD competition is that the labeled dataset is very small (178 rows compared to 800K rows).
I need a mechanism for intelligently labeling a larger percentage (TBD) of the 800K rows to build a better-quality classifier. Since labeling is a manual task, I need to get the best bang for my buck.
What I propose to do is the following:
- Cluster the Labeled+Unlabeled Data set
- For each cluster, determine a small set of prototypical vectors (queries) that best represent that cluster
- Label these vectors

This is better than randomly selecting queries because:
- It has the effect of labeling a large number of vectors that are similar to the prototypical vectors
- In effect (not literally - I have to leave something for the classifier to do), you label a large number of vectors with this approach

CLUTO 2.1.1 experience

See the "Clustering Objective" entry regarding the objectives of clustering...

I clustered the combined labeled and unlabeled datasets using CLUTO 2.1.1. I used the sparse matrix format: the rows are the documents and the columns are the features. In this case, the rows are the queries (800,173 of them) and the columns are the n-gram features (approximately 83K).
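For reference, the sparse matrix file I feed to vcluster looks like this (as I read the CLUTO manual): a header line with the number of rows, columns, and non-zero entries, followed by one line per query listing 1-based feature-index/value pairs. A small writer sketch:

# Writes a CLUTO-style sparse matrix file: header "nrows ncols nnz", then one
# line per row of 1-based "column value" pairs (per my reading of the manual).

def write_cluto_sparse(path, rows, num_features):
    # rows: list of dicts mapping 0-based feature index -> value
    nnz = sum(len(r) for r in rows)
    with open(path, "w") as f:
        f.write("%d %d %d\n" % (len(rows), num_features, nnz))
        for r in rows:
            f.write(" ".join("%d %g" % (feat + 1, val)
                             for feat, val in sorted(r.items())) + "\n")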

I was surprised - CLUTO was extremely fast: the rbr method (repeated bisections with refinement) took about 615 seconds.
I wrote some code to generate an HTML report for each cluster - so I could look at the queries that were placed in each cluster.

I need to perform several experiments, varying the following parameters:
- number of clusters
- similarity measure (cosine or correlation)
- tf-idf versus feature count

CLUTO provides the following information regarding each cluster:
- internal sim
- internal stdev
- external sim
- external stdev

In addition, I need to compute for each cluster (a rough sketch follows this list):
- class purity (see the "Issue: Calculating Class purity" entry)
- # of distinct labels per cluster
- # of labeled data set elements per cluster
- ratio
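A rough sketch of the per-cluster bookkeeping (I'm reading "ratio" as the fraction of a cluster's members that are labeled; that may change):

# Per-cluster statistics: labeled members, distinct labels, labeled ratio.
# labels: query id -> set of categories (missing/empty for unlabeled queries).

def cluster_stats(cluster, labels):
    labeled = [q for q in cluster if labels.get(q)]
    distinct = set(lab for q in labeled for lab in labels[q])
    return {
        "size": len(cluster),
        "labeled_members": len(labeled),
        "distinct_labels": len(distinct),
        "labeled_ratio": len(labeled) / float(len(cluster)) if cluster else 0.0,
    }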

Master's Report

I'm going to be working on my Master's Report over the next few months.

At this time (this might change), my plan is to revisit the KDD Cup 2005 competition. I plan to use new techniques to solve the same problem. The advantage is that I already understand the problem, and I can compare the results of the new techniques against my simplistic Naive Bayes/Active Learning submission.

I hope to use SVMs as my classifier. There are several issues with this approach that I'll discuss over the next few weeks.