Thursday, June 30, 2005

KDD Update

So my results using Multinomial Naive Bayes with some tweaking have worked out well.

Using only 1697 features, I could classify 75K queries.
This is remarkable considering that there are 83K features and 800K queries in total.
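For reference, here is a minimal sketch of the multinomial Naive Bayes scoring I'm relying on (class and field names are illustrative, not my actual code): score each category by its log prior plus the sum of per-term log likelihoods with Laplace smoothing, then keep the top-scoring categories.

```java
import java.util.*;

// Minimal multinomial Naive Bayes scorer (illustrative sketch, not the actual KDD code).
class MultinomialNB {
    // termCounts.get(cat).get(term) = count of term in training queries labeled cat
    Map<String, Map<String, Integer>> termCounts = new HashMap<>();
    Map<String, Integer> totalTermsPerCat = new HashMap<>(); // total term occurrences per category
    Map<String, Integer> docsPerCat = new HashMap<>();       // number of training queries per category
    int totalDocs = 0;
    int vocabularySize = 1697;                               // size of the reduced feature set

    // Score one query (already split into terms) against every category.
    Map<String, Double> score(List<String> queryTerms) {
        Map<String, Double> scores = new HashMap<>();
        for (String cat : docsPerCat.keySet()) {
            double logProb = Math.log((double) docsPerCat.get(cat) / totalDocs); // log prior
            int catTotal = totalTermsPerCat.get(cat);
            for (String term : queryTerms) {
                int count = termCounts.get(cat).getOrDefault(term, 0);
                // Laplace smoothing so an unseen term does not zero out the category
                logProb += Math.log((count + 1.0) / (catTotal + vocabularySize));
            }
            scores.put(cat, logProb);
        }
        return scores; // caller keeps the top 5 categories by score
    }
}
```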

The other "tweak" that I might perform - given the constraint of time and hardware resources - is to "hand tweak" some of the queries.
For e.g. for "legal action against harrasing creditors" suggest "Information\Law & Politics".
legal assistant jobs in news york city => Information\Law & Politics and Living\Career & Jobs

So it's an iterative, genetic-algorithm-style approach:
- classify the entire 800K
- scan through and manually pick k% of the "good" categorizations that you want replicated
- pick another k% that were badly categorized and manually assign categories to them
- add these to the training set
- build a model
- run classifier

This iterative approach should produce better results.

Wednesday, June 22, 2005

Use of Naive Bayes along with SVMs

I plan to use Naive Bayes in the following manner:
  • As a pre-processing step for SVMs (see the sketch after this list):
    • Feature Selection - drastically reduce the number of terms so the SVM can scale
    • Enhancing the training set - increase its size
  • As part of an ensemble of techniques - implement three techniques (Multivariate Bernoulli NB, Multinomial NB, and SVM) and use the results from all three to build confidence in the classification
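As a rough illustration of the feature-selection idea, here is a sketch that ranks terms by a Naive-Bayes-style log-odds score for one category versus the rest and keeps only the top k. The names and the exact criterion are placeholders for whatever ends up working:

```java
import java.util.*;
import java.util.stream.Collectors;

// Sketch: use Naive Bayes statistics to rank terms and keep only the top K for the SVM.
class NBFeatureSelector {
    // countInCat[term]  = occurrences in queries of the target category
    // countOutCat[term] = occurrences everywhere else
    Set<String> selectTopTerms(Map<String, Integer> countInCat,
                               Map<String, Integer> countOutCat,
                               int k) {
        int inTotal = countInCat.values().stream().mapToInt(Integer::intValue).sum();
        int outTotal = countOutCat.values().stream().mapToInt(Integer::intValue).sum();
        int vocab = countInCat.size() + countOutCat.size(); // rough vocabulary size for smoothing

        Map<String, Double> scores = new HashMap<>();
        for (String term : countInCat.keySet()) {
            double pIn = (countInCat.getOrDefault(term, 0) + 1.0) / (inTotal + vocab);
            double pOut = (countOutCat.getOrDefault(term, 0) + 1.0) / (outTotal + vocab);
            scores.put(term, Math.abs(Math.log(pIn / pOut))); // terms that discriminate score high
        }

        // keep the k highest-scoring terms
        return scores.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .limit(k)
                .map(Map.Entry::getKey)
                .collect(Collectors.toCollection(LinkedHashSet::new));
    }
}
```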

Monday, June 20, 2005

KDD Results

I have my first set of results:
  • Algorithm - Naive Bayes
  • Pre-processing - N-grams
Issues:
  • How do you validate your results?
  • My results did not make sense at all - for example, "abusivierelationships" was placed in "Shopping\Online Stores", "Shopping\Buying Guides", "Work & Money\Industries", "Entertainment\Arts & Culture", "Computing\Hardware" in descending order of probability! I think the category "Personal\Relationships" should have been in there
  • I did not use a binary classifier

Sunday, June 19, 2005

MySQL driver issue?

Calling setAutoCommit(false) does not work - the driver commits every insert anyway.
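For reference, this is the pattern I'm attempting (table and column names are made up): disable auto-commit, batch the inserts, and commit once at the end. One possible explanation I haven't verified: if the tables are MyISAM (the MySQL default), they are non-transactional, so setAutoCommit(false) has no effect and every insert is committed immediately; InnoDB would be needed for real transactions.

```java
import java.sql.*;
import java.util.List;

// Sketch of the intended batching (illustrative schema, not the actual one).
class BatchInsert {
    static void insertTerms(Connection conn, List<String[]> rows) throws SQLException {
        conn.setAutoCommit(false); // expect nothing to be committed until commit() below
        PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO query_terms (query_id, term) VALUES (?, ?)");
        for (String[] row : rows) {
            ps.setInt(1, Integer.parseInt(row[0]));
            ps.setString(2, row[1]);
            ps.addBatch();
        }
        ps.executeBatch();
        conn.commit();             // expected: rows become visible only here
        ps.close();
    }
}
```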

KDD Update

We've been heads down cranking away.
  • I've switched gears - using Naive Bayes instead of SVM (explanation later).
  • I need to read up on MySQL scalability - there's a book out by Jeremy Z. which seems interesting
  • I need to purchase a profiler since I keep running out of memory
  • I abandoned Hibernate since I did not have the time to read its documentation and the performance was quite bad (could be my code or the db schema)

Tuesday, June 07, 2005

The panacea for feature reduction?

Proposes the use of Soundex and N-grams for feature reduction
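A rough sketch of what both reductions could look like - character n-grams plus a (simplified) Soundex code so that misspellings collapse onto the same feature:

```java
import java.util.*;

// Sketch of the two reductions: character n-grams and a simplified Soundex code.
// (This is a simplified Soundex; the full rules have more edge cases.)
class FeatureReduction {

    // Break a term into overlapping character n-grams, e.g. "credit" -> [cre, red, edi, dit] for n=3.
    static List<String> charNGrams(String term, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= term.length(); i++) {
            grams.add(term.substring(i, i + n));
        }
        return grams;
    }

    // Map a term to a 4-character phonetic code so that spelling variants
    // like "harrasing"/"harassing" collapse onto the same feature.
    static String soundex(String term) {
        String s = term.toUpperCase().replaceAll("[^A-Z]", "");
        if (s.isEmpty()) return "";
        String codes = "01230120022455012623010202"; // A..Z -> digit ('0' = ignored letter)
        StringBuilder out = new StringBuilder();
        out.append(s.charAt(0));
        char prev = codes.charAt(s.charAt(0) - 'A');
        for (int i = 1; i < s.length() && out.length() < 4; i++) {
            char code = codes.charAt(s.charAt(i) - 'A');
            if (code != '0' && code != prev) out.append(code);
            prev = code;
        }
        while (out.length() < 4) out.append('0');
        return out.toString();
    }
}
```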

Sunday, June 05, 2005

SVM Software: SVM-Light

Software Reference: Bow: A Toolkit for Statistical Language Modeling, Text Retrieval, Classification and Clustering

No updates have been made since 2002, though it might have some good algorithms for data pre-processing.

KDD Cup: Crossroads

A lot of papers claim SVMs work well in high-d. In the examples they provide, high-d is roughly in the tens of thousands. What if you have a dataset that has 200,000 dimensions? The paper "A Divisive Information-Theoretic Feature Clustering" claims SVMs are problematic in high-d. It is, however, not a well-cited paper (only 5 citations) and uses linear SVMs. Could it also be that the issue was with the software they used?

Paper Review: A Divisive Information-Theoretic Feature Clustering

A Divisive Information-Theoretic Feature Clustering Algorithm for Text Classification
Author: Inderjit Dhillon

Overview
Text classification of high-d data using SVMs is challenging. They propose using feature clustering to reduce the dimensionality instead. They propose a global criterion for feature clustering and an algorithm for clustering based on this objective function. Their experiments contrast their approach with SVM and Naive Bayes.

Interesting points/concepts
  1. Claim - dimensionality of 14,538 can be a severe obstacle for classification algorithms based on SVMs, LDAs, k-nearest neighbors
  2. "One can reduce dimensionality by the distributional clustering of words and features. Each word cluster can then be teated as a single feature and dimensionality can be drastically reduced. Feature Clustering is more effective than feature selection. "
    1. Feature Clustering? is better than feature selection for reducing dimensionality
  3. "Extend any classifier to perform hierarchical classification by constructing a (distinct) classifier at each internal node of the tree using all the documents in its child nodes as the training data. Thus the tree is assumed to be “is-a” hierarchy, i.e., the training instances are inherited by the parents."
  4. "Hierarchical classification along with feature selection has been shown to achieve better
    classification results than a flat classifier"
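To make point 2 concrete, here is a tiny sketch of how a query's per-word counts would collapse into per-cluster counts once a word-to-cluster assignment exists (the assignment itself would come from the clustering step; names are illustrative):

```java
import java.util.*;

// Sketch: once words are clustered, a query's feature vector shrinks from
// one dimension per word to one dimension per cluster.
class ClusterFeatures {
    static Map<Integer, Integer> toClusterCounts(Map<String, Integer> wordCounts,
                                                 Map<String, Integer> wordToCluster) {
        Map<Integer, Integer> clusterCounts = new HashMap<>();
        for (Map.Entry<String, Integer> e : wordCounts.entrySet()) {
            Integer cluster = wordToCluster.get(e.getKey());
            if (cluster == null) continue; // word not assigned to any cluster
            clusterCounts.merge(cluster, e.getValue(), Integer::sum);
        }
        return clusterCounts;
    }
}
```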
Comments

References

Saturday, June 04, 2005

Notes from a "Practical guide to SVM Classification"

A Practical Guide to SVM Classification

Recommended steps:
  1. Transform data into SVM format
  2. Scale data to either [-1,+1] or [0,1] (see the sketch after this list)
  3. Try the RBF kernel function
  4. Use cross-validation to find the best parameters - C (the penalty parameter of the error term) and gamma (a parameter of the RBF kernel function)
  5. Train with these parameters
  6. Test
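Here is a minimal sketch of step 2 (scaling to [0,1]); the important detail, if I read the guide correctly, is that the test data must be scaled with the min/max computed on the training data:

```java
// Sketch of min/max scaling: fit ranges on the training rows, reuse them on test rows.
class FeatureScaler {
    double[] min, max;

    void fit(double[][] trainingRows) {
        int dims = trainingRows[0].length;
        min = new double[dims];
        max = new double[dims];
        java.util.Arrays.fill(min, Double.POSITIVE_INFINITY);
        java.util.Arrays.fill(max, Double.NEGATIVE_INFINITY);
        for (double[] row : trainingRows) {
            for (int j = 0; j < dims; j++) {
                min[j] = Math.min(min[j], row[j]);
                max[j] = Math.max(max[j], row[j]);
            }
        }
    }

    double[] scale(double[] row) {
        double[] out = new double[row.length];
        for (int j = 0; j < row.length; j++) {
            out[j] = (max[j] == min[j]) ? 0.0 : (row[j] - min[j]) / (max[j] - min[j]);
        }
        return out;
    }
}
```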
Issues:
  1. We do not have enough labeled data to perform cross-validation to determine the parameters
  2. The examples provided in the paper use trivial datasets - tens of features and thousands of data items

Scalability approach?

Given a testing dataset of 800,000 records, what is the effect of splitting the dataset into smaller chunks and performing the prediction in parallel? The smaller chunks may contain few or no data items from some categories, while other categories may be over-represented. Does that matter?
The speed advantage of splitting is self-evident.
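My current thinking: if the model is trained once and only the prediction step is split, each query is scored independently, so the category mix within a chunk shouldn't change the predictions. A sketch of the splitting (the Model interface is a stand-in for the trained classifier):

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch: run prediction over fixed-size chunks of the 800K queries in parallel.
class ParallelPredict {
    interface Model { String predict(String query); } // stand-in for the trained classifier

    static List<String> predictAll(List<String> queries, final Model model, int chunkSize)
            throws InterruptedException, ExecutionException {
        ExecutorService pool =
                Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        List<Future<List<String>>> futures = new ArrayList<>();
        for (int start = 0; start < queries.size(); start += chunkSize) {
            final List<String> chunk =
                    queries.subList(start, Math.min(start + chunkSize, queries.size()));
            futures.add(pool.submit(new Callable<List<String>>() {
                public List<String> call() {
                    List<String> out = new ArrayList<>();
                    for (String q : chunk) out.add(model.predict(q)); // each query scored independently
                    return out;
                }
            }));
        }
        List<String> results = new ArrayList<>();
        for (Future<List<String>> f : futures) results.addAll(f.get());
        pool.shutdown();
        return results;
    }
}
```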

KDD Cup: Using SVMs

Why SVMs?
  1. Several studies (see earlier entries in this blog) say it's a good technique
Software
So far, I have found only one (SVM-Light).
It's a two-step process - learn on the training set and predict using the testing set.
The Windows version seems buggy - it ran much faster on my less powerful Linux box.

Data format
  1. Treat each query as a document (see the format sketch after this list)
  2. Training set:
    • ...
    • What if there are several classes?
  3. Testing set:
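Assuming I stay with SVM-Light, each training/testing line is a sparse "target index:value ..." record; here is a sketch of emitting one line for a query under a binary one-category-vs-rest setup (one possible answer to the multi-class question above):

```java
// Sketch: turn a query's bag of words into one line of a sparse SVM-Light-style
// input record. The feature-index mapping is assumed to exist elsewhere.
class SvmLightFormat {
    static String toLine(int target, java.util.SortedMap<Integer, Double> features) {
        StringBuilder line = new StringBuilder(target > 0 ? "+1" : "-1");
        for (java.util.Map.Entry<Integer, Double> e : features.entrySet()) {
            line.append(' ').append(e.getKey()).append(':').append(e.getValue());
        }
        return line.toString(); // e.g. "+1 3:1.0 17:2.0 512:1.0"
    }
}
```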
Open Issues:
  1. Scalability - what's the maximum number of dimensions and data items it can process in a reasonable amount of time? (it could not process 100,000 records with 500,000 features)
  2. Data pre-processing
  3. There are several kernel functions and associated parameters. We need to determine, by trial and error, which should be used.
  4. Need to build a smaller subset of the data for experimenting

Supervised Learning Approaches

Approaches for Text Categorization
  1. SVM (Support Vector Machines)
  2. Decision Trees
  3. Neural Networks

KDD Cup: Brain Storming

Learning algorithm options:
  1. Supervised Learning algorithms, or
  2. Unsupervised Learning algorithms or
  3. Latent Semantic Indexing + Supervised Learning?
Data Pre-processing
  1. Use standard techniques - Porter stemming, etc.
  2. Custom code based on analyzing characteristics of the dataset
  3. Build a custom synonym dictionary? (or use LSI techniques?)
  4. Polysemy is not an issue because we can map a query to up to 5 categories
  5. Need to greatly reduce the number of unique words from about 799,000+ to something more "manageable" - depends on the algorithm what "manageable" means
  6. Need to enhance the training dataset to ensure the training set contains examples of all categories
Determining validity of results
  1. Hand build a custom validation set
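A sketch of how the hand-built validation set could be used - score each query's predicted categories against the hand-picked ones and report precision/recall/F1 (names are illustrative):

```java
import java.util.*;

// Sketch: score the classifier against a small hand-labeled validation set.
// Each returned category for a query counts toward precision/recall.
class Validation {
    static void report(Map<String, Set<String>> predicted,    // query -> predicted categories
                       Map<String, Set<String>> handLabeled) { // query -> hand-picked categories
        int returned = 0, relevant = 0, correct = 0;
        for (Map.Entry<String, Set<String>> e : handLabeled.entrySet()) {
            Set<String> truth = e.getValue();
            Set<String> pred = predicted.getOrDefault(e.getKey(), Collections.<String>emptySet());
            relevant += truth.size();
            returned += pred.size();
            for (String cat : pred) if (truth.contains(cat)) correct++;
        }
        double precision = returned == 0 ? 0 : (double) correct / returned;
        double recall = relevant == 0 ? 0 : (double) correct / relevant;
        double f1 = (precision + recall) == 0 ? 0 : 2 * precision * recall / (precision + recall);
        System.out.printf("precision=%.3f recall=%.3f F1=%.3f%n", precision, recall, f1);
    }
}
```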

KDD Cup: The Challenge

It's a Text Categorization challenge.
You're provided with 3 files:
  1. Categories file. Contains a list of 74 Categories.
  2. Queries file. Contains a list of 800,000 queries that need to be categorized/classified. A query can belong to at most 5 categories.
  3. CategoriesQueriesSample file (training file). Contains 26 queries mapped to 52 of the 74 categories.
Categories
  • The categories are all hierarchical - "Shopping\Online Stores"
  • Not all categories are present in the training dataset (22 missing) which is an issue.
Queries
  • Each query consists of a number of terms
  • Preparing the query data for training and predicting is one of the fundamental challenges of this competition. This involves:
    • Decomposing queries into a bag of words
    • Cleaning the words (stemming, splitting, etc.)
  • Size of the dataset and size of the features is another challenge
To summarize, the challenges are:
  1. The dataset is vast - a large number of queries and a large number of unique terms (features)
  2. Data pre-processing - converting queries to a bag of words (see the sketch after this list)
  3. Training dataset too small
  4. Training dataset does not contain data items for all categories
  5. How do you measure success? No validation data set. Training dataset too small to effectively predict learning error.
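For challenge 2, the most basic bag-of-words decomposition might look like the sketch below; real cleaning (stemming, splitting run-together words like "abusivierelationships") would layer on top:

```java
import java.util.*;

// Sketch: lowercase, split on anything that isn't a letter or digit, drop very short tokens.
class BagOfWords {
    static List<String> tokenize(String query) {
        List<String> tokens = new ArrayList<>();
        for (String t : query.toLowerCase().split("[^a-z0-9]+")) {
            if (t.length() > 1) tokens.add(t);
        }
        return tokens;
    }
}
```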

Quest for the KDD Cup

We're entering the fray.
Seems daunting - mere CS Master's students in a fray dominated by PhDs and, more frighteningly, Math PhDs.

But as Herzl said, "If you will it, it is not fantasy".