Thursday, June 30, 2005

KDD Update

So my results using Multinomial Naive Bayes with some tweaking have worked out well.

Using only 1697 features, I could classify 75K queries.
This is remarkable considering that there are 83K features and 800K queries in total.
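For reference, here is a minimal sketch of the multinomial Naive Bayes scoring I'm relying on (class and field names are illustrative, not my actual code): score each category by its log prior plus the sum of per-term log likelihoods with Laplace smoothing, then keep the top-scoring categories.

```java
import java.util.*;

// Minimal multinomial Naive Bayes scorer (illustrative sketch, not the actual KDD code).
class MultinomialNB {
    // termCounts.get(cat).get(term) = count of term in training queries labeled cat
    Map<String, Map<String, Integer>> termCounts = new HashMap<>();
    Map<String, Integer> totalTermsPerCat = new HashMap<>(); // total term occurrences per category
    Map<String, Integer> docsPerCat = new HashMap<>();       // number of training queries per category
    int totalDocs = 0;
    int vocabularySize = 1697;                               // size of the reduced feature set

    // Score one query (already split into terms) against every category.
    Map<String, Double> score(List<String> queryTerms) {
        Map<String, Double> scores = new HashMap<>();
        for (String cat : docsPerCat.keySet()) {
            double logProb = Math.log((double) docsPerCat.get(cat) / totalDocs); // log prior
            int catTotal = totalTermsPerCat.get(cat);
            for (String term : queryTerms) {
                int count = termCounts.get(cat).getOrDefault(term, 0);
                // Laplace smoothing so an unseen term does not zero out the category
                logProb += Math.log((count + 1.0) / (catTotal + vocabularySize));
            }
            scores.put(cat, logProb);
        }
        return scores; // caller keeps the top 5 categories by score
    }
}
```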

The other "tweak" that I might perform - given the constraint of time and hardware resources - is to "hand tweak" some of the queries.
For e.g. for "legal action against harrasing creditors" suggest "Information\Law & Politics".
legal assistant jobs in news york city => Information\Law & Politics and Living\Career & Jobs

So it's an iterative, genetic-algorithm-style approach:
- classify the entire 800K
- scan through and manually pick k% of the "good" categorizations that you want replicated
- pick another k% that were badly categorized and manually assign categories to them
- add these to the training set
- build a model
- run classifier

This iterative approach should produce better results.

Wednesday, June 22, 2005

Use of Naive Bayes along with SVMs

I plan to use Naive Bayes in the following manner:
  • As a pre-processing step for SVMs (see the sketch after this list):
    • Feature Selection - drastically reduce the number of terms so the SVM can scale
    • Enhancing the training set - increase its size
  • As part of an ensemble of techniques - implement three techniques (Multivariate Bernoulli NB, Multinomial NB, and SVM) and use the results from all three to build confidence in the classification
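As a rough illustration of the feature-selection idea, here is a sketch that ranks terms by a Naive-Bayes-style log-odds score for one category versus the rest and keeps only the top k. The names and the exact criterion are placeholders for whatever ends up working:

```java
import java.util.*;
import java.util.stream.Collectors;

// Sketch: use Naive Bayes statistics to rank terms and keep only the top K for the SVM.
class NBFeatureSelector {
    // countInCat[term]  = occurrences in queries of the target category
    // countOutCat[term] = occurrences everywhere else
    Set<String> selectTopTerms(Map<String, Integer> countInCat,
                               Map<String, Integer> countOutCat,
                               int k) {
        int inTotal = countInCat.values().stream().mapToInt(Integer::intValue).sum();
        int outTotal = countOutCat.values().stream().mapToInt(Integer::intValue).sum();
        int vocab = countInCat.size() + countOutCat.size(); // rough vocabulary size for smoothing

        Map<String, Double> scores = new HashMap<>();
        for (String term : countInCat.keySet()) {
            double pIn = (countInCat.getOrDefault(term, 0) + 1.0) / (inTotal + vocab);
            double pOut = (countOutCat.getOrDefault(term, 0) + 1.0) / (outTotal + vocab);
            scores.put(term, Math.abs(Math.log(pIn / pOut))); // terms that discriminate score high
        }

        // keep the k highest-scoring terms
        return scores.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .limit(k)
                .map(Map.Entry::getKey)
                .collect(Collectors.toCollection(LinkedHashSet::new));
    }
}
```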

Monday, June 20, 2005

KDD Results

I have my first set of results:
  • Algorithm - Naive Bayes
  • Pre-processing - N-grams
Issues:
  • How do you validate your results?
  • My results did not make sense at all - for example, "abusivierelationships" was placed in "Shopping\Online Stores", "Shopping\Buying Guides", "Work & Money\Industries", "Entertainment\Arts & Culture", "Computing\Hardware" in descending order of probability! I think the category "Personal\Relationships" should have been in there
  • I did not use a binary classifier

Sunday, June 19, 2005

MySQL driver issue?

Calling setAutoCommit(false) does not work - the driver commits every insert anyway.
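For reference, this is the pattern I'm attempting (table and column names are made up): disable auto-commit, batch the inserts, and commit once at the end. One possible explanation I haven't verified: if the tables are MyISAM (the MySQL default), they are non-transactional, so setAutoCommit(false) has no effect and every insert is committed immediately; InnoDB would be needed for real transactions.

```java
import java.sql.*;
import java.util.List;

// Sketch of the intended batching (illustrative schema, not the actual one).
class BatchInsert {
    static void insertTerms(Connection conn, List<String[]> rows) throws SQLException {
        conn.setAutoCommit(false); // expect nothing to be committed until commit() below
        PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO query_terms (query_id, term) VALUES (?, ?)");
        for (String[] row : rows) {
            ps.setInt(1, Integer.parseInt(row[0]));
            ps.setString(2, row[1]);
            ps.addBatch();
        }
        ps.executeBatch();
        conn.commit();             // expected: rows become visible only here
        ps.close();
    }
}
```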

KDD Update

We've been heads down cranking away.
  • I've switched gears - using Naive Bayes instead of SVM (explanation later).
  • I need to read up on MySQL scalability - there's a book out by Jeremy Z. which seems interesting
  • I need to purchase a profiler since I keep running out of memory
  • I abandoned Hibernate since I did not have the time to read its documentation and the performance was quite bad (could be my code or the db schema)

Tuesday, June 07, 2005

The panacea for feature reduction?

Proposes the use of Soundex and N-grams for feature reduction
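A rough sketch of what both reductions could look like - character n-grams plus a (simplified) Soundex code so that misspellings collapse onto the same feature:

```java
import java.util.*;

// Sketch of the two reductions: character n-grams and a simplified Soundex code.
// (This is a simplified Soundex; the full rules have more edge cases.)
class FeatureReduction {

    // Break a term into overlapping character n-grams, e.g. "credit" -> [cre, red, edi, dit] for n=3.
    static List<String> charNGrams(String term, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= term.length(); i++) {
            grams.add(term.substring(i, i + n));
        }
        return grams;
    }

    // Map a term to a 4-character phonetic code so that spelling variants
    // like "harrasing"/"harassing" collapse onto the same feature.
    static String soundex(String term) {
        String s = term.toUpperCase().replaceAll("[^A-Z]", "");
        if (s.isEmpty()) return "";
        String codes = "01230120022455012623010202"; // A..Z -> digit ('0' = ignored letter)
        StringBuilder out = new StringBuilder();
        out.append(s.charAt(0));
        char prev = codes.charAt(s.charAt(0) - 'A');
        for (int i = 1; i < s.length() && out.length() < 4; i++) {
            char code = codes.charAt(s.charAt(i) - 'A');
            if (code != '0' && code != prev) out.append(code);
            prev = code;
        }
        while (out.length() < 4) out.append('0');
        return out.toString();
    }
}
```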

Sunday, June 05, 2005

SVM Software: SVM-Light

Software Reference: Bow: A Toolkit for Statistical Language Modeling, Text Retrieval, Classification and Clustering

No updates have been made since 2002, though it might have some good algorithms for data pre-processing.

KDD Cup: Crossroads

A lot of papers claim SVMs work well in high-d. In the examples they provide, high-d is roughly in the tens of thousands. What if you have a dataset that has 200,000 dimensions? The paper "A Divisive Information-Theoretic Feature Clustering" claims SVMs are problematic in high-d. It is, however, not a well-cited paper (only 5 citations) and uses linear SVMs. Could it also be that the issue was with the software they used?

Paper Review: A Divisive Information-Theoretic Feature Clustering

A Divisive Information-Theoretic Feature Clustering Algorithm for Text Classification
Author: Inderjit Dhillon

Overview
Text classification of high-d data using SVMs is challenging. They propose using feature clustering to reduce the dimensionality instead. They propose a global criterion for feature clustering and an algorithm for clustering based on this objective function. Their experiments contrast their approach with SVM and Naive Bayes.

Interesting points/concepts
  1. Claim - dimensionality of 14,538 can be a severe obstacle for classification algorithms based on SVMs, LDAs, k-nearest neighbors
  2. "One can reduce dimensionality by the distributional clustering of words and features. Each word cluster can then be teated as a single feature and dimensionality can be drastically reduced. Feature Clustering is more effective than feature selection. "
    1. Feature Clustering? is better than feature selection for reducing dimensionality
  3. "Extend any classifier to perform hierarchical classification by constructing a (distinct) classifier at each internal node of the tree using all the documents in its child nodes as the training data. Thus the tree is assumed to be “is-a” hierarchy, i.e., the training instances are inherited by the parents."
  4. "Hierarchical classification along with feature selection has been shown to achieve better
    classification results than a flat classifier"
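To make point 2 concrete, here is a tiny sketch of how a query's per-word counts would collapse into per-cluster counts once a word-to-cluster assignment exists (the assignment itself would come from the clustering step; names are illustrative):

```java
import java.util.*;

// Sketch: once words are clustered, a query's feature vector shrinks from
// one dimension per word to one dimension per cluster.
class ClusterFeatures {
    static Map<Integer, Integer> toClusterCounts(Map<String, Integer> wordCounts,
                                                 Map<String, Integer> wordToCluster) {
        Map<Integer, Integer> clusterCounts = new HashMap<>();
        for (Map.Entry<String, Integer> e : wordCounts.entrySet()) {
            Integer cluster = wordToCluster.get(e.getKey());
            if (cluster == null) continue; // word not assigned to any cluster
            clusterCounts.merge(cluster, e.getValue(), Integer::sum);
        }
        return clusterCounts;
    }
}
```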
Comments

References

Saturday, June 04, 2005

Notes from a "Practical guide to SVM Classification"

A Practical Guide to SVM Classification

Recommended steps:
  1. Transform data into SVM format
  2. Scale data to either [-1,+1] or [0,1] (see the sketch after this list)
  3. Try the RBF kernel function
  4. Use cross-validation to find the best parameters - C (the penalty parameter of the error term) and gamma (a parameter of the RBF kernel function)
  5. Train with these parameters
  6. Test
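Here is a minimal sketch of step 2 (scaling to [0,1]); the important detail, if I read the guide correctly, is that the test data must be scaled with the min/max computed on the training data:

```java
// Sketch of min/max scaling: fit ranges on the training rows, reuse them on test rows.
class FeatureScaler {
    double[] min, max;

    void fit(double[][] trainingRows) {
        int dims = trainingRows[0].length;
        min = new double[dims];
        max = new double[dims];
        java.util.Arrays.fill(min, Double.POSITIVE_INFINITY);
        java.util.Arrays.fill(max, Double.NEGATIVE_INFINITY);
        for (double[] row : trainingRows) {
            for (int j = 0; j < dims; j++) {
                min[j] = Math.min(min[j], row[j]);
                max[j] = Math.max(max[j], row[j]);
            }
        }
    }

    double[] scale(double[] row) {
        double[] out = new double[row.length];
        for (int j = 0; j < row.length; j++) {
            out[j] = (max[j] == min[j]) ? 0.0 : (row[j] - min[j]) / (max[j] - min[j]);
        }
        return out;
    }
}
```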
Issues:
  1. We do not have enough labeled data to perform cross-validation to determine the parameters
  2. The examples provided in the paper use trivial datasets - tens of features and thousands of data items

Scalability approach?

Given a testing dataset of 800,000 records, what is the effect of splitting the dataset into smaller chunks and performing the prediction in parallel? The smaller chunks may contain few or no data items from some categories, while other categories may be over-represented. Does that matter?
The speed advantage of splitting is self-evident.
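My current thinking: if the model is trained once and only the prediction step is split, each query is scored independently, so the category mix within a chunk shouldn't change the predictions. A sketch of the splitting (the Model interface is a stand-in for the trained classifier):

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch: run prediction over fixed-size chunks of the 800K queries in parallel.
class ParallelPredict {
    interface Model { String predict(String query); } // stand-in for the trained classifier

    static List<String> predictAll(List<String> queries, final Model model, int chunkSize)
            throws InterruptedException, ExecutionException {
        ExecutorService pool =
                Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        List<Future<List<String>>> futures = new ArrayList<>();
        for (int start = 0; start < queries.size(); start += chunkSize) {
            final List<String> chunk =
                    queries.subList(start, Math.min(start + chunkSize, queries.size()));
            futures.add(pool.submit(new Callable<List<String>>() {
                public List<String> call() {
                    List<String> out = new ArrayList<>();
                    for (String q : chunk) out.add(model.predict(q)); // each query scored independently
                    return out;
                }
            }));
        }
        List<String> results = new ArrayList<>();
        for (Future<List<String>> f : futures) results.addAll(f.get());
        pool.shutdown();
        return results;
    }
}
```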

KDD Cup: Using SVMs

Why SVMs?
  1. Several studies (see earlier entries in this blog) say it's a good technique
Software
So far, I have found only one (SVM-Light).
It's a two-step process - learn on the training set and predict using the testing set.
The Windows version seems buggy - it ran much faster on my less powerful Linux box.

Data format
  1. Treat each query as a document (see the format sketch after this list)
  2. Training set:
    • ...
    • What if there are several classes?
  3. Testing set:
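Assuming I stay with SVM-Light, each training/testing line is a sparse "target index:value ..." record; here is a sketch of emitting one line for a query under a binary one-category-vs-rest setup (one possible answer to the multi-class question above):

```java
// Sketch: turn a query's bag of words into one line of a sparse SVM-Light-style
// input record. The feature-index mapping is assumed to exist elsewhere.
class SvmLightFormat {
    static String toLine(int target, java.util.SortedMap<Integer, Double> features) {
        StringBuilder line = new StringBuilder(target > 0 ? "+1" : "-1");
        for (java.util.Map.Entry<Integer, Double> e : features.entrySet()) {
            line.append(' ').append(e.getKey()).append(':').append(e.getValue());
        }
        return line.toString(); // e.g. "+1 3:1.0 17:2.0 512:1.0"
    }
}
```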
Open Issues:
  1. Scalability - what's the maximum number of dimensions and data items it can process in a reasonable amount of time? (it could not process 100,000 records with 500,000 features)
  2. Data pre-processing
  3. There are several kernel functions and associated parameters. We need to determine, by trial and error, which should be used.
  4. Need to build a smaller subset of the data for experimenting

Supervised Learning Approaches

Approaches for Text Categorization
  1. SVM (Support Vector Machines)
  2. Decision Trees
  3. Neural Networks

KDD Cup: Brain Storming

Learning algorithm options:
  1. Supervised Learning algorithms, or
  2. Unsupervised Learning algorithms or
  3. Latent Semantic Indexing + Supervised Learning?
Data Pre-processing
  1. Use standard techniques - Porter stemming, etc.
  2. Custom code based on analyzing characteristics of the dataset
  3. Build a custom synonym dictionary? (or use LSI techniques?)
  4. Polysemy is not an issue because we can map a query to up to 5 categories
  5. Need to greatly reduce the number of unique words from about 799,000+ to something more "manageable" - depends on the algorithm what "manageable" means
  6. Need to enhance the training dataset to ensure the training set contains examples of all categories
Determining validity of results
  1. Hand build a custom validation set
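A sketch of how the hand-built validation set could be used - score each query's predicted categories against the hand-picked ones and report precision/recall/F1 (names are illustrative):

```java
import java.util.*;

// Sketch: score the classifier against a small hand-labeled validation set.
// Each returned category for a query counts toward precision/recall.
class Validation {
    static void report(Map<String, Set<String>> predicted,    // query -> predicted categories
                       Map<String, Set<String>> handLabeled) { // query -> hand-picked categories
        int returned = 0, relevant = 0, correct = 0;
        for (Map.Entry<String, Set<String>> e : handLabeled.entrySet()) {
            Set<String> truth = e.getValue();
            Set<String> pred = predicted.getOrDefault(e.getKey(), Collections.<String>emptySet());
            relevant += truth.size();
            returned += pred.size();
            for (String cat : pred) if (truth.contains(cat)) correct++;
        }
        double precision = returned == 0 ? 0 : (double) correct / returned;
        double recall = relevant == 0 ? 0 : (double) correct / relevant;
        double f1 = (precision + recall) == 0 ? 0 : 2 * precision * recall / (precision + recall);
        System.out.printf("precision=%.3f recall=%.3f F1=%.3f%n", precision, recall, f1);
    }
}
```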

KDD Cup: The Challenge

It's a Text Categorization challenge.
You're provided with 3 files:
  1. Categories file. Contains a list of 74 Categories.
  2. Queries file. Contains a list of 800,000 queries that need to be categorized/classified. A query can belong to at most 5 categories.
  3. CategoriesQueriesSample file (training file). Contains 26 queries mapped to 52 of the 74 categories.
Categories
  • The categories are all hierarchical - "Shopping\Online Stores"
  • Not all categories are present in the training dataset (22 missing) which is an issue.
Queries
  • Each query consists of a number of terms
  • Preparing the query data for training and predicting is one of the fundamental challenges of this competition. This involves:
    • Decomposing queries into a bag of words
    • Cleaning the words (stemming, splitting, etc.)
  • Size of the dataset and size of the features is another challenge
To summarize, the challenges are:
  1. The dataset is vast - a large number of queries and a large number of unique terms (features)
  2. Data pre-processing - converting queries to a bag of words (see the sketch after this list)
  3. Training dataset too small
  4. Training dataset does not contain data items for all categories
  5. How do you measure success? No validation data set. Training dataset too small to effectively predict learning error.
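For challenge 2, the most basic bag-of-words decomposition might look like the sketch below; real cleaning (stemming, splitting run-together words like "abusivierelationships") would layer on top:

```java
import java.util.*;

// Sketch: lowercase, split on anything that isn't a letter or digit, drop very short tokens.
class BagOfWords {
    static List<String> tokenize(String query) {
        List<String> tokens = new ArrayList<>();
        for (String t : query.toLowerCase().split("[^a-z0-9]+")) {
            if (t.length() > 1) tokens.add(t);
        }
        return tokens;
    }
}
```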

Quest for the KDD Cup

We're entering the fray.
Seems daunting - mere CS Master's students in a fray dominated by PhDs and, more frighteningly, Math PhDs.

But as Herzl said, "If you will it, it is not fantasy".