Monday, May 30, 2005

Review: Text Categorization with Support Vector Machines: Learning with Many Relevant Features

Author: Thorsten Joachims
Topic: Text Categorization
Approach: Supervised Learning with SVMs

The paper explores the use of SVMs for text categorization. Joachims claims SVMs rock because they're fast, robust, and fully automatic - no parameter tuning required. (You need some background in SVMs to follow the paper.)

Interesting points/concepts:
  1. Assignment of text to a category is treated as a binary classification problem - the classifier decides whether a document belongs to the category or not.
  2. Used TF-IDF word weights to build a feature vector (one dimension per distinct word).
  3. Used Feature Selection to reduce the dimensions of the Feature Vector - should prevent overfitting.
    • Several FS options - DF thresholding, the chi-square test, the term strength criterion
    • Used information gain criteria as proposed by Yang
    • ?? Joachims argues that feature selection hurts text categorization, since there are very few irrelevant features, so removing features loses information
      • Isn't this a contradiction (with point 3 above)?
  4. Hypothesis of why SVMs should work well with Text Categorization
    1. SVMs work well in high dimensions - they do not overfit
    2. SVMs work well with sparse vectors
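Point 2 above can be sketched in a few lines. This is a minimal TF-IDF sketch, not the exact weighting variant from the paper (Joachims uses a specific tf-idf scheme with length normalization, which I'm omitting here); `tfidf_vectors` and the whitespace tokenization are my own simplifications:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build sparse TF-IDF feature vectors, one dict per document.

    TF  = raw count of the term in the document
    IDF = log(N / df), where df = number of documents containing the term
    """
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]  # naive tokenizer
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))  # count each term once per document
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: c * math.log(n / df[t]) for t, c in tf.items()})
    return vectors

docs = ["the cat sat", "the dog ran", "the cat ran"]
vecs = tfidf_vectors(docs)
# "the" appears in every document, so its IDF is log(3/3) = 0 -
# exactly the behavior that makes uninformative words vanish.
```

Note that with this weighting a term occurring in all N documents gets weight zero, which is why stopwords mostly take care of themselves even before any explicit feature selection.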
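For point 3, the information gain criterion (as in Yang's work) scores each term by how much knowing its presence/absence reduces uncertainty about the category. A rough sketch for a single binary category, assuming documents are given as token sets (`entropy` and `information_gain` are my own helper names, not from the paper):

```python
import math

def entropy(pos, total):
    """Binary entropy of a class split: pos positives out of total docs."""
    if total == 0 or pos == 0 or pos == total:
        return 0.0
    p = pos / total
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def information_gain(term, docs, labels):
    """IG(t) = H(C) - [P(t) H(C | t present) + P(not t) H(C | t absent)].

    docs: list of token sets; labels: 0/1 category membership per doc.
    """
    n = len(docs)
    with_t = [l for d, l in zip(docs, labels) if term in d]
    without_t = [l for d, l in zip(docs, labels) if term not in d]
    h_c = entropy(sum(labels), n)
    h_cond = ((len(with_t) / n) * entropy(sum(with_t), len(with_t))
              + (len(without_t) / n) * entropy(sum(without_t), len(without_t)))
    return h_c - h_cond

# Toy example: "buy" perfectly separates finance docs from sports docs,
# so its IG equals the full class entropy (1 bit here).
docs = [{"buy", "stock"}, {"buy", "car"}, {"game", "score"}, {"match", "score"}]
labels = [1, 1, 0, 0]
```

Ranking all terms by this score and keeping the top k is the usual FS step - and the paper's point is that for text, even the low-ranked terms still carry some signal, so cutting them can hurt.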
Comments:
  1. The overview of SVMs is quite high level and theoretical - details of the algorithm/implementation were not described.


