Monday, May 30, 2005

Review: Text Categorization with Support Vector Machines: Learning with Many Relevant Features

Author: Thorsten Joachims
Topic: Text Categorization
Approach: Supervised Learning with SVMs

The paper explores the use of SVMs for text categorization. Joachims claims SVMs rock because they're fast, robust, and fully automatic - no parameter tuning required. (You need some background in SVMs to follow the paper.)

Interesting points/concepts:
  1. Assignment of text to a category is treated as a binary classification problem - the classifier decides whether a document belongs to the category or not.
  2. Used TF-IDF word weights to build a feature vector (one dimension per distinct word).
  3. Used Feature Selection to reduce the dimensions of the Feature Vector - should prevent overfitting.
    • Several FS options - DF thresholding, the chi-square test, the term strength criterion
    • Used information gain criteria as proposed by Yang
    • ?? Joachims argues that feature selection hurts text categorization, since there are very few irrelevant features, so removing features loses information
      • Isn't this a contradiction (with point 3 above)?
  4. Hypothesis of why SVMs should work well with Text Categorization
    1. SVMs work well in high dimensions - they do not overfit
    2. SVMs work well with sparse vectors
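Point 2 above can be sketched in a few lines. This is a minimal TF-IDF sketch, not the exact weighting variant from the paper (Joachims uses a specific tf-idf scheme with length normalization, which I'm omitting here); `tfidf_vectors` and the whitespace tokenization are my own simplifications:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build sparse TF-IDF feature vectors, one dict per document.

    TF  = raw count of the term in the document
    IDF = log(N / df), where df = number of documents containing the term
    """
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]  # naive tokenizer
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))  # count each term once per document
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: c * math.log(n / df[t]) for t, c in tf.items()})
    return vectors

docs = ["the cat sat", "the dog ran", "the cat ran"]
vecs = tfidf_vectors(docs)
# "the" appears in every document, so its IDF is log(3/3) = 0 -
# exactly the behavior that makes uninformative words vanish.
```

Note that with this weighting a term occurring in all N documents gets weight zero, which is why stopwords mostly take care of themselves even before any explicit feature selection.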
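For point 3, the information gain criterion (as in Yang's work) scores each term by how much knowing its presence/absence reduces uncertainty about the category. A rough sketch for a single binary category, assuming documents are given as token sets (`entropy` and `information_gain` are my own helper names, not from the paper):

```python
import math

def entropy(pos, total):
    """Binary entropy of a class split: pos positives out of total docs."""
    if total == 0 or pos == 0 or pos == total:
        return 0.0
    p = pos / total
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def information_gain(term, docs, labels):
    """IG(t) = H(C) - [P(t) H(C | t present) + P(not t) H(C | t absent)].

    docs: list of token sets; labels: 0/1 category membership per doc.
    """
    n = len(docs)
    with_t = [l for d, l in zip(docs, labels) if term in d]
    without_t = [l for d, l in zip(docs, labels) if term not in d]
    h_c = entropy(sum(labels), n)
    h_cond = ((len(with_t) / n) * entropy(sum(with_t), len(with_t))
              + (len(without_t) / n) * entropy(sum(without_t), len(without_t)))
    return h_c - h_cond

# Toy example: "buy" perfectly separates finance docs from sports docs,
# so its IG equals the full class entropy (1 bit here).
docs = [{"buy", "stock"}, {"buy", "car"}, {"game", "score"}, {"match", "score"}]
labels = [1, 1, 0, 0]
```

Ranking all terms by this score and keeping the top k is the usual FS step - and the paper's point is that for text, even the low-ranked terms still carry some signal, so cutting them can hurt.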
Comments:
  1. The overview of SVMs is quite high level and theoretical - details of the algorithm/implementation were not described.


