Sunday, May 29, 2005

Review: Inductive Learning Algorithms and Representations for Text Categorization

The paper compares five supervised learning techniques for text categorization on three criteria: learning speed, real-time classification speed, and classification accuracy. The techniques are Find Similar, Naive Bayes, Bayesian Networks, Decision Trees, and Support Vector Machines (SVMs). In the authors' opinion, linear SVMs (in particular those trained with Platt's SMO, Sequential Minimal Optimization) are the most promising because they are accurate, quick to train, and quick to evaluate.
Their dataset for comparison is a collection of hand-tagged financial stories from Reuters.

Interesting points/concepts:
  1. Text Categorization is the assignment of texts to one or more predefined categories based on their content.
  2. "Inductive Learning techniques automatically construct classifiers using labeled training data"
  3. Feature Selection is needed to improve both efficiency and efficacy
    • They used the mutual information between feature Xi and category c, MI(Xi, c)
    • The features with the highest scores for a category are the ones kept for that category
  4. Find Similar Classifier
    • A weight is calculated for each term based on documents judged relevant and irrelevant
  5. Naive Bayes Classifier was found to be very simple and quite effective
  6. Bayes Net
    • They used a 2-dependence Bayesian classifier that allows the probability of each feature to be directly influenced by the appearance/non-appearance of at most two other features
    • Provided very little improvement over Naive Bayes
  7. SVM
    • Used the simplest, linear version of the SVM, which gives fast and accurate classifiers
    • They used a method developed by Platt (see the references) to train the SVM classifier
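
The mutual information score from point 3 can be sketched in a few lines. This is only an illustration, assuming binary features (a term is either present in a document or not) and a binary category; the function names, the toy documents and the "earnings" label are my own and not taken from the paper.

```python
import math
from collections import Counter

def mutual_information(docs, labels, term):
    """MI between presence of `term` and a binary category label.

    docs   -- list of sets of tokens (assumed binary bag-of-words)
    labels -- list of booleans, True if the document is in the category
    """
    n = len(docs)
    # Joint counts over (term present?, in category?)
    joint = Counter((term in doc, label) for doc, label in zip(docs, labels))
    mi = 0.0
    for (x, c), count in joint.items():
        p_xc = count / n
        p_x = sum(v for (xx, _), v in joint.items() if xx == x) / n
        p_c = sum(v for (_, cc), v in joint.items() if cc == c) / n
        # Cells with zero count never appear in `joint`, so log is safe.
        mi += p_xc * math.log(p_xc / (p_x * p_c))
    return mi

def select_features(docs, labels, k):
    """Return the k terms with the highest MI (ties broken alphabetically)."""
    vocab = set().union(*docs)
    ranked = sorted(vocab, key=lambda t: (-mutual_information(docs, labels, t), t))
    return ranked[:k]

# Toy data: is a story about "earnings" or not?
docs = [{"profit", "rose"}, {"profit", "fell"}, {"wheat", "crop"}, {"wheat", "rose"}]
labels = [True, True, False, False]
print(select_features(docs, labels, 2))
```

On this toy data "profit" and "wheat" each separate the two classes perfectly, while "rose" appears equally often in both and so carries no information about the category.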
Comments:
  1. Classifiers were not described well
  2. Naive Bayes seems like a quick and dirty solution that might be good enough
  3. Is SVM the best way to perform text categorization?
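
To show why Naive Bayes feels like a quick solution, here is a minimal sketch of a Naive Bayes classifier over binary term features. It is not the paper's implementation: the add-one smoothing, the function names, the category names and the toy data are all assumptions of mine.

```python
import math

def train_naive_bayes(docs, labels, vocab):
    """Estimate class priors and per-class term presence probabilities."""
    model = {}
    for c in set(labels):
        class_docs = [d for d, l in zip(docs, labels) if l == c]
        prior = len(class_docs) / len(docs)
        # Add-one smoothing (an assumption here) keeps every
        # probability strictly between 0 and 1.
        p_term = {
            t: (sum(t in d for d in class_docs) + 1) / (len(class_docs) + 2)
            for t in vocab
        }
        model[c] = (prior, p_term)
    return model

def classify(model, doc):
    """Return the class with the highest log posterior for `doc`."""
    def log_posterior(c):
        prior, p_term = model[c]
        score = math.log(prior)
        for t, p in p_term.items():
            # Each feature contributes independently (the "naive" part).
            score += math.log(p if t in doc else 1.0 - p)
        return score
    return max(model, key=log_posterior)

# Toy data with two hypothetical categories.
docs = [{"profit", "rose"}, {"profit", "fell"}, {"wheat", "crop"}, {"wheat", "rose"}]
labels = ["earn", "earn", "grain", "grain"]
model = train_naive_bayes(docs, labels, vocab={"profit", "wheat"})
print(classify(model, {"profit", "up"}))
```

Training is a single pass of counting and classification is a sum of logs, which is why the method is so cheap compared with the others.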
References of Interest:
  1. "A comparative study on feature selection in text categorization"
  2. "Fast Training of Support Vector Machines using Sequential Minimal Optimization"
  3. "Text categorization with support vector machines: Learning with many relevant features"
  4. "Expert Network: Effective and Efficient Learning from Human Decisions in Text Categorization and Retrieval"
