Hamotzi's Data Mining Log: Review: Inductive Learning Algorithms and Representations for Text Categorization

The paper compares five supervised learning techniques for text categorization on three criteria: learning speed, real-time classification speed, and classification accuracy. The techniques are Find Similar, Naive Bayes, Bayesian Networks, Decision Trees, and Support Vector Machines (SVMs). In the authors' opinion, linear SVMs (trained with Platt's SMO, Sequential Minimal Optimization) are the most promising because they are accurate, quick to train, and quick to evaluate.
Their dataset for comparison is a collection of hand-tagged financial stories from Reuters.

Interesting points/concepts:

Text Categorization is the assignment of texts to one or more predefined categories based on their content.
"Inductive Learning techniques automatically construct classifiers using labeled training data"
Feature Selection is needed to improve efficiency and efficacy

They used the mutual information MI(Xi, c) between each feature Xi and category c
The MI scores determine which features are kept for each category
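
To make the feature selection step concrete, here is a minimal sketch of scoring terms by mutual information with a category. It assumes binary features (a term is either present in a document or not) and a binary in/out category label; the function names and the toy documents are my own, not from the paper.

```python
import math
from collections import Counter


def mutual_information(feature_present, in_category):
    """MI in bits between a binary feature and a binary category label.

    Both arguments are equal-length lists of booleans, one entry per document.
    """
    n = len(feature_present)
    joint = Counter(zip(feature_present, in_category))
    mi = 0.0
    for (x, c), count in joint.items():
        p_xc = count / n
        # Marginals are obtained by summing the joint counts.
        p_x = sum(v for (xx, _), v in joint.items() if xx == x) / n
        p_c = sum(v for (_, cc), v in joint.items() if cc == c) / n
        mi += p_xc * math.log2(p_xc / (p_x * p_c))
    return mi


def top_k_features(docs, labels, k):
    """Return the k terms with the highest MI for this category.

    docs is a list of sets of terms; labels is a list of booleans.
    Ties are broken alphabetically so the result is deterministic.
    """
    vocab = set().union(*docs)
    scores = {t: mutual_information([t in d for d in docs], labels) for t in vocab}
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Hypothetical toy data: two "wheat" stories and two non-"wheat" stories.
docs = [{"wheat", "price"}, {"wheat", "export"}, {"bank", "price"}, {"bank", "rate"}]
labels = [True, True, False, False]
selected = top_k_features(docs, labels, 2)
```

On the toy data, "wheat" and "bank" are perfectly informative about the category, while "price" appears equally in both classes and scores zero.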

Find Similar Classifier

A weight is calculated for each term from documents judged relevant and irrelevant
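
A rough sketch of this idea, in the spirit of Rocchio-style relevance feedback: term weights are a weighted mean over relevant documents minus a weighted mean over irrelevant ones, and a new document is scored by its similarity to that weight vector. The beta and gamma values, the cosine similarity, and all names below are assumptions chosen for illustration; the paper's exact settings and similarity measure may differ.

```python
import math
from collections import defaultdict


def term_weights(relevant, irrelevant, beta=1.0, gamma=0.5):
    """Weight per term: beta * mean over relevant docs - gamma * mean over irrelevant docs.

    Documents are dicts mapping term -> weight. beta and gamma are illustrative.
    """
    weights = defaultdict(float)
    for doc in relevant:
        for term, w in doc.items():
            weights[term] += beta * w / len(relevant)
    for doc in irrelevant:
        for term, w in doc.items():
            weights[term] -= gamma * w / len(irrelevant)
    return dict(weights)


def cosine(weights, doc):
    """Cosine similarity between the category weight vector and a document."""
    dot = sum(w * doc.get(term, 0.0) for term, w in weights.items())
    norm_w = math.sqrt(sum(w * w for w in weights.values()))
    norm_d = math.sqrt(sum(v * v for v in doc.values()))
    if norm_w == 0.0 or norm_d == 0.0:
        return 0.0
    return dot / (norm_w * norm_d)


# Hypothetical toy data for a "wheat" category.
weights = term_weights(
    relevant=[{"wheat": 1.0, "export": 1.0}, {"wheat": 1.0}],
    irrelevant=[{"bank": 1.0}],
)
score_wheat = cosine(weights, {"wheat": 1.0})
score_bank = cosine(weights, {"bank": 1.0})
```

A document would then be assigned to the category when its score exceeds some threshold, which has to be tuned per category.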

Naive Bayes Classifier was found to be very simple and quite effective
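
To show how little machinery Naive Bayes needs, here is a minimal sketch over binary term-presence features. The add-one smoothing, the function names, and the toy data are my assumptions, not details taken from the paper.

```python
import math


def train_nb(docs, labels, vocab):
    """Estimate a log prior and per-term presence probabilities for each class.

    docs is a list of sets of terms; labels is a list of booleans.
    Assumes both classes occur at least once in the training data.
    """
    model = {}
    for c in (True, False):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        n_c = len(class_docs)
        # Add-one smoothing keeps every probability strictly between 0 and 1.
        p_term = {t: (sum(t in d for d in class_docs) + 1) / (n_c + 2) for t in vocab}
        model[c] = (math.log(n_c / len(docs)), p_term)
    return model


def log_posterior(model, doc, c):
    """Unnormalized log P(c | doc), treating terms as independent given the class."""
    log_prior, p_term = model[c]
    score = log_prior
    for t, p in p_term.items():
        score += math.log(p if t in doc else 1.0 - p)
    return score


def classify(model, doc):
    """True if the document is more likely in the category than out of it."""
    return log_posterior(model, doc, True) > log_posterior(model, doc, False)


# Hypothetical toy data for a "wheat" category.
docs = [{"wheat", "price"}, {"wheat", "export"}, {"bank", "price"}, {"bank", "rate"}]
labels = [True, True, False, False]
model = train_nb(docs, labels, set().union(*docs))
```

Training is a single counting pass and classification is a sum of logs, which fits the "quick and simple" impression.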
Bayes Net

They used a 2-dependence Bayesian classifier that allows the probability of each feature to be directly influenced by the appearance/non-appearance of at most two other features
Provided very little improvement over Naive Bayes

Support Vector Machines

They used the simplest linear version of the SVM, which gives fast and accurate classifiers
The SVM classifier was trained with a method developed by Platt (see the references)
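
For intuition about why a linear SVM is quick to evaluate, here is a small sketch. Note that it does not implement Platt's SMO: training below uses plain subgradient descent on the regularized hinge loss, swapped in because it fits in a few lines. The hyperparameters, names, and toy data are all assumptions.

```python
def train_linear_svm(X, y, lam=0.01, epochs=200, lr=0.1):
    """Fit weights w and bias b by subgradient descent on the hinge loss.

    X is a list of feature vectors (lists of floats); y holds labels in {-1, +1}.
    This is a stand-in for SMO, not SMO itself.
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            # L2 regularization shrinks the weights a little on every step.
            w = [wj * (1.0 - lr * lam) for wj in w]
            if margin < 1.0:
                # The example is inside the margin, so push the boundary away from it.
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b


def decision(w, b, x):
    """Classification is one dot product plus a bias."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


# Hypothetical toy data: the first feature marks category membership.
X = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [0.0, 0.0]]
y = [1, 1, -1, -1]
w, b = train_linear_svm(X, y)
```

A document is put in the category when decision(w, b, x) is positive, so evaluation cost grows only with the number of features.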

Comments:

Classifiers were not described well
Naive Bayes seems like a quick and dirty solution that might be good enough
Is SVM the best way to perform text categorization?

References of Interest:

"A comparative study on feature selection in text categorization"
"Fast Training of Support Vector Machines using Sequential Minimal Optimization"
"Text categorization with support vector machines: Learning with many relevant features"
"Expert Network: Effective and Efficient Learning from Human Decisions in Text Categorization and Retrieval"

Sunday, May 29, 2005