Review: Inductive Learning Algorithms and Representations for Text Categorization
The paper compares five inductive learning techniques for text categorization along three axes: learning speed, real-time classification speed, and classification accuracy. The techniques are Find Similar, Naive Bayes, Bayesian networks, decision trees, and Support Vector Machines (SVMs). In the authors' opinion, linear SVMs (in particular those trained with Platt's SMO, Sequential Minimal Optimization) are the most promising, as they are accurate, quick to train, and quick to evaluate.
Their dataset for the comparison is a collection of hand-tagged financial news stories from Reuters.
Interesting points/concepts:
- Text Categorization is the assignment of text to one or more predefined categories based on its content.
- "Inductive Learning techniques automatically construct classifiers using labeled training data"
- Feature Selection is needed to improve both efficiency and effectiveness
  - They rank terms by the mutual information MI(Xi; C) between a binary term-occurrence feature Xi and the category C
  - The top-scoring terms for each category are kept as features (a sketch of this ranking follows the list)
- Find Similar Classifier
  - A Rocchio-style relevance-feedback method: term weights are computed from the judged relevant and irrelevant documents, and new documents are scored by their similarity to the resulting category profile (sketched after the list)
- The Naive Bayes classifier was found to be very simple and quite effective (sketched after the list)
- Bayes Net
  - They used a 2-dependence Bayesian classifier, which allows the probability of each feature to be directly influenced by the appearance or non-appearance of at most two other features
  - It provided very little improvement over Naive Bayes
- SVM
  - They used the simplest, linear form of the SVM, which yields fast and accurate classifiers (sketched after the list)
  - The classifier was trained with Platt's Sequential Minimal Optimization (SMO) method (see the references)
- The classifiers themselves were not described in much detail
- Naive Bayes seems like a quick-and-dirty solution that might be good enough
- Is SVM the best way to perform text categorization?
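
A minimal sketch of the mutual-information feature ranking mentioned above. The toy corpus, the set-of-terms document representation, and the top-k cutoff are my own illustration, not the paper's actual preprocessing:

```python
import math
from collections import Counter

def mutual_information(docs, labels, term):
    """Estimate MI(X; C) between binary term presence X and binary category
    membership C from counts: sum over (x, c) of P(x, c) * log(P(x, c) / (P(x) * P(c)))."""
    n = len(docs)
    joint = Counter()
    for doc, c in zip(docs, labels):
        x = int(term in doc)              # 1 if the term occurs in the document
        joint[(x, c)] += 1
    mi = 0.0
    for (x, c), count in joint.items():
        p_xc = count / n
        p_x = sum(v for (xx, _), v in joint.items() if xx == x) / n
        p_c = sum(v for (_, cc), v in joint.items() if cc == c) / n
        mi += p_xc * math.log(p_xc / (p_x * p_c))
    return mi

# Toy corpus: each document is a set of terms; labels mark one category (1) vs. rest (0).
docs = [{"profit", "quarter"}, {"profit", "loss"}, {"election", "vote"}, {"vote", "poll"}]
labels = [1, 1, 0, 0]

# Rank the vocabulary by MI with the category and keep the top-k terms as features.
vocab = sorted(set().union(*docs))
ranked = sorted(vocab, key=lambda t: mutual_information(docs, labels, t), reverse=True)
print(ranked[:3])
```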
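Find Similar builds a category profile from the judged documents and scores new documents by similarity to it. The sketch below captures that general idea in Rocchio form; the beta/gamma constants, the toy term-weight vectors, and the decision threshold are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def rocchio_weights(X_rel, X_nonrel, beta=1.0, gamma=0.25):
    """Rocchio-style category profile: positive centroid minus a down-weighted
    negative centroid (beta and gamma are conventional, not the paper's values)."""
    return beta * X_rel.mean(axis=0) - gamma * X_nonrel.mean(axis=0)

def find_similar_score(x, w):
    """Cosine similarity between a document vector and the category profile."""
    return float(x @ w) / (np.linalg.norm(x) * np.linalg.norm(w) + 1e-12)

# Toy term-weight vectors: rows = documents, columns = terms.
X_rel = np.array([[2.0, 1.0, 0.0], [1.5, 0.5, 0.0]])      # judged relevant
X_nonrel = np.array([[0.0, 0.5, 2.0], [0.0, 0.0, 1.5]])   # judged irrelevant

w = rocchio_weights(X_rel, X_nonrel)
new_doc = np.array([1.0, 1.0, 0.0])

# Assign the category if the score clears a threshold tuned on held-out data.
print(find_similar_score(new_doc, w))
```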
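The Naive Bayes baseline is easy to reproduce at toy scale. This sketch uses scikit-learn's BernoulliNB over binary term-occurrence features, which matches the spirit (not the exact implementation) of the paper's classifier; the corpus and labels here are made up:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

# Made-up stand-in corpus; the paper uses binary term-occurrence features,
# which is what binary=True together with BernoulliNB models.
texts = [
    "profit rose sharply in the quarter",
    "quarterly profit and earnings are up",
    "election results and votes counted",
    "voters head to the polls today",
]
labels = [1, 1, 0, 0]  # 1 = the target category, 0 = everything else

vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(texts)

clf = BernoulliNB()
clf.fit(X, labels)

print(clf.predict(vectorizer.transform(["profit up again this quarter"])))
```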
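The paper trains its linear SVMs with Platt's SMO algorithm. As a stand-in, the sketch below uses scikit-learn's LinearSVC, which is a different optimizer but produces the same kind of linear decision function, evaluated with a single dot product per category; the tiny corpus and parameters are illustrative only:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Same made-up corpus as above; any small labeled collection works for the sketch.
texts = [
    "profit rose sharply in the quarter",
    "quarterly profit and earnings are up",
    "election results and votes counted",
    "voters head to the polls today",
]
labels = [1, 1, 0, 0]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# A linear SVM: training finds one weight vector per category, so classifying
# a new document is a single sparse dot product plus a threshold.
svm = LinearSVC(C=1.0)
svm.fit(X, labels)

new_X = vectorizer.transform(["profit up again this quarter"])
print(svm.decision_function(new_X))  # positive margin -> assign the category
```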
- "A comparative study on feature selection in text categorization"
- Fast Training of Support Vector Machines using Sequential Minimization Optimization
- Text categorization with support vector machines: Learning with many relevant features
- Expert Network: Effective and Efficient Learning from Human Decisions in Text Categorization and Retrieval