Sunday, August 28, 2005

Topic: Using Text for Prediction

This is a summary of a chapter (Using Text Mining for Prediction) from the book - "Text Mining: Predictive Methods for Analyzing Unstructured Information". It provides an overview of the topic and a high-level framework for tackling text prediction.

What is prediction?
Using past cases (training set) to project to new examples (test set).

Text prediction is a classification problem
We try to find a function - f: w -> L, where w is a vector of attributes of the data (document, sentence, token, etc.) and L is a label. In statistical terms, we draw a sample from a population, learn a model from it, and then apply the induced model to new, unlabeled examples drawn from the same population. Note that the data might have more than one label.
The classical prediction problem is text categorization. Here a set of categories is predefined, and the objective is to assign a category or topic to a new document. The question in this case is: does a new example belong to a particular category?

The pattern assumption
Given a training set, where the documents are labeled, we try to find a pattern of words that is characteristic of each label. We thus use the training set of documents to find patterns relative to a label where the label is the correct answer. This label pattern is the predictor used to determine if an unlabeled document belongs to this label or not.

How much data is needed to recognize a pattern (build a good label predictor)?
Difficult to say by analysis of the sample. You have to look at the results of the predictor to determine if it is good enough or if more data is needed. So it's a learning and feedback loop.
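
Here is a rough sketch (my own, not from the book) of that feedback loop: train on progressively larger samples of labeled documents and check accuracy on a held-out set to decide whether more data would help. The tiny corpus, the classifier choice, and the use of scikit-learn are all just for illustration.

```python
# Illustrative only: grow the training sample and watch held-out accuracy.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

train_docs = ["wheat prices rose", "stocks fell sharply", "corn harvest strong",
              "shares rallied on earnings", "grain exports increased", "bond yields climbed"]
train_labels = ["grain", "finance", "grain", "finance", "grain", "finance"]
test_docs = ["wheat exports fell", "shares dropped at the open"]
test_labels = ["grain", "finance"]

for n in (2, 4, 6):  # progressively larger training samples
    vectorizer = TfidfVectorizer()
    X_train = vectorizer.fit_transform(train_docs[:n])
    X_test = vectorizer.transform(test_docs)
    model = MultinomialNB().fit(X_train, train_labels[:n])
    print(f"{n} training documents -> held-out accuracy {model.score(X_test, test_labels):.2f}")
```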

Methods for classification
The following are the general categories of methods recommended for text prediction, since they can handle data that is sparse and high-dimensional:
  1. Similarity and Nearest-neighbor methods
  2. Logic methods
  3. Probabilistic methods
  4. Weighted or Linear Scoring methods
Similarity and Nearest-Neighbor Methods
This is an information retrieval problem. Given a document, we try to find the best label for it.
A basic algorithm is:
  • Compute the similarity of a document to all documents in a collection
  • Select the k documents that are most similar to a new document
  • The answer is the label that occurs most frequently in the k selected documents
The general implementation issues are (a small code sketch follows this list):
- What should k be? [it is usually determined by empirical analysis]
- How do you measure similarity? [tf-idf vector encoding with cosine similarity is the most popular measure]
- Speed (sequentially comparing a new document to all stored documents is inefficient)
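
Here is a rough sketch (my own, not from the book) of this basic algorithm, using tf-idf vectors, cosine similarity, and a majority vote over the k most similar documents. The toy corpus, the choice of k, and the use of scikit-learn are assumptions for illustration.

```python
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

train_docs = ["wheat prices rose", "corn harvest strong",
              "stocks fell sharply", "shares rallied on earnings"]
train_labels = ["grain", "grain", "finance", "finance"]

vectorizer = TfidfVectorizer()
train_vectors = vectorizer.fit_transform(train_docs)

def knn_label(new_doc, k=3):
    # similarity of the new document to every stored document
    sims = cosine_similarity(vectorizer.transform([new_doc]), train_vectors)[0]
    top_k = sims.argsort()[::-1][:k]          # the k most similar documents
    # the answer is the most frequent label among them
    return Counter(train_labels[i] for i in top_k).most_common(1)[0][0]

print(knn_label("corn and wheat exports"))    # likely "grain" on this toy data
```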

Comments:
- labels are secondary to the search process
- It's like a search engine problem, since no model or generalized patterns are derived from the data
- It requires little training effort
- These methods require more computation time than most other methods.

Logic Methods/Decision Rules
Given a set of documents that are examples of a label, we capture the pattern present in these documents in a set of rules. When a new unlabeled document is presented, we assign its label depending on whether any of the rules are satisfied.
Rules are composed of words - for example, a phrase that is a conjunction of words. As a result, the results are easy to understand and can be insightful: they suggest reasons for reaching a conclusion, and they are far easier to interpret than scores or measures of similarity. However, the rules can be less predictive if the underlying concept is complex. Finding these rules is a complex and time-consuming process compared to other methods. One has to ensure the rules are compact and not overspecialized. Techniques such as weakest-link pruning are used to find a compact set of rules that contain highly predictive words.
In comparison to nearest-neighbor methods, a lot of time is spent learning. However, it's a worthwhile process since the rules can be informative and insightful.
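
As a toy illustration (my own, not the book's rule-induction algorithm), here is how a learned set of rules - each a conjunction of words - could be applied to a new document. The rules themselves are made up; in practice they would be induced from the training set and pruned.

```python
# Hypothetical induced rules: a document matches a label if all the words of
# any one rule (a conjunction) appear in it.
rules = {
    "grain": [{"wheat", "harvest"}, {"corn", "exports"}],
    "finance": [{"stocks", "fell"}, {"shares", "earnings"}],
}

def matches(document, label):
    words = set(document.lower().split())
    return any(rule <= words for rule in rules[label])  # a rule is satisfied if it is a subset

print(matches("corn exports surged this quarter", "grain"))    # True
print(matches("corn exports surged this quarter", "finance"))  # False
```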

Probabilistic Methods
Given a class label C and a feature vector x that denotes the presence or absence of words from a dictionary, we try to estimate P(C|x). Naive Bayes is an approach for performing this probability estimation. It makes a simplifying assumption - that the words in a document are independent - and estimates P(C|x) by using P(xj|C), where xj is the presence or absence of a word in the document.
Naive Bayes is efficient and easy to implement but has poorer precision compared to other methods. It requires little computation or memory, though it needs a relatively small dictionary.
There are two popular Naive Bayes models for text classification - the Multivariate Bernoulli Model and the Multinomial Model. The multinomial model is frequently used, as it normalizes for document length and thus has better performance.
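
A minimal sketch (assuming scikit-learn; the corpus is made up) contrasting the two models - the multinomial model works on word counts, the Bernoulli model on word presence or absence:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

docs = ["wheat prices rose", "corn harvest strong",
        "stocks fell sharply", "shares rallied on earnings"]
labels = ["grain", "grain", "finance", "finance"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

multinomial = MultinomialNB().fit(X, labels)   # models word counts
bernoulli = BernoulliNB().fit(X, labels)       # models word presence/absence

new = vectorizer.transform(["wheat and corn prices"])
print(multinomial.predict(new), multinomial.predict_proba(new))  # estimated P(C|x)
print(bernoulli.predict(new))
```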

Weighted or Linear Scoring Methods
For each document D, we calculate Score(D) = w * x + b, where w is a vector of weights assigned to words, x is a binary vector representing the absence or presence of those words, and b is a constant. The general method is to assign a positive score to predict the positive class and a negative score to predict the negative class. Thus each word or feature is assigned a weight, which tells us how important that feature is as a predictor.

These techniques work very well on text classification - the best of all the techniques described here. They can learn much faster than the other methods, and they can handle very large dictionaries and extremely sparse data. Unlike Naive Bayes, they do not have problems with synonyms, and they can be used to determine which words are strong or precise predictors.
The math behind determining the weights is quite complex, though the implementations are not. Support Vector Machines are an example of a weighted scoring method.
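
Here is a rough sketch (assuming scikit-learn; the corpus is made up) of a linear support vector machine used as a weighted scoring method, Score(D) = w * x + b, with the learned per-word weights indicating which words are strong predictors:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

docs = ["wheat prices rose", "corn harvest strong",
        "stocks fell sharply", "shares rallied on earnings"]
labels = ["grain", "grain", "finance", "finance"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)
svm = LinearSVC().fit(X, labels)

# Score(D) = w . x + b; the sign of the score picks the class
print(svm.decision_function(vectorizer.transform(["corn and wheat prices"])))

# per-word weights: large magnitudes mark strong predictors for one class or the other
for word, weight in zip(vectorizer.get_feature_names_out(), svm.coef_[0]):
    print(f"{word}: {weight:+.3f}")
```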
