Tuesday, August 30, 2005

Topic: Document Clustering

My notes from the chapter "Finding Structure in a Document Collection" in the book "Text Mining: Predictive Methods for Analyzing Unstructured Information". This chapter provides an overview of document clustering - why cluster, issues involved, techniques, etc.

In document clustering, similar documents are grouped together into smaller subsets called clusters. Ideally, documents in a cluster are very similar to each other and quite dissimilar to documents in other clusters. Prior to clustering, the labels are unknown; only a similarity measure is used to group the documents. After clustering, the clusters are examined - one could automatically generate labels and assign them to the clusters. In this way, you completely bypass the expensive process of having humans come up with labels and generate a labeled training set. This approach, however, is less than perfect (why?). In spite of this (supposed) flaw, document clustering can be a useful task as it gives additional structure to the data. The clusters formed can be informative, and clustering can provide a means for further analysis and prediction. For example, if we know little about the structure of a document collection, clustering can be used to get a better understanding of the types of documents it contains - the features they share, the discriminating and descriptive features, and so on.

Clustering mechanisms
  1. k-means
  2. Hierarchical Clustering
  3. EM Algorithm
k-means
It is widely used for document clustering and is relatively efficient.
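A minimal sketch of how k-means might be run over tf-idf document vectors, assuming scikit-learn is available; the toy corpus, the choice of k, and the variable names are illustrative only, not from the book.

```python
# Sketch: k-means clustering of documents represented as tf-idf vectors.
# Assumes scikit-learn; the corpus and number of clusters are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "the stock market fell sharply today",
    "shares and bonds rallied on the exchange",
    "the team won the championship game",
    "the coach praised the players after the match",
]

vectors = TfidfVectorizer().fit_transform(docs)  # one tf-idf vector per document
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)
print(kmeans.labels_)  # cluster assignment for each document
```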

Hierarchical Clustering
Agglomerative clustering organizes documents hierarchically. Similarity between clusters can be measured by the average, maximum, or minimum distance between their documents (average, complete, and single linkage). The major issues with these techniques are running time and computational complexity.

EM Algorithm
This is a "soft clustering" algorithm where documents belong to each cluster with a certain probability. The sum of these probabilities is 1. This method uses sophisticated statistical modeling techniques and there is "very widely used " (in my experience, it is incredibly slow and is not scalable).

Understanding the Clusters
To enable further analysis of the data, one could, for each cluster, generate a set of discriminative and descriptive keywords. There are several ways to do this - these could be the most frequently occurring words in the cluster, or words with high tf-idf measures. The keywords can be presented to a human who can then generate a label for the documents in the cluster. We could also pick exemplars from each cluster - the prototypes or most representative documents, for example the documents that are most similar to the mean vector of the cluster. This can make exploratory data analysis more efficient since the human expert can now view the summarized properties of a document collection.
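One possible way to combine these ideas, sketched below: take the highest-weight tf-idf terms near each cluster centroid as descriptive keywords, and the member document closest to the centroid as the exemplar. The corpus and names are illustrative.

```python
# Sketch: descriptive keywords (top centroid terms) and an exemplar
# (document closest to the cluster mean) for each k-means cluster.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

docs = ["stocks fell today", "bonds and stocks rallied",
        "the team won the game", "the coach praised the team"]
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(docs)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)
terms = np.array(vectorizer.get_feature_names_out())

for c in range(kmeans.n_clusters):
    centroid = kmeans.cluster_centers_[c]
    keywords = terms[np.argsort(centroid)[::-1][:3]]         # highest-weight terms
    members = np.where(kmeans.labels_ == c)[0]
    sims = cosine_similarity(vectors[members], centroid.reshape(1, -1)).ravel()
    exemplar = docs[members[np.argmax(sims)]]                 # closest to the mean
    print(c, list(keywords), "|", exemplar)
```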

Determining the quality of clusters
We can compute the average similarity between the documents in a cluster - and the variance in similarity. If the variance is high, the cluster is of poor quality. The number of clusters can make a big difference: the larger the number of clusters, the lower the variance - but this also affects the usability of the clusters. Manual evaluation of clustering results is the best solution; since we can read and comprehend documents, we can examine the clusters and evaluate the results on our own.
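A sketch of this check, assuming `vectors` and `labels` come from a clustering like the one sketched earlier: compute the pairwise cosine similarities within each cluster and report their mean and variance.

```python
# Sketch: within-cluster similarity statistics; high variance suggests a
# loosely knit (poor-quality) cluster.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def cluster_cohesion(vectors, labels):
    for c in np.unique(labels):
        members = vectors[labels == c]
        sims = cosine_similarity(members)
        # keep the upper triangle only, excluding the diagonal of self-similarities
        pairwise = sims[np.triu_indices_from(sims, k=1)]
        if pairwise.size:
            print(f"cluster {c}: mean={pairwise.mean():.3f}, var={pairwise.var():.3f}")
```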

References
  1. A unified framework for model-based clustering
  2. An objective evaluation criterion for clustering
  3. A Comparative Study of Generative Models for Document Clustering

Monday, August 29, 2005

Link analysis

In a previous post (Web-based Document Search challenge), I talked about the issue of performing simple similarity based searches for retrieving documents.

One solution as per the book - "Text Mining: Predictive Methods for Analyzing Unstructured Information" - is to perform link analysis. Google uses the PageRank algorithm. The rank of a document is determined by the ranks of the pages that link to it: a document should be ranked highly if it is cited by other highly-ranked documents.

Academic documents can also be ranked based on this kind of citation analysis. If a document is cited by highly-ranked documents, then it should be highly ranked as well.
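A small sketch of the idea by power iteration over a toy link graph; the damping factor of 0.85 and the graph itself are illustrative, not the exact formulation used by Google.

```python
# Sketch: PageRank-style scoring by power iteration over a tiny link graph.
def pagerank(links, damping=0.85, iterations=50):
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / len(pages) for p in pages}
        for page, outgoing in links.items():
            share = damping * rank[page] / max(len(outgoing), 1)
            for target in outgoing:
                new_rank[target] += share  # pass rank along each outgoing link
        rank = new_rank
    return rank

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
print(pagerank(links))  # "c", linked to by the most pages, scores highest
```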

References
  1. The anatomy of a search engine
  2. The PageRank citation ranking: Bringing order to the web
  3. Authoritative Sources in a hyperlinked environment
  4. Citation analysis as a tool in journal evaluation

Web-based Document Search Challenge

The book - "Text Mining: Predictive Methods for Analyzing Unstructured Information" makes an interesting comparison between an Information Retrieval application (such as Web-based search) and Text categorization.

Text categorization/prediction is a classification problem: we're concerned with determining a label (or a set of labels) for a new document based on its similarity to known, labeled documents. In IR, one uses a query (which can be considered a small document) to perform a similarity-based search against documents in a document collection. The difference is that we are interested in ranking the search results rather than categorizing the query. Since the query contains very few words and the number of documents to be searched is massive, a large number of documents will contain all the keywords in the query and will thus have nearly identical similarity scores. The returned documents cannot be ranked based on these nearly identical scores. It becomes the responsibility of the user to refocus the query by adding more words or by substituting more specific words to find the documents they are looking for. This is the challenge of Web-based document search.

The general approach to this problem is to perform document link analysis in a query-independent manner. Each document is given a score by a page rank function; the larger the score, the higher the quality of the page. The documents returned to the user are ranked based on these scores.

tf-idf and Cosine Similarity

tf-idf
In text mining, documents are represented by vectors. The components of the vector are the weights of the words that appear in the document. The weight is computed using tf-idf (term frequency-inverse document frequency):
tf-idf(word) = tf(word) * log2(N / df(word)), where tf(word) is the word's frequency in the document, N is the number of documents in the collection, and df(word) is the number of documents the word appears in. The tf-idf measure can be considered a measure of importance or relevance to the document: it gives extra weight to words that occur frequently in the document and are relatively rare in the collection.
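A small sketch computing these weights directly from the definition above; the toy documents are illustrative.

```python
# Sketch: tf-idf from the definition -- term frequency in the document times
# log2(number of documents / number of documents containing the word).
import math
from collections import Counter

docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
n_docs = len(docs)
doc_freq = Counter(word for doc in docs for word in set(doc))

def tfidf(doc):
    tf = Counter(doc)
    return {w: tf[w] * math.log2(n_docs / doc_freq[w]) for w in tf}

print(tfidf(docs[0]))  # "the" (in every document) gets weight 0
```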

Cosine Similarity
The most popular document similarity measure is cosine similarity - the cosine of the angle between two vectors. If the vectors are orthogonal, the similarity is 0; if they are identical, the similarity is 1. Cosine(V1, V2) = (V1 . V2) / (|V1| * |V2|), where V1 . V2 = v1_1*v2_1 + v1_2*v2_2 + ... + v1_n*v2_n and |V1| = sqrt(v1_1^2 + ... + v1_n^2). The document vectors are normally represented using tf-idf. By dividing by the vector lengths in the denominator, we normalize the frequency information in the tf-idf weights, since documents are of variable length.
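A sketch of the measure over sparse word-weight vectors stored as dicts; the weight values are illustrative.

```python
# Sketch: cosine similarity between two tf-idf vectors held as word->weight dicts.
import math

def cosine(v1, v2):
    dot = sum(v1[w] * v2[w] for w in v1 if w in v2)   # only shared words contribute
    norm1 = math.sqrt(sum(x * x for x in v1.values()))
    norm2 = math.sqrt(sum(x * x for x in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

print(cosine({"cat": 0.58, "sat": 0.58}, {"cat": 0.58, "ran": 1.58}))
```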

Sunday, August 28, 2005

Topic: Using Text for Prediction

This is a summary of a chapter (Using Text Mining for Prediction) from the book - "Text Mining: Predictive Methods for Analyzing Unstructured Information". It provides an overview of the topic and a high level framework for tackling text prediction.

What is prediction?
Using past cases (training set) to project to new examples (test set).

Text prediction is a classification problem
We try to find a function - f: w -> L, where w is a vector of attributes of the data (document, sentence, token, etc.) and L is a label. In statistical terms, we draw a sample from a population, learn a model from it, and then apply the induced model to new, unlabeled examples drawn from the same population. Note that the data might have more than one label.
The classical prediction problem is text categorization. Here a set of categories is predefined and the objective is to assign a category or topic to a new document. The question in this case is: does a new example belong to a particular category?

The pattern assumption
Given a training set, where the documents are labeled, we try to find a pattern of words that is characteristic of each label. We thus use the training set of documents to find patterns relative to a label where the label is the correct answer. This label pattern is the predictor used to determine if an unlabeled document belongs to this label or not.

How much data is needed to recognize a pattern (build a good label predictor)?
Difficult to say by analysis of the sample. You have to look at the results of the predictor to determine if it is good enough or if more data is needed. So it's a learning and feedback loop.

Methods for classification
The following are the general categories of methods recommended for text prediction, since they can handle data that is sparse and high dimensional:
  1. Similarity and Nearest-neighbor methods
  2. Logic methods
  3. Probabilistic methods
  4. Weighted or Linear Scoring methods
Similarity and Nearest-Neighbor Methods
This is an information retrieval problem. Given a document we try to find the best label for it.
A basic algorithm (sketched in code after the implementation issues below) is:
  • Compute the similarity of a document to all documents in a collection
  • Select the k documents that are most similar to a new document
  • The answer is the label that occurs most frequently in the k selected documents
The general implementation issues are:
- what should k be? [it is usually determined by empirical analysis]
- how do you measure similarity? [tf-idf vector encoding with cosine similarity is the most popular measure]
- speed (since sequentially comparing new documents to the stored ones is an inefficient process)
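A minimal sketch of this nearest-neighbor procedure, assuming scikit-learn for the tf-idf encoding and cosine similarity; the toy training set, labels, and k=3 are illustrative.

```python
# Sketch: k-nearest-neighbor labeling with tf-idf vectors and cosine similarity.
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

train_docs = ["stocks fell", "bonds rallied",
              "the team won the game", "the coach praised the players"]
train_labels = ["finance", "finance", "sports", "sports"]

vectorizer = TfidfVectorizer().fit(train_docs)
train_vectors = vectorizer.transform(train_docs)

def knn_label(new_doc, k=3):
    sims = cosine_similarity(vectorizer.transform([new_doc]), train_vectors).ravel()
    top_k = sims.argsort()[::-1][:k]                 # the k most similar documents
    return Counter(train_labels[i] for i in top_k).most_common(1)[0][0]

print(knn_label("the game was won by the team"))     # most frequent label among the k
```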

Comments:
- labels are secondary to the search process
- It's like a search engine problem, since no model or generalized patterns are derived from the data
- It requires little training effort
- These methods require more computation time than most other methods.

Logic Methods/Decision Rules
Given a set of documents that are examples of a label, we capture the pattern present in these documents in a set of rules. When a new unlabeled document is presented, we assign the label depending on whether any of the rules are satisfied.
Rules are composed of words - for example, a phrase that is a conjunction of words. As a result, the results are easy to understand and can be insightful: they suggest reasons for reaching a conclusion, and they are far easier to understand than scores or measures of similarity. However, the rules can be less predictive if the underlying concept is complex. Finding these rules is a complex, time-consuming process compared to other methods. One has to ensure the rules are not overspecialized and are compact. Techniques such as weakest-link pruning are used to find a compact set of rules that contain highly predictive words.
In comparison to nearest-neighbor methods, a lot of time is spent learning. However, it's a worthwhile process since the rules can be informative and insightful.

Probabilistic Methods
Given a class label C and a feature vector x that denotes the presence or absence of words from a dictionary, we try to estimate P(C|x). Naive Bayes is an approach for performing this probability estimation. It makes a simplifying assumption - that the words in a document are independent given the class - and estimates P(C|x) by using P(xj|C), where xj is the presence or absence of a word in the document.
Naive Bayes is efficient and easy to implement but has poorer precision compared to other methods. It requires little computation or memory but requires a relatively small dictionary.
There are two popular Naive Bayes models for text classification - Multivariate Bernoulli Model and the Multinomial Model. The multinomial model is frequently used as it normalizes the length of a document and thus has better performance.
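A sketch of the multinomial model using scikit-learn; the training documents and labels are illustrative.

```python
# Sketch: multinomial Naive Bayes over word counts, estimating P(C|x).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_docs = ["stocks fell", "bonds rallied",
              "the team won the game", "the coach praised the players"]
train_labels = ["finance", "finance", "sports", "sports"]

vectorizer = CountVectorizer().fit(train_docs)
model = MultinomialNB().fit(vectorizer.transform(train_docs), train_labels)

new = vectorizer.transform(["the players won"])
print(model.predict(new), model.predict_proba(new).round(3))  # class and P(C|x)
```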

Weighted or Linear Scoring Methods
For each document D, we calculate Score(D) = w * x + b, where w is a vector of weights assigned to words, x is a binary vector representing the absence or presence of those words, and b is a constant. The general method is to assign a positive score to predict the positive class and a negative score to predict the negative class. Thus a weight is assigned to each word or feature, and this weight tells us how important that feature is.

These techniques work very well on text classification - the best of all the techniques described here. They can learn much faster than the other methods, and they can handle very large dictionaries and extremely sparse data. Unlike Naive Bayes, they do not have problems with synonyms, and they can be used to determine which words are strong or precise predictors.
The math behind determining the weights is quite complex - though the implementations are not. Support Vector Machines is an example of a weighted scoring method.
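A sketch of a linear scoring classifier in scikit-learn (a linear SVM over binary word-presence vectors); the data is illustrative, and the learned weights are printed so the per-word contribution to Score(D) = w * x + b is visible.

```python
# Sketch: weighted/linear scoring -- Score(D) = w . x + b over binary word features;
# the sign of the score picks the class.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

train_docs = ["stocks fell", "bonds rallied",
              "the team won the game", "the coach praised the players"]
train_labels = [0, 0, 1, 1]                                # 0 = finance, 1 = sports

vectorizer = CountVectorizer(binary=True).fit(train_docs)  # x: word present or absent
model = LinearSVC().fit(vectorizer.transform(train_docs), train_labels)

new = vectorizer.transform(["the team played a game"])
print(model.decision_function(new))                        # w . x + b; positive => sports
print(dict(zip(vectorizer.get_feature_names_out(), model.coef_[0].round(2))))
```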

Saturday, August 27, 2005

Website: Statistical NLP/Corpus-based Computational Linguistics

Topic: Full sentence parsing (Text processing)

My notes from the book - "Text Mining: Predictive Methods for Analyzing Unstructured Information".

Sometimes we need to perform a full parse of a sentence. The sentence is converted into a single structure such as a tree or a directed acyclic graph. Each word in the sentence is present in this structure. The structure is used to find the relation of each word in a sentence to all the others and also to find the function of the word in the sentence - is it a subject, object, etc.
There are several kinds of parsers - e.g., context-free parsers - and a number of algorithms for producing such a tree. The Wall Street Journal corpus available from the LDC (see "Corpora for Text Mining") can be used to train them.
It is an expensive process - but sometimes it is needed since it provides information that phrase identification or partial parsing cannot provide. For example, the sentence "Johnson was replaced at IBM by Smith" is problematic to analyze without a full parse: because of the passive structure of the sentence, we may wrongly conclude that Smith was replaced by Johnson.

Topic: Named Entity Recognition (Text Processing)

My notes from the book - "Text Mining: Predictive Methods for Analyzing Unstructured Information"

Named Entity Recognition is a type of phrase finding. It is focused on noun phrase finding - specifically the recognition of particular types of proper noun phrases such as persons, organizations, locations, money, date, times, etc.
It's used for information extraction - e.g., for summarizing documents.
Typically, one performs phrase recognition as a first step. The noun phrases are then further classified into classes such as I-person, B-location, B-person, etc.

Topic: Phrase Recognition (Text Processing)

My notes from the book - "Text Mining: Predictive Methods for Analyzing Unstructured Information".

Phrase recognition is useful for creating a "partial parse" of a sentence and as a step in identifying the "named entities" occurring in a sentence. It is a data pre-processing step that is performed after the tokens in the sentences have been tagged with their parts of speech.
Phrase recognition systems scan a text and mark the beginnings and ends of phrases. Types of phrases are noun phrases, verb phrases and prepositional phrases. One convention is to mark a word inside a phrase with "I-", a word at the beginning of a phrase adjacent to another phrase with "B-", and a word outside any phrase with "O". To the I- and B- tags we then add a code for the phrase type - e.g., I-NP (noun phrase).
This can be considered a classification problem for the tokens of a sentence. There are several corpora available for developing and testing phrase recognition systems. Performance of these systems varies widely by phrase type - but overall it's pretty good.
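A sketch of how phrase spans might be recovered from such tags; the tagged sentence is hand-made for illustration, and the tag scheme follows the I-/B-/O convention described above.

```python
# Sketch: turning per-token chunk tags back into typed phrases.
def extract_phrases(tokens, tags):
    phrases, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        tag_type = tag.split("-", 1)[1] if "-" in tag else None
        # a phrase ends at O, at an explicit B- tag, or when the phrase type changes
        if tag == "O" or tag.startswith("B-") or tag_type != current_type:
            if current:
                phrases.append((current_type, " ".join(current)))
            current, current_type = [], tag_type
        if tag != "O":
            current.append(token)
    if current:
        phrases.append((current_type, " ".join(current)))
    return phrases

tokens = ["Johnson", "was", "replaced", "at", "IBM", "by", "Smith"]
tags   = ["I-NP", "I-VP", "I-VP", "I-PP", "I-NP", "I-PP", "I-NP"]
print(extract_phrases(tokens, tags))
# [('NP', 'Johnson'), ('VP', 'was replaced'), ('PP', 'at'), ('NP', 'IBM'),
#  ('PP', 'by'), ('NP', 'Smith')]
```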

Topic: Word Sense Disambiguation (Text Processing)

My notes from the book - "Text Mining: Predictive Methods for Analyzing Unstructured Information"

This is an optional data pre-processing step. It is performed because English words are ambiguous (as to their meaning or reference) - even when they are tagged with their parts of speech. For example, the word "bore" can reference a hole (the bore is not large enough) or a person (he is a bore).
The WordNet project aims to help disambiguate words: it focuses on word meanings and their inter-relationships. It does not, however, provide an algorithm for selecting a particular meaning for a word in a given context.
There are no algorithms that can completely disambiguate text. This is due to the lack of corpora of disambiguated text required to train machine-learning algorithms.
So - this pre-processing step is generally avoided.

Other References
Word Sense disambiguation: The state of the art.
(the paper sucks - but has great references)

Topic: Part-Of-Speech Tagging (Text Processing)

My notes from the book - "Text Mining: Predictive Methods for Analyzing Unstructured Information".
For information extraction - e.g., extracting names of people, places and organizations - one needs to perform linguistic analysis on the text and extract more sophisticated features. Towards this goal, one performs part-of-speech tagging for each token in the text (after sentence boundaries have been determined).
In any natural language, words are organized into grammatical classes or parts of speech. Almost all languages have categories such as verbs and nouns. The number of categories depends on the language and how the language is analyzed by a linguist.
In English, some analyses report as few as 6 or 7 categories and some as many as a hundred. Examples of categories are nouns, adjectives, adverbs, prepositions and conjunctions. One could look up a token in a dictionary to determine its POS. However, many words are ambiguous and could correspond to several parts of speech. For example, the word "bore" could be a noun, a present-tense verb or a past-tense verb. So machine learning techniques are usually used to perform automatic POS tagging.
For training, one needs a corpus. The Wall Street Journal corpus (available at www.ldc.upenn.edu) is the largest annotated corpus available. It has 36 categories, including "Foreign Word", "Determiner", "Verb base form", etc.
However, training on this corpus will not help in analyzing email messages for example.

References
Maximum Entropy part of speech tagger

Topic: Sentence Boundary Detection (Text Processing)

Information extraction algorithms require sophisticated linguistic parsing and often operate on text a sentence at a time. For example, to identify the parts of speech of each word in a text, the text first has to be segmented into sentences, since the influence of one word on the part of speech of another does not cross sentence boundaries.
Sentence boundary detection is essentially the problem of deciding which characters (such as periods in English) in the text are sentence delimiters and which are not.
This can be treated as a classification problem and in some studies accuracy of 98% has been achieved by using machine learning techniques. If training data is not available, one can use a handcrafted algorithm - which will be tailored to a particular language.
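A sketch of the handcrafted flavor of the approach: treat a period as a sentence delimiter unless it follows a known abbreviation. The abbreviation list and example text are illustrative; a real system would classify each candidate boundary with a trained model.

```python
# Sketch: rule-based sentence boundary detection for English.
import re

ABBREVIATIONS = {"dr", "mr", "mrs", "e.g", "i.e", "etc"}

def split_sentences(text):
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]\s+", text):
        words = text[start:match.start()].split()
        word_before = words[-1].lower().rstrip(".") if words else ""
        if word_before in ABBREVIATIONS:
            continue                      # this period is part of an abbreviation
        sentences.append(text[start:match.end()].strip())
        start = match.end()
    if start < len(text):
        sentences.append(text[start:].strip())
    return sentences

print(split_sentences("Dr. Smith joined IBM. He replaced Johnson. See e.g. the memo."))
```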

References:
A comparison of paradigms for improving MT quality

Thursday, August 25, 2005

Statistical Data Mining Tutorials

Statistical Data Mining Tutorials
Light - but presents a good overview of the math behind the techniques

Corpora for text mining

For research and development of text-mining techniques, there are a number of corpora available.
RCV1 Reuters Corpus
TREC Collections
Linguistic Data Consortium
ICAME (International Computer Archive of Modern and Medieval English)
TEI (Text Encoding Initiative)
Corpus Linguistics Links

Conferences and Competitions

Monday, August 22, 2005

Reference: Information Theory, Inference, and Learning Algorithms

Free online book
Starts math at the ground level.
Course Notes

Paper Review: A Mathematical Theory of Communication

A Mathematical Theory of Communication
Shannon's classic paper on Information Theory.

The following example gives a flavor of the objective:
We mathematically describe the information in bits per second that is produced by an information source. This is the capacity of the information source. We then try to reduce the required capacity by using "statistical knowledge about the source". For example, in telegraphy the messages to be transmitted are sequences of letters - which are not completely random, as they form sentences and have the statistical structure of a language such as English. The letter E, for example, is more frequent than the letter Z, and the sequence TH is more frequent than XP. We harness our understanding of this structure to reduce the required channel capacity (or reduce the time required to send messages). For example, E is encoded as a dot, while infrequent letters such as Q, X and Z are represented by longer sequences of dots and dashes.

Some background:
Information sources are discrete sources that generate messages symbol by symbol. The discrete source chooses successive symbols according to certain probabilities depending on preceding choices as well as the particular symbols in question. These information sources are represented by a stochastic process - e.g., natural languages such as English.

Interesting properties of the entropy function:
  1. H = 0 iff all the probabilities of the possible events p(i) - but one - are zero. When we are certain of an outcome, the uncertainty (H) is 0
  2. For a given n (the number of possible events), H is maximum and equal to log n when all the p(i) are equal. So if n is 2 and p(1) = p(2) = 1/2, then H is at its maximum of log 2 = 1 bit
  3. If there are 2 events x and y, then H(x,y) <= H(x) + H(y). If x and y are independent then H(x,y) = H(x) + H(y)
  4. Given x and y, H(y) >= Hx(y). The uncertainty of y is never increased by knowledge of x. It will be decreased - unless x and y are independent - in which case the uncertainty remains unchanged.
Relative Entropy
The ratio of the entropy of a source to the maximum value it could have while still restricted to the same symbols is called its relative entropy. This is the maximum compression possible when we encode into the same alphabet.
Redundancy
This is one minus the relative entropy. The redundancy of English - not considering statistical structure over distances greater than 8 letters is about 50%. This means when we write English, half of what we write is determined by the structure of the language and half is chosen freely.

Examples
Given 4 symbols - A, B, C and D with probabilities of 1/2, 1/4, 1/8 and 1/8.
H = -(1/2 log 1/2 + 1/4 log 1/4 + 1/8 log 1/8 + 1/8 log 1/8) = 7/4 bits per symbol
The symbols can be encoded as A => 0, B => 10, C => 110 and D => 111. The average number of binary digits used in encoding a sequence of N symbols will be N (1/2 * 1 + 1/4 * 2 + 1/8 * 3 + 1/8 * 3) = (7/4) N.
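A quick check of the arithmetic in this example; both the entropy and the average code length come out to 7/4 bits per symbol, so the encoding is optimal for this source.

```python
# Sketch: verifying the worked example -- source entropy vs. average code length.
import math

probs = {"A": 1/2, "B": 1/4, "C": 1/8, "D": 1/8}
code_lengths = {"A": 1, "B": 2, "C": 3, "D": 3}       # 0, 10, 110, 111

entropy = -sum(p * math.log2(p) for p in probs.values())
avg_length = sum(probs[s] * code_lengths[s] for s in probs)
print(entropy, avg_length)                            # both print 1.75
```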

Information Entropy Notes (Wikipedia)

Notes on Information Entropy from Wikipedia.

Entropy can be defined as the "amount of information carried in a signal", where a "signal" can be a string of characters. Alternatively, entropy is the "amount of randomness in a signal or a random event". This concept was proposed by Shannon in the paper "A Mathematical Theory of Communication".
As per Shannon's definition, entropy is the "minimum channel capacity required to reliably transmit the source as encoded binary digits" (hence the base-2 log of the probability of the random event in the formula). Entropy is the "mathematical expectation of the amount of information carried in a digit from the information source". It is also a "measure of the uncertainty about the 'realization' of a random variable". For a data source, entropy is the average number of bits per symbol needed to encode it.
Examples:
  • Fair coin flip - entropy is 1 bit per flip
  • If a system generates the same output then its entropy is 0
  • entropy of English text is about 1.5 bits per character

Tuesday, August 16, 2005

Paper review: A Simple Introduction to Maximum Entropy Models for Natural Language Processing

A Simple Introduction to Maximum Entropy Models for Natural Language Processing

Overview
Natural language processing problems can be viewed as linguistic classification problems: they involve using linguistic "contexts" to predict linguistic classes. Maximum entropy models are mathematical models that allow us to estimate the probability of a linguistic class occurring within a linguistic context by combining the different pieces of "contextual evidence" available. This paper describes the math behind this technique and provides a simple example.

Interesting points/concepts
  1. Why Maximum entropy?
    • "The ME model is convenient for natural language processing because it allows the unrestricted use of contextual features and combines them in a pricincipled way"
      • pricinciple here refers to the Principle of Maximum Entropy
    • "the same model can be re-used - instead of creating highly customized problem-specific estimation methods"
  2. Goal - find a method of using the "sparse evidence" about the a's and the b's (in the sample data) to reliably estimate a probability model p(a,b)
    • "a" is a class
    • "b" is the context and can be a word or several words and their syntactic lables
    • p(a,b) - the probability of a class "a" occuring within context "b"
  3. Principle of Maximum Entropy
    • "correct distribution of p(a,b) maximizes entropy (or uncertainity) subject to the constraints which represent 'evidence' - facts known to the experimenter"
    • Pick p that maximizes the entropy H(p) = - Sigma(x, A X B) p(x) log p(x) where A is the set of all possible classes and B is the set of possible contexts, and x=(a,b) and a belongs to A and B belongs to B
  4. What is a feature?
    • useful "facts" are encoded as features
      • constraints are then imposed on the values of these feature expectations
    • A feature is a binary-valued function on events: fj: E -> {0,1}
    • "A feature expresses a co-occurence relation between something on the linguistic context and a particular prediction"
    • f(a,b) = 1 if a = DETERMINER and currentword(b) = "that", else 0 (where a is a part-of-speech tag and b contains, among other things, the word to be tagged); a sketch of such a feature function follows the list below
  5. Observed expectation of a feature
    • E.g. - number of times we would expect to see the word "that" with the tag DETERMINER in the training sample, normalized over the number of training samples
  6. Advantages of maximum entropy framework
    • focus on what features to use and not how to use them
    • the weight of a feature is automatically determined by GIS (Generalized Iterative Scaling algorithm)
  7. Generalized Iterative Scaling
    • finds weights for the features
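As a sketch of points 4 and 5 above, here is the DETERMINER/"that" feature written as a function, together with its observed expectation over a tiny hand-made sample; the sample and helper names are illustrative.

```python
# Sketch: a binary maximum-entropy feature and its observed expectation.
def f_det_that(a, b):
    """1 if the class is DETERMINER and the current word in context b is 'that'."""
    return 1 if a == "DETERMINER" and b.get("currentword") == "that" else 0

# toy training sample of (class, context) pairs -- purely illustrative
sample = [
    ("DETERMINER", {"currentword": "that"}),
    ("PRONOUN", {"currentword": "that"}),
    ("DETERMINER", {"currentword": "the"}),
]

observed_expectation = sum(f_det_that(a, b) for a, b in sample) / len(sample)
print(observed_expectation)  # 1/3
```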
Conclusion
The concept is well described in this paper - at an abstract level. A better understanding can obviously be gained by looking at an application of the technique. The pros and cons are not discussed in this paper.

Software: Wordnet