Tuesday, August 16, 2005

Paper review: A Simple Introduction to Maximum Entropy Models for Natural Language Processing


Overview
Natural language processing problems can be viewed as linguistic classification problems: using linguistic "contexts" to predict linguistic classes. Maximum entropy models are mathematical models that let us estimate the probability of a linguistic class occurring within a linguistic context by combining the different pieces of "contextual evidence" available. This paper describes the math behind the technique and provides a simple example.

Interesting points/concepts
  1. Why Maximum entropy?
    • "The ME model is convenient for natural language processing because it allows the unrestricted use of contextual features and combines them in a pricincipled way"
      • pricinciple here refers to the Principle of Maximum Entropy
    • "the same model can be re-used - instead of creating highly customized problem-specific estimation methods"
  2. Goal - find a method of using the "sparse evidence" about the a's and the b's (in the sample data) to reliably estimate a probability model p(a,b)
    • "a" is a class
    • "b" is the context and can be a word or several words and their syntactic lables
    • p(a,b) - the probability of a class "a" occuring within context "b"
  3. Principle of Maximum Entropy
    • "correct distribution of p(a,b) maximizes entropy (or uncertainity) subject to the constraints which represent 'evidence' - facts known to the experimenter"
    • Pick p that maximizes the entropy H(p) = - Sigma(x, A X B) p(x) log p(x) where A is the set of all possible classes and B is the set of possible contexts, and x=(a,b) and a belongs to A and B belongs to B
  4. What is a feature?
    • useful "facts" are encoded as features
      • constraints are then imposed on the values of these feature expectations
    • A feature is a binary-valued function on events: fj: E -> {0,1}
    • "A feature expresses a co-occurence relation between something on the linguistic context and a particular prediction"
    • f(a,b) = 1 if a = DETERMINER and currentword(b) = "that", else 0 (where a is a part of speech tag and b contains among other things the word to be tagged
  5. Observed expectation of a feature
    • E.g. - the number of times the word "that" is seen with the tag DETERMINER in the training sample, normalized by the number of training samples (see the feature sketch after this list)
  6. Advantages of maximum entropy framework
    • focus on what features to use and not how to use them
    • the weight of each feature is determined automatically by the Generalized Iterative Scaling (GIS) algorithm
  7. Generalized Iterative Scaling
    • iteratively finds the weights for the features (a toy sketch of its update follows this list)
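
The entropy in point 3 can be computed directly. Below is a minimal sketch in Python, assuming a toy event space of two part-of-speech tags and two words (the tags, words, and probabilities are my own illustration, not from the paper); it shows that the uniform distribution has the highest entropy of any distribution over the same events.

import math

# Hypothetical toy event space: classes A (POS tags) and contexts B (current words).
A = ["DETERMINER", "PRONOUN"]
B = ["that", "the"]

def entropy(p):
    """H(p) = -sum over x in A x B of p(x) * log p(x), with 0 * log 0 taken as 0."""
    return -sum(p[x] * math.log(p[x]) for x in p if p[x] > 0)

# A uniform distribution over the four (a, b) events...
uniform = {(a, b): 0.25 for a in A for b in B}
# ...versus a skewed one that commits to more than the evidence supports.
skewed = {("DETERMINER", "that"): 0.7, ("PRONOUN", "that"): 0.1,
          ("DETERMINER", "the"): 0.1, ("PRONOUN", "the"): 0.1}

print(entropy(uniform))  # ~1.386 (log 4), the maximum for four events
print(entropy(skewed))   # smaller, since this distribution is less uncertain
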
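Points 4 and 5 translate almost literally into code. The following sketch defines the binary feature from point 4 and computes its observed expectation as in point 5; the four-event training sample is made up for illustration.

# Hypothetical tagged training sample: (tag, context) pairs, where the context
# here is just the current word (a real context would carry more information).
training = [("DETERMINER", "that"), ("PRONOUN", "that"),
            ("DETERMINER", "the"), ("DETERMINER", "that")]

def f_that_det(a, b):
    """Binary feature: fires when the tag is DETERMINER and the current word is 'that'."""
    return 1 if a == "DETERMINER" and b == "that" else 0

# Observed expectation: how often the feature fires in the sample,
# normalized by the number of training events.
observed_expectation = sum(f_that_det(a, b) for a, b in training) / len(training)
print(observed_expectation)  # 0.5 for this toy sample (2 of 4 events)
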
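Point 7 only names GIS, so here is a hedged sketch of the update it performs: each weight moves by (1/C) times the log-ratio of the feature's observed expectation to its current model expectation, where C is the (constant) total feature count per event. The training sample, the single "real" feature, and the correction feature are invented for illustration.

import math

# Toy problem: two events, one real feature plus a correction feature so that
# every event has the same total feature count C = 1, which GIS requires.
events = [("DETERMINER", "that"), ("PRONOUN", "that")]
training = [("DETERMINER", "that"), ("DETERMINER", "that"), ("PRONOUN", "that")]

features = [
    lambda a, b: 1 if a == "DETERMINER" and b == "that" else 0,  # f_1
    lambda a, b: 0 if a == "DETERMINER" and b == "that" else 1,  # correction feature
]
C = 1.0
lambdas = [0.0, 0.0]

def model_probs(weights):
    """p(a, b) proportional to exp(sum_j lambda_j * f_j(a, b))."""
    scores = [math.exp(sum(w * f(a, b) for w, f in zip(weights, features)))
              for a, b in events]
    z = sum(scores)
    return [s / z for s in scores]

# Observed expectation of each feature in the training sample.
obs = [sum(f(a, b) for a, b in training) / len(training) for f in features]

for _ in range(50):
    p = model_probs(lambdas)
    # Model expectation of each feature under the current distribution.
    exp_model = [sum(p_x * f(a, b) for p_x, (a, b) in zip(p, events))
                 for f in features]
    # GIS update: move each weight by (1/C) * log(observed / model expectation).
    lambdas = [w + (1.0 / C) * math.log(o / m)
               for w, o, m in zip(lambdas, obs, exp_model)]

print(model_probs(lambdas))  # approaches [2/3, 1/3], matching the observed counts
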
Conclusion
The concept is well described at an abstract level in this paper. A better understanding can, of course, be gained by looking at an application of the technique. The pros and cons are not discussed in this paper.
