Saturday, August 27, 2005

Topic: Part-Of-Speech Tagging (Text Processing)

My notes from the book - "Text Mining: Predictive Methods for Analyzing Unstructured Information".
For Information Extraction (e.g., extracting the names of people, places and organizations), one needs to perform linguistic analysis on the text and extract more sophisticated features. Toward this goal, one performs Part-of-Speech (POS) tagging on each token in the text, after sentence boundaries have been determined.
In any natural language, words are organized into grammatical classes or parts of speech. Almost all languages have categories such as verbs and nouns. The number of categories depends on the language and how the language is analyzed by a linguist.
In English, some analyses report as few as 6 or 7 categories and others as many as a hundred. Examples of categories are nouns, adjectives, adverbs, prepositions and conjunctions. One could look up a token in a dictionary to determine its POS. However, many words are ambiguous and could correspond to several parts of speech. For example, the word "bore" could be a noun, a present tense verb or a past tense verb. Resolving such cases requires looking at the surrounding context, so machine learning techniques are usually used to perform automatic POS tagging.
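To see why a plain dictionary lookup falls short, here is a minimal sketch in Python. The lexicon, the tag names and the helper functions are all made up for illustration; they are not from the book or from any real dictionary.

```python
# Toy lexicon: each word maps to the set of parts of speech it can take.
# Entries and tag names are hand-written assumptions for illustration only.
LEXICON = {
    "the": {"DET"},
    "a": {"DET"},
    "was": {"VERB_PAST"},
    "meeting": {"NOUN", "VERB"},
    "bore": {"NOUN", "VERB_PRESENT", "VERB_PAST"},
}


def lookup_tags(token):
    """Return the sorted candidate tags for a token (empty list if unknown)."""
    return sorted(LEXICON.get(token.lower(), set()))


def ambiguous_tokens(tokens):
    """Return the tokens for which the lexicon lists more than one tag."""
    return [t for t in tokens if len(lookup_tags(t)) > 1]


tokens = ["The", "meeting", "was", "a", "bore"]
for tok in tokens:
    print(tok, lookup_tags(tok))

print("Ambiguous:", ambiguous_tokens(tokens))
```

The lookup returns several candidates for "bore" and has no way of choosing between them, which is the gap a trained tagger is meant to fill.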
For training, one needs an annotated corpus. The Wall Street Journal Corpus (available at www.ldc.upenn.edu) is the largest annotated corpus available. It has 36 categories, such as "Foreign Word", "Determiner" and "Verb base form".
However, training on this corpus will not help much in analyzing email messages, for example, since their language differs from that of newspaper text.
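As a rough illustration of what "training on a corpus" means, here is a sketch of about the simplest learned tagger there is: a most-frequent-tag baseline. This is not the maximum entropy approach and not necessarily what the book describes. The three training sentences are invented, the tags only imitate the style of the Wall Street Journal tag set, and "UNK" is a placeholder I made up for unseen words.

```python
from collections import Counter, defaultdict

# Tiny hand-made tagged corpus (hypothetical data, not from a real corpus).
TRAIN = [
    [("the", "DT"), ("lecture", "NN"), ("was", "VBD"), ("a", "DT"), ("bore", "NN")],
    [("she", "PRP"), ("bore", "VBD"), ("the", "DT"), ("cost", "NN")],
    [("he", "PRP"), ("bore", "VBD"), ("the", "DT"), ("news", "NN")],
]


def train(sentences):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tag_counts.most_common(1)[0][0]
            for word, tag_counts in counts.items()}


def tag(model, tokens, unknown_tag="UNK"):
    """Tag each token with its most frequent training tag, ignoring context."""
    return [(tok, model.get(tok.lower(), unknown_tag)) for tok in tokens]


model = train(TRAIN)

# Both occurrences of "bore" get the same tag because context is ignored.
print(tag(model, ["The", "bore", "was", "a", "bore"]))

# Words never seen in training (here, email-style slang) fall back to UNK.
print(tag(model, ["lol", "thx", "the", "news"]))
```

The second call hints at the email problem: words that never appear in the training text cannot be tagged sensibly. Real taggers also use the neighbouring words and tags as features, which this baseline leaves out.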

