Saturday, August 27, 2005

Topic: Phrase Recognition (Text Processing)

My notes from the book "Text Mining: Predictive Methods for Analyzing Unstructured Information".

Phrase Recognition is useful for creating a "partial parse" of a sentence and as a step in identifying the "Named Entities" occurring in a sentence. It is a data pre-processing step that is performed after the tokens in the sentences have been tagged with their parts of speech.
Phrase Recognition systems scan a text and mark the beginnings and ends of phrases. The phrase types are Noun phrases, Verb phrases and Prepositional phrases. One convention is to mark a word inside a phrase with "I-", a word at the beginning of a phrase adjacent to another phrase with "B-", and a word outside any phrase with "O". To the I- and B- tags we then add a code for the phrase type, e.g. I-NP (Noun phrase).
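To make the tagging convention concrete, here is a minimal sketch in Python that turns phrase boundaries into per-token tags. The function name `spans_to_tags`, the `(start, end, type)` span format and the example sentence are my own assumptions for illustration, not something from the book. It follows the convention as described in these notes: "B-" is used only when a phrase starts right after another phrase. Other variants of the scheme exist, for instance marking every phrase-initial word with "B-".

```python
def spans_to_tags(tokens, spans):
    """Convert phrase spans to one tag per token.

    spans: non-overlapping (start, end, phrase_type) triples, end exclusive.
    This span format is an assumption made for the example.
    """
    tags = ["O"] * len(tokens)  # words outside any phrase
    for start, end, phrase_type in sorted(spans):
        # Use "B-" only if the previous token already belongs to a phrase,
        # i.e. this phrase is adjacent to the one before it.
        prev_in_phrase = start > 0 and tags[start - 1] != "O"
        tags[start] = ("B-" if prev_in_phrase else "I-") + phrase_type
        for i in range(start + 1, end):
            tags[i] = "I-" + phrase_type
    return tags


tokens = ["The", "cat", "sat", "on", "the", "mat", "."]
spans = [(0, 2, "NP"), (2, 3, "VP"), (3, 4, "PP"), (4, 6, "NP")]

for token, tag in zip(tokens, spans_to_tags(tokens, spans)):
    print(token, tag)
```

The first noun phrase starts with I-NP because nothing precedes it, while "sat", "on" and "the" each open a phrase directly after another one and so get a B- tag. The final period is outside any phrase and stays "O".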
This can be treated as a classification problem over the tokens of a sentence: each token is assigned one of the phrase tags. There are several corpora available for developing and testing phrase recognition systems. Performance of these systems varies widely by phrase type, but overall it's pretty good.
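As a rough sketch of the classification view, each token can be turned into a small set of features built from its word, its part-of-speech tag and the tags of its neighbours, which a classifier would then map to a phrase tag. The feature set, the function name `token_features` and the "<START>"/"<END>" padding values below are assumptions picked for illustration; the book does not prescribe them.

```python
def token_features(tagged, i):
    """Build a feature dict for the i-th token of a POS-tagged sentence.

    tagged: list of (word, pos_tag) pairs.
    The chosen features are illustrative only.
    """
    word, pos = tagged[i]
    prev_pos = tagged[i - 1][1] if i > 0 else "<START>"
    next_pos = tagged[i + 1][1] if i < len(tagged) - 1 else "<END>"
    return {
        "word": word.lower(),
        "pos": pos,
        "prev_pos": prev_pos,
        "next_pos": next_pos,
    }


tagged = [("The", "DT"), ("cat", "NN"), ("sat", "VBD")]
features = [token_features(tagged, i) for i in range(len(tagged))]
print(features[1])
```

Pairing each feature dict with the token's phrase tag from an annotated corpus gives the training examples for whatever classifier is used.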
