Hamotzi's Data Mining Log: Topic: Sentence Boundary Detection (Text Processing)

Information extraction algorithms require sophisticated linguistic parsing, and they often operate on text one sentence at a time. For example, to identify the part of speech of each word in a text, the text first has to be segmented into sentences, since the influence of one word on the part of speech of another does not cross sentence boundaries.
Sentence boundary detection is essentially the problem of deciding which characters in the text (such as periods in English) are sentence delimiters and which are not.
This can be treated as a classification problem, and some studies have achieved 98% accuracy using machine learning techniques. If training data is not available, one can use a handcrafted algorithm, which will be tailored to a particular language.
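To make the handcrafted option concrete, here is a minimal sketch of a rule-based splitter for English. It is not taken from any of the studies mentioned; the function name, the abbreviation list, and the rules are all assumptions chosen for illustration. It treats '.', '!' and '?' as candidate delimiters only when followed by whitespace, and rejects a period if the word before it is a known abbreviation.

```python
import re

# Hypothetical, deliberately tiny abbreviation list (stored without the
# trailing period). A usable system would need a far larger one.
ABBREVIATIONS = {"dr", "mr", "mrs", "ms", "prof", "e.g", "i.e", "etc", "vs"}


def split_sentences(text):
    """Split English text into sentences using simple handcrafted rules."""
    sentences = []
    start = 0
    # Candidate delimiters: '.', '!' or '?' immediately followed by whitespace.
    # This skips periods inside numbers such as "3.50".
    for match in re.finditer(r"[.!?](?=\s)", text):
        if match.group() == ".":
            # Look at the word just before the period.
            words = text[start:match.start()].split()
            prev = words[-1].lstrip("(\"'").lower() if words else ""
            if prev in ABBREVIATIONS:
                # Classified as "not a sentence delimiter".
                continue
        sentences.append(text[start:match.end()].strip())
        start = match.end()
    # Whatever is left after the last accepted delimiter is the final sentence.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


if __name__ == "__main__":
    sample = "Dr. Smith paid $3.50 for coffee. He was late! Was anyone surprised?"
    for sentence in split_sentences(sample):
        print(sentence)
```

The rules are specific to English punctuation and abbreviations, which is the sense in which such an algorithm is tailored to one language. Cases the list does not cover (for example "p.m." in the middle of a sentence) will be split incorrectly, which is where a trained classifier tends to do better.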

References:
A comparison of paradigms for improving MT quality


Saturday, August 27, 2005
