Thursday, November 13, 2008

DUC 2006 Task

Multi-document, query focused summarization:
- 50 topics
- 25 relevant docs per topic
- Summaries limited to 250 words
- Three different sources of news stories (AP, NYT and Xinhua)
- Corpus has a DTD

Automated Evaluation:
- 4 human summaries per topic
- ROUGE-2 and ROUGE-SU4, with stemming and stopwords kept (jackknifing?)
- BE (Basic Element) scores between automatic (peer) and human (model) summaries. Summaries will be parsed with Minipar and BE-F will be extracted. These BEs will be matched using the Head-Modifier criterion.
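The ROUGE scoring with jackknifing can be sketched roughly as follows. This is only an illustrative toy, not the official ROUGE-1.5.5 script: it assumes tokens have already been lowercased (and stemmed, if desired), and it shows plain ROUGE-2 recall plus the leave-one-out jackknifing that makes human summaries (scored against the other references) comparable to automatic peers.

```python
from collections import Counter

def bigrams(tokens):
    """All adjacent token pairs in a summary."""
    return list(zip(tokens, tokens[1:]))

def rouge2_recall(peer, reference):
    """ROUGE-2 recall: fraction of reference bigrams also found in the peer
    (with clipped counts, so a repeated peer bigram cannot over-match)."""
    ref = Counter(bigrams(reference))
    cand = Counter(bigrams(peer))
    overlap = sum(min(count, ref[g]) for g, count in cand.items())
    total = sum(ref.values())
    return overlap / total if total else 0.0

def rouge2_jackknife(peer, references):
    """Jackknifed ROUGE-2: with K references, score the peer against each
    leave-one-out subset of size K-1 (taking the best match in the subset)
    and average the K scores."""
    k = len(references)
    scores = []
    for i in range(k):
        subset = references[:i] + references[i + 1:]
        scores.append(max(rouge2_recall(peer, r) for r in subset))
    return sum(scores) / k
```

ROUGE-SU4 would follow the same shape but count unigrams plus skip-bigrams with a gap of up to four tokens instead of contiguous bigrams.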

Thursday, November 06, 2008

Unsupervised learning papers - NLP class

A subset of interesting papers from:
1. David Elworthy. Does Baum-Welch Re-estimation Help Taggers? Fourth Conference on Applied Natural Language Processing. 1994.
2. Trond Grenager, Dan Klein and Christopher Manning. Unsupervised Learning of Field Segmentation Models for Information Extraction. Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL-05), 2005.
3. John Miller, Manabu Torii and K. Vijay-Shanker. Building Domain-Specific Taggers without Annotated (Domain) Data. Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL). 2007.
4. Qin Iris Wang and Dale Schuurmans. Improved Estimation for Unsupervised Part of Speech Tagging. In Proc. of the IEEE International Conference on Natural Language Processing and Knowledge Engineering (IEEE NLP-KE). 2005.
5. Silviu Cucerzan and David Yarowsky. Language independent minimally supervised induction of lexical probabilities. Proceedings of ACL-2000, Hong Kong, pages 270-277. 2000.
6. Regina Barzilay and Lillian Lee. Catching the Drift: Probabilistic Content Models, with Applications to Generation and Summarization. In Proc. of HLT-NAACL 2004: Human Language Technology Conference and Meeting of the North American Chapter of the Association for Computational Linguistics. 2004.
7. Noah A. Smith and Jason Eisner. Annealing Techniques For Unsupervised Statistical Language Learning. Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04). 2004.
8. Bo Thiesson, Christopher Meek, and David Heckerman. Accelerating EM for large databases. Machine Learning, 45:279-299, 2001.
9. Sajib Dasgupta and Vincent Ng. Unsupervised Part-of-Speech Acquisition for Resource-Scarce Languages. Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL). 2007.
10. Chris Biemann. Unsupervised Part-of-Speech Tagging Employing Efficient Graph Clustering. Proceedings of the COLING/ACL 2006 Student Research Workshop. 2006.
11. Tetsuji Nakagawa and Yuji Matsumoto. Guessing Parts-of-Speech of Unknown Words Using Global Information. Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics: COLING/ACL 2006.
12. Mark Johnson. Why Doesn't EM Find Good HMM POS-Taggers? Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), 2007.
13. Sharon Goldwater and Tom Griffiths. A fully Bayesian approach to unsupervised part-of-speech tagging. Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, ACL 2007.

NLP Researcher List

Tutorials on NLP Semi-Supervised Learning

Sunday, November 02, 2008

DUC 2006

Motivating application: See DUC 2006. The system task for DUC 2006 will be to model real-world complex question answering, in which an information need cannot be satisfied by simply stating a name, date, quantity, etc. Given a topic (question) and a set of 25 relevant documents, the task is to synthesize a fluent, well-organized 250-word summary of the documents that answers the question(s) in the topic statement. Successful performance on the task benefits from a combination of IR and NLP capabilities, including passage retrieval, compression, and fusion of information.
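A minimal query-focused extractive baseline for this task might look like the sketch below: rank candidate sentences by content-word overlap with the topic statement, then greedily take sentences until the 250-word budget is spent. The stopword list and tokenizer are illustrative stand-ins, and real DUC systems of course go well beyond this (compression, fusion, redundancy removal).

```python
import re

# Illustrative stopword list, not a standard resource.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "what", "how", "on"}

def tokens(text):
    """Crude tokenizer: lowercase alphabetic runs."""
    return re.findall(r"[a-z]+", text.lower())

def content_words(text):
    return {w for w in tokens(text) if w not in STOPWORDS}

def summarize(topic, sentences, budget=250):
    """Greedy query-focused extraction: score each sentence by its
    content-word overlap with the topic, then add sentences in score
    order as long as they fit within the word budget."""
    query = content_words(topic)
    ranked = sorted(sentences,
                    key=lambda s: len(content_words(s) & query),
                    reverse=True)
    picked, used = [], 0
    for s in ranked:
        n = len(tokens(s))
        if used + n <= budget:
            picked.append(s)
            used += n
    return " ".join(picked)
```

Even this trivial baseline exercises the passage-retrieval half of the task; the NLP half (fluency, organization, fusion) is what separates real systems from it.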
NIST has test data available here. (You will need to fill out a form to request access to the data.)
Approaches taken by various teams are detailed here.