Wednesday, December 24, 2008

Generalized Bandit Problems 1 (2003)

This paper describes the problem quite well and has a few examples of its application to finance.

Thursday, December 18, 2008

Similarities between Named Entities

The paper: "An Information-Theoretic Definition of Similarity" is an interesting read - especially the Word similarity section. It would interesting to use this concept to perform similarity detection between named entities. That is: P(A|B,D) - given that named entity B has occured in document D, what is the probability of entity A?


Thursday, November 13, 2008

DUC 2006 Task

Multi-document, query focused summarization:
- 50 topics
- 25 relevant docs per topic
- Summary must be 250 words
- Three different sources of news stories (AP, NYT and Xinhua)
- Corpus has a DTD

Automated Evaluation:
- 4 human summaries per topic
- ROUGE-2 and ROUGE-SU4 with stemming and keeping stopwords (jackknifing?) - a minimal ROUGE-2 sketch follows below
- BE (Basic Element) scores between machine and human summaries. Summaries will be parsed with Minipar and BE-F will be extracted; these BEs will be matched using the Head-Modifier criterion.
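To make the automatic metric concrete, here is a minimal sketch of a ROUGE-2-style recall computation (clipped bigram overlap between a system summary and each reference, averaged). It illustrates the core idea only - it is not the official ROUGE toolkit and it skips stemming and jackknifing.

from collections import Counter

def bigrams(tokens):
    return Counter(zip(tokens, tokens[1:]))

def rouge2_recall(system_tokens, reference_token_lists):
    """Average clipped bigram recall of a system summary against each reference summary."""
    sys_bigrams = bigrams(system_tokens)
    scores = []
    for ref_tokens in reference_token_lists:
        ref_bigrams = bigrams(ref_tokens)
        if not ref_bigrams:
            continue
        overlap = sum(min(count, sys_bigrams[bg]) for bg, count in ref_bigrams.items())
        scores.append(overlap / sum(ref_bigrams.values()))
    return sum(scores) / len(scores) if scores else 0.0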

Thursday, November 06, 2008

Unsupervised learning papers - NLP class

A subset of interesting papers from:
1. David Elworthy. Does Baum-Welch Re-estimation Help Taggers? Fourth Conference on Applied Natural Language Processing. 1994.
2. Trond Grenager, Dan Klein and Christopher Manning. Unsupervised Learning of Field Segmentation Models for Information Extraction. Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL-05), 2005.
3. John Miller, Manabu Torii and K. Vijay-Shanker. Building Domain-Specific Taggers without Annotated (Domain) Data. Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL). 2007.
4. Qin Iris Wang and Dale Schuurmans. Improved Estimation for Unsupervised Part of Speech Tagging. In Proc. of the IEEE International Conference on Natural Language Processing and Knowledge Engineering (IEEE NLP-KE). 2005.
5. Silviu Cucerzan and David Yarowsky. Language independent minimally supervised induction of lexical probabilities. Proceedings of ACL-2000, Hong Kong, pages 270-277. 2000.
6. Regina Barzilay and Lillian Lee. Catching the Drift: Probabilistic Content Models, with Applications to Generation and Summarization. In Proc. of HLT-NAACL 2004: Human Language Technology Conference and Meeting of the North American Chapter of the Association for Computational Linguistics. 2004.
7. Noah A. Smith and Jason Eisner. Annealing Techniques For Unsupervised Statistical Language Learning. Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04). 2004.
8. Bo Thiesson, Christopher Meek, and David Heckerman. Accelerating EM for large databases. Machine Learning, 45:279-299, 2001.
9. Sajib Dasgupta and Vincent Ng. Unsupervised Part-of-Speech Acquisition for Resource-Scarce Languages. Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL). 2007.
10. Chris Biemann. Unsupervised Part-of-Speech Tagging Employing Efficient Graph Clustering. Proceedings of the COLING/ACL 2006 Student Research Workshop. 2006.
11. Tetsuji Nakagawa and Yuji Matsumoto. Guessing Parts-of-Speech of Unknown Words Using Global Information. Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics: COLING/ACL 2006.
12. Mark Johnson. Why Doesn't EM Find Good HMM POS-Taggers? Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), 2007.
13. Sharon Goldwater and Tom Griffiths. A fully Bayesian approach to unsupervised part-of-speech tagging. Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, ACL 2007.

NLP Researcher List

Tutorials on NLP Semi-Supervised Learning

Sunday, November 02, 2008

DUC 2006

Motivating application: See DUC 2006. The system task for DUC 2006 will be to model real-world complex question answering, in which an information need cannot be satisfied by simply stating a name, date, quantity, etc. Given a topic (question) and a set of 25 relevant documents, the task is to synthesize a fluent, well-organized 250-word summary of the documents that answers the question(s) in the topic statement. Successful performance on the task benefits from a combination of IR and NLP capabilities, including passage retrieval, compression, and fusion of information.
NIST has test data available here. (You will need to fill out a form to request access to the data.)
Approaches taken by various teams are detailed here.

Friday, October 31, 2008

Biased LexRank: Passage Retrieval Using Random Walks with Question-based priors

See link. Interesting approach for ranking sentences given a query - similar to topic-sensitive PageRank.

They focus on query-based or focused summarization: generate a summary of a set of related documents given a specific aspect of their common topic formulated as a natural language query. In generic summarization the objective is to cover as much salient information in the original documents as possible.

More specifically: given a set of documents about a topic (e.g. "international adoption"), the systems are required to produce a summary that focuses on a given aspect of that topic (such as "what are the laws, problems and issues surrounding international adoption by American families?"). In information retrieval this is referred to as passage retrieval. In question answering, one tries to retrieve a few sentences that are relevant to the question and thus potentially contain the answer; in summarization, however, we look for longer answers that are several sentences long.

Their approach: a single parameter determines how much of the ranking is query-independent versus query-biased. It is thus semi-supervised. They do not use any particular linguistic resources. They consider inter-sentence similarities in addition to the similarity of each candidate sentence to the query.

Their sentence-ranking technique thus combines query relevance and inter-sentence similarity.
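A rough sketch of the biased random-walk idea, in the spirit of the paper but simplified: each sentence's stationary score mixes a query-relevance prior with votes received from similar sentences, computed by power iteration. The similarity matrix and relevance vector are assumed given (e.g. cosine similarity and a query-likelihood score); the function name and parameters are my own.

import numpy as np

def biased_lexrank(sim, rel, d=0.85, iters=100, tol=1e-6):
    """sim: (n, n) nonnegative matrix of inter-sentence similarities.
    rel: length-n vector of sentence-to-query relevance scores.
    d: how strongly the walk is biased toward the query prior.
    Returns a length-n vector of sentence scores."""
    n = sim.shape[0]
    # Row-normalize similarities into a transition matrix.
    W = sim / np.maximum(sim.sum(axis=1, keepdims=True), 1e-12)
    # Normalize relevance scores into a prior distribution over sentences.
    prior = rel / max(rel.sum(), 1e-12)
    p = np.full(n, 1.0 / n)
    for _ in range(iters):
        p_new = d * prior + (1 - d) * (W.T @ p)
        if np.abs(p_new - p).sum() < tol:
            return p_new
        p = p_new
    return p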

NL Researcher: Ani Nenkova

See link.

Possible extensions for Query-based single document summarization

Given a single document summarizer as detailed in this paper, the possible extensions are:
(1) Use a Maximal Marginal Relevance (MMR) scoring scheme to ensure a candidate sentence does not overlap too much with the already extracted sentences. This scoring function would be used for content selection (see the MMR sketch below).
This scoring approach makes sense for a multi-document scenario - I'm not too sure how it helps in a single-document scenario. See: "Multi-document summarization by sentence extraction"
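A minimal sketch of greedy MMR selection, assuming precomputed query and pairwise similarities; the names and the default trade-off parameter are illustrative only.

def mmr_select(candidates, query_sim, pairwise_sim, k, lam=0.7):
    """Greedy Maximal Marginal Relevance selection.
    candidates: list of sentence ids.
    query_sim[s]: similarity of sentence s to the query.
    pairwise_sim[(s, t)]: similarity between sentences s and t (keyed in both orders).
    Returns up to k sentence ids balancing relevance and novelty."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def mmr_score(s):
            # Penalize similarity to anything already selected.
            redundancy = max((pairwise_sim[(s, t)] for t in selected), default=0.0)
            return lam * query_sim[s] - (1 - lam) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected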

Tuesday, October 21, 2008

Semi-Supervised Conditional Random Fields for Improved Sequence Segmentation and Labeling

See this link.
They note that CRFs have proved particularly useful for sequence segmentation and labeling tasks since, as conditional models of the labels given the inputs, they relax the independence assumptions made by traditional generative models like hidden Markov models. In the paper, they propose a new semi-supervised training method for CRFs that incorporates both labeled and unlabeled sequence data to estimate a discriminative structured predictor.

Name Tagging with Word Clusters and Discriminative Training

See this link.
They generate features (word clusters) from unlabeled data and use them in a discriminatively trained tagging model. Active learning is used to select training examples. Evaluation is performed on named entity recognition.
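As a rough illustration of what "features from unlabeled data" can look like, here is a sketch of cluster-bitstring prefix features in the style of Brown-clustering-based taggers; the cluster map, prefix lengths and feature names are made up for illustration.

def cluster_features(word, cluster_bits, prefix_lengths=(4, 6, 10, 20)):
    """cluster_bits: dict mapping word -> bit-string path in a cluster hierarchy
    induced from unlabeled text (e.g. by Brown clustering).
    Returns string features usable in a discriminative tagger."""
    features = []
    bits = cluster_bits.get(word)
    if bits is not None:
        for n in prefix_lengths:
            features.append(f"cluster_prefix_{n}={bits[:n]}")
    else:
        features.append("cluster=UNK")
    return features

# Example: cluster_features("Paris", {"Paris": "110100101"})
# -> ['cluster_prefix_4=1101', 'cluster_prefix_6=110100', ...]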

Semi-supervised learning for Natural Language

See link.
"In the spirit of (Miller et al., 2004), our basic strategy for taking advantage of
unlabeled data is to fi rst derive features from unlabeled data|in our case, word
clustering or mutual information features|and then use these features in a supervised
learning algorithm. (Miller et al., 2004) achieved signi cant performance gains in
named-entity recognition by using word clustering features and active learning. In
this thesis, we show that another type of unlabeled data feature based on mutual
information can also signi cantly improve performance."
"(Shi and Sarkar, 2005) takes a similar approach for the problem of extracting
course names from web pages. They rst solve the easier problem of identifying
course numbers on web pages and then use features based on course numbers to solve
the original problem of identifying course names. Using EM, they show that adding
those features leads to signi cant improvements."
The results were not that great.
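As a rough illustration of the mutual-information features mentioned in the quote, here is a sketch that computes pointwise mutual information from word/context co-occurrence counts gathered over unlabeled text; the representation and names are my own.

import math
from collections import Counter

def pmi_features(cooc_counts):
    """cooc_counts: Counter of (word, context_word) co-occurrence counts from unlabeled text.
    Returns a dict mapping (word, context_word) -> pointwise mutual information."""
    total = sum(cooc_counts.values())
    word_marginals = Counter()
    context_marginals = Counter()
    for (w, c), n in cooc_counts.items():
        word_marginals[w] += n
        context_marginals[c] += n
    pmi = {}
    for (w, c), n in cooc_counts.items():
        p_wc = n / total
        p_w = word_marginals[w] / total
        p_c = context_marginals[c] / total
        pmi[(w, c)] = math.log(p_wc / (p_w * p_c))
    return pmi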


Intimate Learning: A Novel Approach for Combining Labelled and Unlabelled Data

This paper describes a bootstrapping method closely related to co-training and scoped learning, applied to a Web information extraction task: learning course names from web pages. The authors use very few labelled items as seed data (10 web pages) combined with an unlabelled set (174 web pages). Overall performance improved from a precision/recall of 3.11%/0.31% for a baseline EM-based method to 44.7%/44.1% for intimate learning. They used the WebKB dataset.

In co-training there are two views of the same data and one class; in their approach there is one view, but the data is labeled into two classes (the target class and the intimate class).


Link: Researcher SSL NLP

Link to the NLP Lab at Simon Fraser University.
They extended Abney's analysis of Yarowsky's algorithm.
In other work, they used the idea behind the Yarowsky algorithm for semi-supervised learning to boost the quality of the best existing Machine Translation systems.
They are currently investigating new semi-supervised learning techniques for hidden Markov models and probabilistic context-free grammars, probabilistic models used extensively to model and solve tasks in many fields.
