Friday, October 31, 2008

Biased LexRank: Passage Retrieval Using Random Walks with Question-based priors

See link. Interesting approach for ranking sentences given a query - like topic-sensitive PageRank.

They focus on query-based or focused summarization: generate a summary of a set of related documents given a specific aspect of their common topic formulated as a natural language query. In generic summarization the objective is to cover as much salient information in the original documents as possible.

More specifically: given a set of documents about a topic (e.g. "international adoption"), the systems are required to produce a summary that focuses on the given aspect of that topic (such as "what are the laws, problems and issues surrounding international adoption by American families?"). This is referred to as passage retrieval in information retrieval. In question answering, one tries to retrieve a few sentences that are relevant to the question and thus potentially contain the answer. In summarization, however, we look for longer answers that are several sentences long.

Their approach: a single parameter determines how much of the ranking should be query-independent versus query-biased. It is thus semi-supervised. They do not use any particular linguistic resources. They consider intra-sentence similarities in addition to the similarity of the candidate sentences to the query.

Their sentence-ranking technique thus combines query relevance and intra-sentence similarity.
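As I understand it, the core of the method is a power iteration over the sentence-similarity graph in which the random jump is biased toward query-relevant sentences instead of being uniform. A minimal sketch (the variable names and the single interpolation parameter d are my own reading, not the paper's notation):

```python
import numpy as np

def biased_lexrank(sim, rel, d=0.85, tol=1e-8, max_iter=1000):
    """Rank sentences with a random walk biased by query relevance.

    sim : (n, n) symmetric intra-sentence similarity matrix
    rel : (n,) relevance of each sentence to the query
    d   : weight of the query-based prior vs. the query-independent walk
    """
    n = len(rel)
    prior = rel / rel.sum()                      # query-based jump distribution
    W = sim / sim.sum(axis=1, keepdims=True)     # row-stochastic transition matrix
    p = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        p_new = d * prior + (1 - d) * (W.T @ p)  # biased random-walk step
        if np.abs(p_new - p).sum() < tol:
            break
        p = p_new
    return p_new
```

Setting d near 1 ranks almost purely by query relevance; d near 0 recovers plain, query-independent LexRank.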

NL Researcher: Ani Nenkova

See link.

Possible extensions for Query-based single document summarization

Given a single document summarizer as detailed in this paper, the possible extensions are:
(1) Use a Maximal Marginal Relevance (MMR) scoring function to ensure a sentence does not overlap too much with the already extracted sentences. This scoring function would be used for content selection.
This scoring approach makes sense in a multi-document scenario; I'm not too sure how it helps in a single-document scenario. See: "Multi-document summarization by sentence extraction"
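For reference, a minimal sketch of greedy MMR selection (the lambda trade-off parameter and the similarity inputs are placeholders, not taken from any of the papers above):

```python
def mmr_select(query_sim, sent_sim, k=3, lam=0.7):
    """Greedy Maximal Marginal Relevance content selection.

    query_sim : list, similarity of each sentence to the query
    sent_sim  : nested list, pairwise sentence similarities
    lam       : trade-off between query relevance and redundancy
    """
    selected = []
    candidates = set(range(len(query_sim)))
    while candidates and len(selected) < k:
        def mmr(i):
            # penalize overlap with sentences already in the summary
            redundancy = max((sent_sim[i][j] for j in selected), default=0.0)
            return lam * query_sim[i] - (1 - lam) * redundancy
        best = max(candidates, key=mmr)
        selected.append(best)
        candidates.remove(best)
    return selected
```

The redundancy term is what suppresses near-duplicate sentences, which is exactly the multi-document concern raised above.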

Tuesday, October 21, 2008

Semi-Supervised Conditional Random Fields for Improved Sequence Segmentation and Labeling

See this link.
They note that CRFs have proved particularly useful for sequence segmentation and labeling tasks since, as conditional models of the labels given the inputs, they relax the independence assumptions made by traditional generative models like hidden Markov models. In this paper, they propose a new semi-supervised training method for conditional random fields (CRFs) that incorporates both labeled and unlabeled sequence data to estimate a discriminative structured predictor.

Name Tagging with Word Clusters and Discriminative Training

See this link.
They generate features from unlabeled data and use them in a discriminatively trained tagging model. They use active learning to select training examples. Evaluation is performed on named-entity recognition.

Semi-supervised learning for Natural Language

See link.
"In the spirit of (Miller et al., 2004), our basic strategy for taking advantage of unlabeled data is to first derive features from unlabeled data (in our case, word clustering or mutual information features) and then use these features in a supervised learning algorithm. (Miller et al., 2004) achieved significant performance gains in named-entity recognition by using word clustering features and active learning. In this thesis, we show that another type of unlabeled data feature based on mutual information can also significantly improve performance."
"(Shi and Sarkar, 2005) takes a similar approach for the problem of extracting course names from web pages. They first solve the easier problem of identifying course numbers on web pages and then use features based on course numbers to solve the original problem of identifying course names. Using EM, they show that adding those features leads to significant improvements."
The results were not that great.


Intimate Learning: A Novel Approach for Combining Labelled and Unlabelled Data

This paper describes a bootstrapping method closely related to co-training and scoped learning, applied to a Web information extraction task: learning course names from web pages. They use very few labelled items as seed data (10 web pages) combined with an unlabelled set (174 web pages). Overall, precision/recall improved from 3.11%/0.31% for a baseline EM-based method to 44.7%/44.1% for intimate learning. They used the WebKB dataset.

In co-training there are two views of the same data and one class; in their approach there is one view, but the data is labeled into two classes (the target class and the intimate class).


Link: Researcher SSL NLP

Link to the NLP lab at Simon Fraser University.
They extended Abney's analysis of Yarowsky's algorithm.
In other work, they used the idea behind Yarowsky's algorithm to do semi-supervised learning that boosts the quality of the best existing machine translation systems.
They are currently investigating new semi-supervised learning techniques for hidden Markov models and probabilistic context-free grammars, probabilistic models used extensively to model and solve tasks in many fields.


Monday, October 20, 2008

Datasets for relational learning

See link. This was used by Pedro D. Each dataset has the set of papers he used.


Extracting Semantic Networks from Text via Relational Clustering

See link.
SNE is a scalable, unsupervised, and domain-independent system that simultaneously extracts high-level relations and concepts and learns a semantic network from text.
It first uses TextRunner to extract ground facts as triples from text, and then extracts knowledge from the triples.
It does so with a probabilistic model that clusters objects by the objects that they are related to, and that clusters relations by the objects they relate. This allows information to propagate between clusters of relations and clusters of objects as they are created. Each cluster represents a high-level relation or concept. A concept cluster can be viewed as a node in a graph, and a relation cluster can be viewed as links between the concept clusters that it relates. Together the concept clusters and relation clusters define a simple semantic network.

Wrapper methods

I found this presentation that describes wrapper methods for feature selection. He describes two heuristics: forward selection and backward selection.
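A minimal sketch of the forward-selection heuristic, assuming a black-box evaluate function that trains and scores a model on a candidate feature subset (that black box is the "wrapper" part):

```python
def forward_selection(features, evaluate):
    """Greedily add the feature that most improves the wrapped model's
    score; stop when no remaining feature helps."""
    selected, best_score = [], float("-inf")
    remaining = list(features)
    while remaining:
        # score every one-feature extension of the current subset
        scored = [(evaluate(selected + [f]), f) for f in remaining]
        score, f = max(scored)
        if score <= best_score:      # no feature improves the model: stop
            break
        selected.append(f)
        remaining.remove(f)
        best_score = score
    return selected
```

Backward selection is the mirror image: start from all features and greedily drop the one whose removal hurts the score least.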

Link: Introduction to SSL

See link.
The presentation focuses on semi-supervised classification algorithms; for each algorithm he describes the assumptions, details, and pros/cons. This is a good presentation.

Some highlights from the presentation:
Basic objective of SSL: How does one use unlabeled data to improve classification?
Approach: Use labeled and unlabeled data to build better learners (contrast with supervised and unsupervised approaches).
Note that it is not always the case that unlabeled data will help.

Types of semi-supervised learning algorithms:
(1) Self-training:
Self-training is easy and widely used. However, early mistakes can reinforce themselves, and there is no way to predict convergence.

(2) Generative models: Assume one has the full generative model p(X, Y | θ). Marginalize over the labels of the unlabeled instances to estimate the parameters θ, then use MLE/MAP/Bayesian techniques. Examples: mixture of Gaussians, mixture of multinomials (naive Bayes), and HMMs (EM, Baum-Welch). This relies heavily on EM. If the model is correct, it can be very effective and is a nice probabilistic framework, but unlabeled data might hurt if the generative model is wrong. One may need heuristics to reduce the impact of unlabeled data.

(3) Cluster and label approach: Use cluster labels to label unlabeled data.

(4) Semi-supervised SVMs: Also known as Transductive SVMs. Maximize unlabeled data margins. Makes an assumption that unlabeled data from different classes are separated by a margin.

(5) Graph-based algorithms: Labels propagate through similar unlabeled instances. A graph is given over the labeled and unlabeled data; instances connected by a heavy edge tend to have the same label. Examples: mincut, harmonic functions, local and global consistency, and manifold regularization. These can be extended to directed graphs. Performance is good if the graph is good, and bad otherwise.

(6) Multiview algorithms: Split the features of an instance, train a classifier on each split, and have the classifiers teach each other. This assumes the feature splits are conditionally independent given the class. It is less sensitive to mistakes than self-training; models using BOTH feature splits should do better.

Final analysis: (more in the presentation):
Use the right model for the job
no pain, no gain
no model assumption, no gain
wrong model assumption, no gain, a lot of pain
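To make item (2) concrete, here is a toy sketch of semi-supervised EM for a two-component 1-D Gaussian mixture (unit variance, equal priors; the setup is my own simplification, not from the presentation). Labeled points keep hard responsibilities; unlabeled points get soft responsibilities re-estimated in each E-step.

```python
import math

def semi_supervised_em(labeled, unlabeled, iters=100):
    """labeled: list of (x, y) with y in {0, 1}; unlabeled: list of x."""
    # initialize the two means from the labeled data alone
    mu = [sum(x for x, y in labeled if y == k) /
          sum(1 for _, y in labeled if y == k) for k in (0, 1)]
    for _ in range(iters):
        # E-step: soft responsibilities for the unlabeled points
        resp = []
        for x in unlabeled:
            p = [math.exp(-0.5 * (x - mu[k]) ** 2) for k in (0, 1)]
            z = p[0] + p[1]
            resp.append((p[0] / z, p[1] / z))
        # M-step: means from labeled (hard) plus unlabeled (soft) assignments
        for k in (0, 1):
            num = (sum(x for x, y in labeled if y == k) +
                   sum(r[k] * x for r, x in zip(resp, unlabeled)))
            den = (sum(1 for _, y in labeled if y == k) +
                   sum(r[k] for r in resp))
            mu[k] = num / den
    return mu
```

With one labeled point per class and well-separated unlabeled clusters, the unlabeled data pulls each mean toward its cluster; if the mixture assumption is wrong, the same update can pull the means to the wrong place, which is exactly the "wrong model assumption" caveat.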
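The graph-based family in item (5) can be illustrated with iterative label propagation in the harmonic-function style: labeled nodes are clamped, and every unlabeled node repeatedly takes the weighted average of its neighbors (a minimal version of my own; real implementations typically solve the linear system directly).

```python
import numpy as np

def propagate_labels(W, labels, iters=200):
    """Iterative label propagation on a graph.

    W      : (n, n) symmetric edge-weight matrix
    labels : length-n list, 0/1 for labeled nodes, None for unlabeled
    """
    f = np.array([l if l is not None else 0.5 for l in labels], dtype=float)
    P = W / W.sum(axis=1, keepdims=True)   # row-normalized transition matrix
    for _ in range(iters):
        f = P @ f                          # each node averages its neighbors
        for i, l in enumerate(labels):     # clamp the labeled nodes
            if l is not None:
                f[i] = l
    return f
```

On a chain 0-1-2-3 with node 0 labeled 0 and node 3 labeled 1, the interior scores converge to 1/3 and 2/3, i.e. a linear interpolation between the clamped endpoints.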


Tuesday, October 14, 2008

Semisupervised Learning for Computational Linguistics

The Semisupervised Learning for Computational Linguistics book looks interesting. The math has been toned down to help build intuition.
See the review here.


What is parser adaptation?

As per this paper: leverage labeled data from one domain to create a parser capable of parsing a different domain. Their approach is simple: train the parser on the WSJ and then parse the other domain. The parser generates n-best parses, which are ranked using a re-ranker; the top-ranked parse is used for self-training. Re-ranking and adaptive learning are open issues.


What is "co-training"?

As per this paper: co-training, like self-training, learns from its own predictions. However, co-training learns two separate models (which are assumed to be independent, typically by training on disjoint feature sets). These models are applied to unlabeled data; examples on which the two models agree are treated as labeled data for a new round of training. One could incorporate model confidence to include only examples both models are confident of.
The definitive paper is by Blum and Mitchell.
Also see this paper by Collins and Singer.
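A toy sketch of that loop (the train/view interfaces are my own, chosen for brevity, and agreement here is taken as the pseudo-labeling criterion rather than per-model confidence):

```python
def co_train(train, views, labeled, unlabeled, rounds=5):
    """Two models on disjoint feature views teach each other.

    train     : function(instances, labels) -> predict function
    views     : pair of functions projecting an instance onto each view
    labeled   : list of (instance, label) pairs
    unlabeled : list of instances
    """
    data = list(labeled)
    pool = list(unlabeled)
    models = None
    for _ in range(rounds):
        # train one model per feature view on the current labeled set
        models = [train([v(x) for x, _ in data], [y for _, y in data])
                  for v in views]
        agreed, rest = [], []
        for x in pool:
            p0 = models[0](views[0](x))
            p1 = models[1](views[1](x))
            (agreed if p0 == p1 else rest).append((x, p0))
        if not agreed:
            break
        data += agreed            # agreed predictions become new "labels"
        pool = [x for x, _ in rest]
    return models
```

Each model effectively labels training data for the other, which is why the disjoint (ideally conditionally independent) views matter.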


What is "self-training"?

As per this paper: one learns a model by training on a small amount of labeled data. The model is then applied to a large amount of unlabeled data; its predictions are assumed to be correct, and it is retrained on the unlabeled data according to its own predictions.
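A toy sketch of that loop, assuming a train function that returns a model mapping an instance to a (label, confidence) pair (the confidence threshold is my own addition, a common variant of plain self-training):

```python
def self_train(train, labeled, unlabeled, rounds=5, threshold=0.5):
    """Fit on labeled data, pseudo-label confident unlabeled examples,
    retrain on them, repeat."""
    data, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        model = train(data)
        keep, confident = [], []
        for x in pool:
            label, conf = model(x)
            if conf >= threshold:
                confident.append((x, label))   # trust its own prediction
            else:
                keep.append(x)
        if not confident:
            break
        data += confident
        pool = keep
    return train(data)
```

The threshold is what keeps uncertain predictions out of the training set, though an early confident mistake can still reinforce itself in later rounds.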


What is "constraint-based" learning?

I first heard of this in the issues for the NLP workshop. I found this paper.

The idea is to incorporate domain knowledge into semi-supervised learning in the form of task-specific constraints. For example, in an IE task of extracting fields from a citation record, one can state constraints such as: fields that start with 19xx or 20xx are dates. Generated solutions are then penalized for not meeting these constraints.
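A toy sketch of the penalty idea for the citation example (the DATE constraint, the scoring interface, and the penalty weight are all hypothetical, just to show the shape of constraint-based re-scoring):

```python
import re

def constrained_score(tokens, labeling, model_score, penalty=1.0):
    """Re-score a candidate labeling by penalizing constraint violations.

    Hypothetical constraint: a token matching 19xx/20xx should be
    labeled DATE; each violation subtracts `penalty` from the score.
    """
    violations = sum(
        1 for tok, lab in zip(tokens, labeling)
        if re.fullmatch(r"(19|20)\d{2}", tok) and lab != "DATE"
    )
    return model_score - penalty * violations
```

The model's own score stays untouched; the constraint only re-ranks candidate labelings, so a slightly higher-scoring labeling that mislabels "2004" can lose to one that respects the constraint.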


Workshop on Semi-supervised Learning for Natural Language Processing at NAACL HLT 2009

Interesting write-up of the problems in semi-supervised learning for NLP.
Parts that interest me:
1. "What are the different classes of NLP problem structures (e.g. sequences, trees, N-best lists) and what algorithms are best suited for each class? For instance, can graph-based algorithms be successfully applied to sequence-to-sequence problems like machine translation, or are self-training and feature-based methods the only reasonable choices for these problems? "
2. "What kinds of NLP-specific background knowledge can we exploit to aid semi-supervised learning? Recent learning paradigms such as constraint-driven learning and prototype learning take advantage of our domain knowledge about particular NLP tasks."
3. Many semi-supervised learning methods (e.g. transductive SVM, graph-based methods) have been originally developed for binary classification problems. NLP problems often pose new challenges to these techniques, involving more complex structure that can violate many of the underlying assumptions.
