In natural language processing, the bootstrapping algorithm introduced by David Yarowsky 15 years ago is a discriminative unsupervised learning algorithm that uses a small set of seed rules to bootstrap a classifier. (This is bootstrapping in the ordinary sense, which is distinct from the bootstrap in statistics.) The Yarowsky algorithm works remarkably well on a wide variety of NLP classification tasks, such as distinguishing between word senses and deciding whether a noun phrase is an organization, location, or person.
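To make the idea of bootstrapping from seed rules concrete, here is a minimal sketch of a Yarowsky-style loop. It is an illustration under stated assumptions, not the exact procedure from the talk: the function name `bootstrap`, the smoothed confidence score, the `threshold` and `smoothing` parameters, and the toy "plant" data are all hypothetical.

```python
from collections import Counter, defaultdict


def bootstrap(examples, seed_rules, threshold=0.9, smoothing=0.1, max_iters=20):
    """Yarowsky-style bootstrapping sketch (hypothetical helper).

    examples:   list of feature lists, one per unlabeled example
    seed_rules: dict mapping a feature to a label
    Returns (labels, rules): labels maps example index -> label,
    rules maps feature -> (label, confidence).
    """
    num_labels = len(set(seed_rules.values()))
    # Seed rules are trusted fully and are kept in every iteration.
    rules = {f: (y, 1.0) for f, y in seed_rules.items()}
    labels = {}
    for _ in range(max_iters):
        # Step 1: label each example with its most confident matching rule.
        new_labels = {}
        for i, feats in enumerate(examples):
            matches = [rules[f] for f in sorted(feats) if f in rules]
            if matches:
                new_labels[i] = max(matches, key=lambda rule: rule[1])[0]
        if new_labels == labels:
            break  # the labeling is stable, so stop
        labels = new_labels

        # Step 2: re-estimate rules from the currently labeled examples.
        counts = defaultdict(Counter)
        for i, y in labels.items():
            for f in examples[i]:
                counts[f][y] += 1
        rules = {f: (y, 1.0) for f, y in seed_rules.items()}
        for f, label_counts in counts.items():
            if f in seed_rules:
                continue
            y, n = label_counts.most_common(1)[0]
            total = sum(label_counts.values())
            confidence = (n + smoothing) / (total + smoothing * num_labels)
            # Cautious step: keep only rules above the confidence threshold.
            if confidence >= threshold:
                rules[f] = (y, confidence)
    return labels, rules


# Toy word-sense data for "plant" (assumed for illustration only).
EXAMPLES = [
    ["life", "animal"],
    ["manufacturing", "car"],
    ["animal", "species"],
    ["car", "assembly"],
    ["species", "garden"],
    ["assembly", "worker"],
]
SEEDS = {"life": "living", "manufacturing": "factory"}

labels, rules = bootstrap(EXAMPLES, SEEDS)
print(labels)
```

Only two examples match a seed rule at first; the remaining four are labeled in later iterations as new rules such as `animal` and `car` pass the threshold.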
Extending previous attempts at providing an objective function optimization view of Yarowsky, we show that bootstrapping a classifier from a small set of seed rules can be viewed as the propagation of labels between examples via the features they share. This paper introduces a novel variant of the Yarowsky algorithm based on this view. It is a bootstrapping learning method that uses a graph propagation algorithm with a well-defined per-iteration objective function that incorporates the cautious behaviour of the original Yarowsky algorithm.
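The propagation view can be illustrated with a small sketch. The code below is not the algorithm or objective function presented in the talk; it swaps in plain iterative averaging on a bipartite graph of examples and features, with seed features clamped to their labels. The function name `propagate_labels`, the `iters` parameter, and the toy data are assumptions made for illustration.

```python
def propagate_labels(examples, seed_rules, label_names, iters=10):
    """Simple label propagation over an example-feature graph (sketch).

    examples:    list of non-empty feature lists
    seed_rules:  dict mapping a feature to a label; these stay clamped
    label_names: list of all possible labels
    Returns (ex_dist, feat_dist), each holding label distributions.
    """
    uniform = {y: 1.0 / len(label_names) for y in label_names}
    features = {f for feats in examples for f in feats}

    # Every feature starts uniform; seed features start one-hot.
    feat_dist = {f: dict(uniform) for f in features}
    for f, y in seed_rules.items():
        feat_dist[f] = {lab: float(lab == y) for lab in label_names}
    ex_dist = [dict(uniform) for _ in examples]

    def average(dists):
        return {y: sum(d[y] for d in dists) / len(dists) for y in label_names}

    for _ in range(iters):
        # Examples take the average distribution of their features.
        ex_dist = [average([feat_dist[f] for f in feats]) for feats in examples]
        # Non-seed features take the average of the examples containing them.
        for f in features:
            if f not in seed_rules:
                neighbours = [ex_dist[i] for i, feats in enumerate(examples) if f in feats]
                feat_dist[f] = average(neighbours)
    return ex_dist, feat_dist


# Toy word-sense data for "plant" (assumed for illustration only).
EXAMPLES = [
    ["life", "animal"],
    ["manufacturing", "car"],
    ["animal", "species"],
    ["car", "assembly"],
    ["species", "garden"],
    ["assembly", "worker"],
]
SEEDS = {"life": "living", "manufacturing": "factory"}

ex_dist, feat_dist = propagate_labels(EXAMPLES, SEEDS, ["living", "factory"])
predicted = [max(dist, key=dist.get) for dist in ex_dist]
print(predicted)
```

Examples far from a seed feature, such as the one containing `garden`, end up with a weaker but still decisive distribution, which reflects how label information spreads along shared features.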
The experimental results show that our proposed bootstrapping algorithm achieves state-of-the-art performance or better on several different natural language data sets, outperforming other unsupervised methods such as the EM algorithm. We show that cautious learning is an important, though still poorly understood, principle in unsupervised learning, and that the Yarowsky algorithm can match or outperform co-training without any reliance on multiple views.
About the Speaker: Anoop Sarkar is an Associate Professor at Simon Fraser University in British Columbia, Canada, where he co-directs the Natural Language Laboratory. He received his Ph.D. from the Department of Computer and Information Sciences at the University of Pennsylvania under Prof. Aravind Joshi for his work on semi-supervised statistical parsing using tree-adjoining grammars. His research is focused on statistical parsing and machine translation (exploiting syntax or morphology, semi-supervised learning, and domain adaptation). His interests also include formal language theory and stochastic grammars, in particular tree automata and tree-adjoining grammars.