Semi-supervised sequence modeling with syntactic topic models

Authors:
Wei Li; Andrew McCallum
Affiliations:
Computer Science Department, University of Massachusetts, Amherst; Computer Science Department, University of Massachusetts, Amherst
Venue:
AAAI'05 Proceedings of the 20th national conference on Artificial intelligence - Volume 2
Year:
2005

References cited (7)

Class-based n-gram models of natural language. Computational Linguistics.
Representations of quasi-Newton matrices and their use in limited memory methods. Mathematical Programming: Series A and B.
Text Classification from Labeled and Unlabeled Documents using EM. Machine Learning - Special issue on information retrieval.
Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. ICML '01 Proceedings of the Eighteenth International Conference on Machine Learning.
Latent dirichlet allocation. The Journal of Machine Learning Research.
Applying discrete PCA in data analysis. UAI '04 Proceedings of the 20th conference on Uncertainty in artificial intelligence.
The first international Chinese word segmentation Bakeoff. SIGHAN '03 Proceedings of the second SIGHAN workshop on Chinese language processing - Volume 17.

Cited by (14)

Automatic new topic identification using multiple linear regression. Information Processing and Management: an International Journal.
Semi-supervised conditional random fields for improved sequence segmentation and labeling. ACL-44 Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics.
Simple, robust, scalable semi-supervised learning via expectation regularization. Proceedings of the 24th international conference on Machine learning.
Mining, indexing, and searching for textual chemical molecule information on the web. Proceedings of the 17th international conference on World Wide Web.
Using hidden Markov random fields to combine distributional and pattern-based word clustering. COLING '08 Proceedings of the 22nd International Conference on Computational Linguistics - Volume 1.
Domain adaptation with latent semantic association for named entity recognition. NAACL '09 Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics.
Locating complex named entities in web text. IJCAI'07 Proceedings of the 20th international joint conference on Artificial intelligence.
Generalized Expectation Criteria for Semi-Supervised Learning with Weakly Labeled Data. The Journal of Machine Learning Research.
Word representations: a simple and general method for semi-supervised learning. ACL '10 Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics.
Think globally, apply locally: using distributional characteristics for Hindi named entity identification. NEWS '10 Proceedings of the 2010 Named Entities Workshop.
Semi-supervised learning of concatenative morphology. SIGMORPHON '10 Proceedings of the 11th Meeting of the ACL Special Interest Group on Computational Morphology and Phonology.
Semi-supervised ranking for document retrieval. Computer Speech and Language.
Identifying, Indexing, and Ranking Chemical Formulae and Chemical Names in Digital Documents. ACM Transactions on Information Systems (TOIS).
The latent words language model. Computer Speech and Language.

Abstract

Although there has been significant previous work on semi-supervised learning for classification, there has been relatively little in sequence modeling. This paper presents an approach that leverages recent work in manifold learning on sequences to discover word clusters from language data, including both syntactic classes and semantic topics. From unlabeled data we form a smooth, low-dimensional feature space, where each word token is projected based on its underlying role as a function or content word. We then use this projection as additional input features to a linear-chain conditional random field trained on limited labeled training data. On standard part-of-speech tagging and Chinese word segmentation data sets we show as much as 14% error reduction due to the unlabeled data, and also statistically significant improvements over a related semi-supervised sequence tagging method due to Miller et al.
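
The feature-augmentation step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The `CLUSTER_POSTERIORS` table, its three clusters, and every number in it are invented for illustration, as are the function names `token_features` and `sentence_features`. The sketch also simplifies by looking up clusters per word type, whereas the abstract describes a projection of each word token. It only shows how soft cluster memberships learned from unlabeled text could be appended to ordinary lexical features before training a linear-chain tagger.

```python
from typing import Dict, List

# Hypothetical soft cluster memberships, standing in for a projection
# learned from unlabeled text. All values are invented for illustration.
CLUSTER_POSTERIORS: Dict[str, List[float]] = {
    "the": [0.9, 0.1, 0.0],
    "bank": [0.0, 0.3, 0.7],
    "opened": [0.1, 0.8, 0.1],
}


def token_features(sentence: List[str], i: int) -> Dict[str, float]:
    """Build features for token i: lexical indicators plus cluster weights."""
    word = sentence[i].lower()
    feats: Dict[str, float] = {
        "bias": 1.0,
        f"word={word}": 1.0,
        f"suffix2={word[-2:]}": 1.0,
    }
    if i > 0:
        feats[f"prev_word={sentence[i - 1].lower()}"] = 1.0
    # Real-valued cluster features; words missing from the table get none.
    for k, p in enumerate(CLUSTER_POSTERIORS.get(word, [])):
        if p > 0.0:
            feats[f"cluster_{k}"] = p
    return feats


def sentence_features(sentence: List[str]) -> List[Dict[str, float]]:
    """Return one feature dictionary per token in the sentence."""
    return [token_features(sentence, i) for i in range(len(sentence))]


if __name__ == "__main__":
    for feats in sentence_features(["The", "bank", "opened"]):
        print(feats)
```

Each sentence becomes a list of feature dictionaries, one per token, which is the kind of input a sequence tagger would be trained on using the labeled data. Because the cluster features come from the lookup table rather than from labels, a word that is rare in the labeled set can still share features with more frequent words in the same cluster.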