Unsupervised Part-of-Speech Tagging in the Large

Authors:
Chris Biemann
Affiliations:
Microsoft Corporation / Powerset, San Francisco, USA 94110
Venue:
Research on Language and Computation
Year:
2009

Citing 25
Cited 0

Class-based n-gram models of natural language

Computational Linguistics
The art of computer programming, volume 3: (2nd ed.) sorting and searching

The art of computer programming, volume 3: (2nd ed.) sorting and searching
Information Retrieval

Information Retrieval
An Optimal Graph Theoretic Approach to Data Clustering: Theory and Its Application to Image Segmentation

IEEE Transactions on Pattern Analysis and Machine Intelligence
Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data

ICML '01 Proceedings of the Eighteenth International Conference on Machine Learning
Information-theoretic co-clustering

Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Accurate methods for the statistics of surprise and coincidence

Computational Linguistics - Special issue on using large corpora: I
TnT: a statistical part-of-speech tagger

ANLC '00 Proceedings of the sixth conference on Applied natural language processing
A simple rule-based part of speech tagger

ANLC '92 Proceedings of the third conference on Applied natural language processing
Distributional part-of-speech tagging

EACL '95 Proceedings of the seventh conference on European chapter of the Association for Computational Linguistics
Part-of-speech induction from scratch

ACL '93 Proceedings of the 31st annual meeting on Association for Computational Linguistics
Decision lists for lexical ambiguity resolution: application to accent restoration in Spanish and French

ACL '94 Proceedings of the 32nd annual meeting on Association for Computational Linguistics
Combining distributional and morphological information for part of speech induction

EACL '03 Proceedings of the tenth conference on European chapter of the Association for Computational Linguistics - Volume 1
Scaling to very very large corpora for natural language disambiguation

ACL '01 Proceedings of the 39th Annual Meeting on Association for Computational Linguistics
Shallow parsing on the basis of words only: a case study

ACL '02 Proceedings of the 40th Annual Meeting on Association for Computational Linguistics
Feature-rich part-of-speech tagging with a cyclic dependency network

NAACL '03 Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1
Inducing syntactic categories by context distribution clustering

ConLL '00 Proceedings of the 2nd workshop on Learning language in logic and the 4th conference on Computational natural language learning - Volume 7
Introduction to the CoNLL-2000 shared task: chunking

ConLL '00 Proceedings of the 2nd workshop on Learning language in logic and the 4th conference on Computational natural language learning - Volume 7
Domain kernels for word sense disambiguation

ACL '05 Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics
Toward unsupervised whole-corpus tagging

COLING '04 Proceedings of the 20th international conference on Computational Linguistics
Prototype-driven learning for sequence models

HLT-NAACL '06 Proceedings of the main conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics
A practical solution to the problem of automatic part-of-speech induction from text

ACLdemo '05 Proceedings of the ACL 2005 on Interactive poster and demonstration sessions
Domain adaptation for statistical classifiers

Journal of Artificial Intelligence Research
Chinese whispers: an efficient graph clustering algorithm and its application to natural language processing problems

TextGraphs-1 Proceedings of the First Workshop on Graph Based Methods for Natural Language Processing
Equations for part-of-speech tagging

AAAI'93 Proceedings of the eleventh national conference on Artificial intelligence

Quantified Score

Hi-index	0.00

Visualization

Abstract

Syntactic preprocessing is a step that is widely used in NLP applications. Traditionally, rule-based or statistical Part-of-Speech (POS) taggers are employed that either need considerable rule development times or a sufficient amount of manually labeled data. To alleviate this acquisition bottleneck and to enable preprocessing for minority languages and specialized domains, a method is presented that constructs a statistical syntactic tagger model from a large amount of unlabeled text data. The method presented here is called unsupervised POS-tagging, as its application results in corpus annotation in a comparable way to what POS-taggers provide. Nevertheless, its application results in slightly different categories as opposed to what is assumed by a linguistically motivated POS-tagger. These differences hamper evaluation procedures that compare the output of the unsupervised POS-tagger to a tagging with a supervised tagger. To measure the extent to which unsupervised POS-tagging can contribute in application-based settings, the system is evaluated in supervised POS-tagging, word sense disambiguation, named entity recognition and chunking. Unsupervised POS-tagging has been explored since the beginning of the 1990s. Unlike in previous approaches, the kind and number of different tags is here generated by the method itself. Another difference to other methods is that not all words above a certain frequency rank get assigned a tag, but the method is allowed to exclude words from the clustering, if their distribution does not match closely enough with other words. The lexicon size is considerably larger than in previous approaches, resulting in a lower out-of-vocabulary (OOV) rate and in a more consistent tagging. The system presented here is available for download as open-source software along with tagger models for several languages, so the contributions of this work can be easily incorporated into other applications.