We present a web-based algorithm for POS tagging of unknown words (words appearing only a small number of times in the training data of a supervised POS tagger). When a sentence s containing an unknown word u is to be tagged by a trained POS tagger, our algorithm collects from the web contexts that are partially similar to the context of u in s; these contexts are then used to compute new tag assignment probabilities for u. Our algorithm enables fast multi-domain unknown word tagging since, unlike previous work, it does not require a corpus from the new domain. We integrate our algorithm into the MXPOST POS tagger (Ratnaparkhi, 1996) and experiment with three languages (English, German, and Chinese) in seven in-domain and domain adaptation scenarios. Our algorithm provides an error reduction of up to 15.63% (English), 18.09% (German), and 13.57% (Chinese) over the original tagger.
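The core idea — build wildcard queries from the context of the unknown word, gather the tags of known words that fill the wildcard slot in matching web contexts, and interpolate the resulting distribution with the base tagger's — can be illustrated with a toy sketch. This is not the paper's actual procedure: the query construction, the "web" (here a tiny in-memory corpus standing in for real search results), the tag lexicon, and the interpolation weight `lam` are all simplified assumptions for illustration.

```python
from collections import Counter

def context_queries(sentence, idx, window=1):
    # Hypothetical simplification: one left-context and one right-context
    # pattern with "*" in place of the unknown word.
    left = tuple(sentence[max(0, idx - window):idx])
    right = tuple(sentence[idx + 1:idx + 1 + window])
    return [left + ("*",), ("*",) + right]

def web_tag_distribution(queries, web_corpus, lexicon):
    # Toy stand-in for web search: scan a small corpus for n-grams matching
    # a query, and count the tags (from a known-word lexicon) of whatever
    # word fills the "*" slot.
    counts = Counter()
    for sent in web_corpus:
        for q in queries:
            n, star = len(q), q.index("*")
            for i in range(len(sent) - n + 1):
                span = sent[i:i + n]
                if all(a == "*" or a == b for a, b in zip(q, span)):
                    for tag in lexicon.get(span[star], []):
                        counts[tag] += 1
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()} if total else {}

def combine(tagger_probs, web_probs, lam=0.5):
    # Interpolate the base tagger's tag distribution for the unknown word
    # with the web-derived distribution (lam is an assumed weight).
    tags = set(tagger_probs) | set(web_probs)
    return {t: lam * tagger_probs.get(t, 0.0)
               + (1 - lam) * web_probs.get(t, 0.0) for t in tags}
```

For example, for the unknown word in `["the", "zorply", "car"]`, contexts like "the fast car" and "a red car" push probability mass toward the adjective tag, which can flip the tagger's decision when its own distribution for the unknown word is unreliable.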