Adapting a WSJ trained part-of-speech tagger to noisy text: preliminary results

Authors:
Phani Gadde;L. V. Subramaniam;Tanveer A. Faruquie
Affiliations:
Language Technologies Research Centre, IIIT-Hyderabad, India;IBM Research, New Delhi, India;IBM Research, New Delhi, India
Venue:
Proceedings of the 2011 Joint Workshop on Multilingual OCR and Analytics for Noisy Unstructured Text Data
Year:
2011


Abstract

As more people communicate over the internet, the amount of text available online has grown steadily. Most such text differs from the standard language, because people use various kinds of short forms for words to save time and effort. We call such text noisy text. Part-of-speech (POS) tagging has reached accuracy levels high enough that automatic POS tags are used in various language processing tasks; however, tagging performance degrades rapidly on noisy text. This paper attempts to adapt a state-of-the-art English POS tagger, trained on the Wall Street Journal (WSJ) corpus, to noisy text. We classify the noise in text into different types and evaluate the tagger with respect to each type. We attack the problem of tagging noisy text in two ways: (a) overcoming the noise in a post-processing step after tagging, and (b) cleaning the noise and then tagging. We propose techniques for both approaches and compare them critically on the basis of error analysis. We demonstrate the proposed models on a Short Message Service (SMS) dataset, where they achieve a significant improvement over the baseline accuracy of a state-of-the-art English POS tagger on noisy words.
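The two strategies named in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy lexicon, the short-form table, and the functions `toy_tag`, `clean_then_tag`, and `tag_then_postprocess` are hypothetical stand-ins, not the tagger, resources, or techniques used in the paper.

```python
# Toy stand-ins for illustration only; none of these tables or functions
# come from the paper.
TOY_LEXICON = {"see": "VB", "you": "PRP", "tomorrow": "NN",
               "are": "VBP", "great": "JJ"}
SHORT_FORMS = {"c": "see", "u": "you", "2moro": "tomorrow",
               "r": "are", "gr8": "great"}


def toy_tag(tokens):
    """Hypothetical tagger: lexicon lookup, unknown words default to NN."""
    return [(tok, TOY_LEXICON.get(tok.lower(), "NN")) for tok in tokens]


def clean(tokens):
    """Replace known short forms with their standard spellings."""
    return [SHORT_FORMS.get(tok.lower(), tok) for tok in tokens]


def clean_then_tag(tokens):
    """Strategy (b): clean the noise, tag, and map tags back to the input."""
    tags = [tag for _, tag in toy_tag(clean(tokens))]
    return list(zip(tokens, tags))


def tag_then_postprocess(tokens):
    """Strategy (a): tag the noisy text, then repair tags of noisy tokens."""
    fixed = []
    for tok, tag in toy_tag(tokens):
        standard = SHORT_FORMS.get(tok.lower())
        if standard is not None:
            # Assumed repair rule: reuse the tag of the standard form.
            tag = TOY_LEXICON.get(standard, tag)
        fixed.append((tok, tag))
    return fixed


if __name__ == "__main__":
    sms = ["c", "u", "2moro"]
    print(toy_tag(sms))
    print(clean_then_tag(sms))
    print(tag_then_postprocess(sms))
```

Because the toy tagger ignores context, both strategies produce the same tags here. With a context-sensitive tagger they can diverge, since cleaning first also changes the context seen by neighbouring words, while post-processing leaves the tagger's view of the noisy sentence unchanged.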