Experiments with artificially generated noise for cleansing noisy text

Authors:
Phani Gadde;Rahul Goutam;Rakshit Shah;Hemanth Sagar Bayyarapu;L. V. Subramaniam
Affiliations:
Language Technologies Research Centre, IIIT-Hyderabad, India;Language Technologies Research Centre, IIIT-Hyderabad, India;Language Technologies Research Centre, IIIT-Hyderabad, India;Language Technologies Research Centre, IIIT-Hyderabad, India;IBM Research, New Delhi, India
Venue:
Proceedings of the 2011 Joint Workshop on Multilingual OCR and Analytics for Noisy Unstructured Text Data
Year:
2011


Abstract

Recent work shows that noisy text normalization can be treated as a machine translation (MT) problem, with convincing results. Supervised MT approaches train a model on noisy-regular parallel data, while unsupervised models learn the translation probabilities in alternative ways and try to mimic the MT-based approach. The supervised approaches suffer from data annotation and domain adaptation difficulties, whereas the unsupervised models lack a holistic approach that caters to all types of noise. In this paper, we propose an algorithm that artificially generates noisy text, in a controlled way, from any regular English text. We see this as an alternative to the unsupervised approaches that retains the advantages of a parallel-corpus-based MT approach. We generate parallel noisy text from two widely used regular English datasets and test the MT-based approach to text normalization. We also try semi-supervised approaches, exploring ways to use the generated noisy text to improve an MT approach based on a manually annotated parallel corpus. We present an extensive analysis comparing our approaches with both supervised and unsupervised approaches.
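The idea of generating a noisy-regular parallel corpus from clean text can be illustrated with a minimal sketch. This is not the algorithm proposed in the paper, which the abstract does not describe. The noise operations (vowel dropping, adjacent-character swaps, character deletion), the `noise_rate` parameter, and all function names below are assumptions chosen only for illustration.

```python
import random

VOWELS = set("aeiou")


def drop_vowels(word):
    """Remove non-initial vowels, e.g. 'message' -> 'mssg' (hypothetical operation)."""
    return word[0] + "".join(c for c in word[1:] if c.lower() not in VOWELS)


def swap_adjacent(word, rng):
    """Swap one randomly chosen pair of adjacent characters (hypothetical operation)."""
    i = rng.randrange(len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]


def drop_char(word, rng):
    """Delete one randomly chosen character (hypothetical operation)."""
    i = rng.randrange(len(word))
    return word[:i] + word[i + 1:]


def add_noise(sentence, noise_rate, seed=0):
    """Corrupt each token longer than two characters with probability noise_rate."""
    rng = random.Random(seed)  # seeded so the generated noise is reproducible
    noisy_tokens = []
    for word in sentence.split():
        if len(word) > 2 and rng.random() < noise_rate:
            op = rng.choice(["vowels", "swap", "drop"])
            if op == "vowels":
                word = drop_vowels(word)
            elif op == "swap":
                word = swap_adjacent(word, rng)
            else:
                word = drop_char(word, rng)
        noisy_tokens.append(word)
    return " ".join(noisy_tokens)


def make_parallel_corpus(sentences, noise_rate, seed=0):
    """Return (noisy, regular) sentence pairs, one pair per input sentence."""
    return [
        (add_noise(s, noise_rate, seed + i), s)
        for i, s in enumerate(sentences)
    ]


if __name__ == "__main__":
    clean = ["please send me the message before tomorrow morning"]
    for noisy, regular in make_parallel_corpus(clean, noise_rate=0.5):
        print(noisy, "|||", regular)
```

In this sketch, `noise_rate` is the single control knob: 0 leaves the text untouched and 1 attempts to corrupt every eligible token. The resulting pairs are shaped like the source and target sides of a parallel corpus, which is the form of training data an MT-based normalizer would consume.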