Recent work shows that noisy text normalization can be treated as a machine translation (MT) problem, with convincing results. Supervised MT approaches train a translation model on noisy-regular parallel data, while unsupervised models learn the translation probabilities in other ways and try to mimic the MT-based approach. The supervised approaches suffer from data annotation and domain adaptation difficulties; the unsupervised models lack a holistic approach that covers all types of noise. In this paper, we propose an algorithm that artificially generates noisy text, in a controlled way, from any regular English text. We see this as an alternative to the unsupervised approaches that retains the advantages of an MT approach based on a parallel corpus. We generate parallel noisy text from two widely used regular English datasets and test the MT-based approach to text normalization on it. We also explore semi-supervised approaches that use the generated noisy text to improve an MT model trained on a manually annotated parallel corpus. We present an extensive analysis comparing our approaches with both supervised and unsupervised approaches.
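The abstract does not specify the noise operations used, so the following is only a minimal sketch of the general idea of controlled noise generation: clean sentences are corrupted word by word at a chosen rate, and each noisy sentence is paired with its clean source to form parallel data for an MT-style normalizer. The function names, the substitution table, and the `noise_rate` parameter are all hypothetical and are not taken from the paper.

```python
import random

# Hypothetical noise operations, chosen for illustration only.
# They are not the operations proposed in the paper.
VOWELS = set("aeiou")
SUBSTITUTIONS = {"you": "u", "are": "r", "to": "2", "for": "4"}


def drop_vowels(word):
    """Remove non-initial vowels from words longer than three characters."""
    if len(word) <= 3:
        return word
    return word[0] + "".join(c for c in word[1:] if c.lower() not in VOWELS)


def truncate(word):
    """Keep only the first three characters of words longer than four."""
    if len(word) <= 4:
        return word
    return word[:3]


def add_noise(sentence, noise_rate, rng):
    """Corrupt each whitespace-separated token with probability noise_rate."""
    noisy = []
    for word in sentence.split():
        if rng.random() < noise_rate:
            lower = word.lower()
            if lower in SUBSTITUTIONS:
                word = SUBSTITUTIONS[lower]
            else:
                word = rng.choice([drop_vowels, truncate])(word)
        noisy.append(word)
    return " ".join(noisy)


def make_parallel_corpus(sentences, noise_rate=0.3, seed=0):
    """Return (noisy, clean) sentence pairs; a fixed seed keeps runs repeatable."""
    rng = random.Random(seed)
    return [(add_noise(s, noise_rate, rng), s) for s in sentences]


if __name__ == "__main__":
    for noisy, clean in make_parallel_corpus(["see you tomorrow for lunch"], 0.8):
        print(noisy, "<-", clean)
```

In this sketch the `noise_rate` parameter is what makes the generation controlled: at 0.0 the output equals the input, and raising it corrupts a larger share of tokens. Under these assumptions, the resulting pairs could serve as the noisy and regular sides of a training corpus for a phrase-based MT toolkit.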