Unsupervised cleansing of noisy text

Authors:
Danish Contractor;Tanveer A. Faruquie;L. Venkata Subramaniam
Affiliations:
IBM India Software Labs;IBM Research India;IBM Research India
Venue:
COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics: Posters
Year:
2010

Abstract

This paper addresses the problem of cleansing noisy text using a statistical machine translation model. Noisy text is produced in informal communication such as Short Message Service (SMS), Twitter and chat. A typical statistical machine translation system is trained on parallel text comprising noisy and clean sentences. We instead propose an unsupervised method for translating noisy text to clean text. The method has two steps. First, for a given noisy sentence, a weighted list of possible clean tokens is obtained for each noisy token. Second, the clean sentence is obtained by maximizing the product of the weights from these lists and the language model scores.
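
The second step can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not say how the candidate lists are built or which language model is used, so the candidate weights, the bigram table and the names `decode`, `toy_lm` and `TOY_BIGRAMS` are all assumptions made for illustration. Given one weighted candidate list per noisy token, a Viterbi search picks the clean sequence that maximizes the product of candidate weights and bigram language model probabilities (computed as a sum of logs).

```python
import math


def decode(candidates, bigram_logprob):
    """Pick the clean sentence maximizing weight * language-model score.

    candidates: one dict per noisy token, mapping clean token -> weight (> 0).
    bigram_logprob: function (prev, cur) -> log P(cur | prev); the sentence
        start is represented by the hypothetical marker "<s>".
    """
    # best[token] = (log score, path) of the best partial sentence ending in token
    best = {"<s>": (0.0, [])}
    for options in candidates:
        nxt = {}
        for tok, weight in options.items():
            # Best predecessor for this candidate under the bigram model.
            score, path = max(
                ((s + bigram_logprob(prev, tok), p) for prev, (s, p) in best.items()),
                key=lambda item: item[0],
            )
            nxt[tok] = (score + math.log(weight), path + [tok])
        best = nxt
    return max(best.values(), key=lambda item: item[0])[1]


# Toy bigram probabilities, invented for this example only.
TOY_BIGRAMS = {
    ("<s>", "see"): 0.2, ("<s>", "sea"): 0.1,
    ("see", "you"): 0.5, ("sea", "you"): 0.01,
    ("you", "later"): 0.3, ("you", "letter"): 0.01,
}


def toy_lm(prev, cur):
    # Unseen bigrams get a small floor probability instead of zero.
    return math.log(TOY_BIGRAMS.get((prev, cur), 1e-4))


# Hypothetical candidate lists for the noisy sentence "c u l8r".
candidates = [
    {"see": 0.6, "sea": 0.4},
    {"you": 0.9, "u": 0.1},
    {"later": 0.7, "letter": 0.3},
]

print(decode(candidates, toy_lm))
```

With these toy numbers the search returns "see you later". Working in log space keeps the product of many small probabilities from underflowing, and the dynamic program avoids enumerating every combination of candidates.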