Unsupervised learning of multilingual short message service (SMS) dialect from noisy examples

Authors:
Sreangsu Acharyya;Sumit Negi;L. V. Subramaniam;Shourya Roy
Affiliations:
University of Texas, Austin;IBM Indian Research Lab;IBM Indian Research Lab;IBM Indian Research Lab
Venue:
Proceedings of the second workshop on Analytics for noisy unstructured text data
Year:
2008

Abstract

Noise in textual data, such as that introduced by multilinguality, misspellings, abbreviations, deletions, phonetic spellings, and non-standard transliteration, poses considerable problems for text mining. Such corruptions are very common in instant messenger (IM) and short message service (SMS) data and adversely affect off-the-shelf text-mining methods. Most techniques address this problem with supervised methods, but these require labels that are expensive and time-consuming to obtain. While we do not champion unsupervised methods over supervised ones when quality of results is the sole concern, we demonstrate that unsupervised methods can provide cost-effective results without the expensive human intervention needed to generate parallel labelled corpora. An unsupervised technique based on a generative model is presented that maps non-standard words to their corresponding conventional frequent forms. A Hidden Markov Model (HMM) over a subsequencized representation of words is used, subject to a parameterization such that the training phase involves clustering over vectors rather than the customary dynamic programming over sequences. A principled transformation of the maximum-likelihood-based "central clustering" cost function into a "pairwise similarity" based clustering is proposed. This transformation makes it possible to apply "subsequence kernel" based methods that model delete and insert edit operations well. The novelty of this approach lies in showing that the expensive (Baum-Welch) iterations required for HMM training can be avoided through a careful factorization of the HMM log-likelihood, and in establishing the connection between the information-theoretic cost function and the kernel approach of machine learning. Anecdotal evidence of efficacy is provided on public and proprietary data.
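
To make the "subsequence kernel" idea in the abstract concrete, the following is a minimal sketch of a gap-weighted subsequence similarity between words. It is not the authors' HMM-based method: the function names, the subsequence length `k`, the `decay` value, and the nearest-neighbour lookup are all assumptions chosen for illustration. It only shows why subsequence features tolerate the deleted characters typical of SMS spellings.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def subsequence_features(word, k=2, decay=0.5):
    """Map a word to weighted counts of its length-k subsequences.

    Each occurrence of a subsequence is weighted by decay ** span, where
    span is the distance it covers in the word, so occurrences with gaps
    count less than contiguous ones. Brute force; fine for short words.
    """
    features = defaultdict(float)
    for idx in combinations(range(len(word)), k):
        span = idx[-1] - idx[0] + 1
        features["".join(word[i] for i in idx)] += decay ** span
    return features


def similarity(a, b, k=2, decay=0.5):
    """Normalized dot product of the two subsequence feature vectors."""
    fa = subsequence_features(a, k, decay)
    fb = subsequence_features(b, k, decay)
    dot = sum(w * fb.get(u, 0.0) for u, w in fa.items())
    norm = sqrt(sum(w * w for w in fa.values()) *
                sum(w * w for w in fb.values()))
    return dot / norm if norm > 0 else 0.0


def nearest_standard_form(noisy, vocabulary, k=2, decay=0.5):
    """Hypothetical helper: pick the vocabulary word most similar to `noisy`."""
    return max(vocabulary, key=lambda w: similarity(noisy, w, k, decay))


if __name__ == "__main__":
    vocab = ["tomorrow", "today", "tonight"]
    print(nearest_standard_form("tmrw", vocab))
```

Here "tmrw" shares every one of its two-character subsequences with "tomorrow" and none with "today" or "tonight", so a pairwise similarity of this kind groups the SMS spelling with its conventional form even though most vowels were deleted.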