Investigation and modeling of the structure of texting language

Authors:
Monojit Choudhury; Rahul Saraf; Vijit Jain; Animesh Mukherjee; Sudeshna Sarkar; Anupam Basu
Affiliations:
Indian Institute of Technology, Department of Computer Science and Engineering, Kharagpur, India; Malaviya National Institute of Technology, Department of Computer Engineering, Jaipur, India; D.E. Shaw India Software Private Ltd, Hyderabad, India; Indian Institute of Technology, Department of Computer Science and Engineering, Kharagpur, India; Indian Institute of Technology, Department of Computer Science and Engineering, Kharagpur, India; Indian Institute of Technology, Department of Computer Science and Engineering, Kharagpur, India
Venue:
International Journal on Document Analysis and Recognition
Year:
2007

Abstract

Language usage in computer-mediated discourse, such as chats, emails and SMS texts, differs significantly from the standard form of the language and is referred to as texting language (TL). The presence of intentional misspellings significantly decreases the accuracy of existing spell-checking techniques on TL words. In this work, we formally investigate the nature and types of compression used in SMS texts, and develop a Hidden Markov Model-based word model for TL. The model parameters have been estimated through standard machine learning techniques from a word-aligned parallel corpus of SMS texts and standard English. The accuracy of the model in correcting TL words is 57.7%, which is almost a threefold improvement over the performance of Aspell. The use of a simple bigram language model results in a 35% reduction in the relative word-level error rate.
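
To illustrate how a word-level correction model can be combined with a bigram language model, the following is a minimal sketch of noisy-channel decoding with a Viterbi search. It is not the authors' implementation: the paper's HMM word model is replaced here by a plain lookup table (`CHANNEL`), and every name and probability below (`CHANNEL`, `BIGRAM`, `FLOOR`, `decode`) is a hypothetical value chosen only for illustration.

```python
import math

# Hypothetical channel scores P(texting_token | standard_word); toy values only.
# This table stands in for a trained word model and is an assumption.
CHANNEL = {
    "c": {"see": 0.6, "sea": 0.4},
    "u": {"you": 0.9, "ewe": 0.1},
    "2moro": {"tomorrow": 1.0},
}

# Hypothetical bigram probabilities P(word | previous_word); "<s>" marks the start.
BIGRAM = {
    ("<s>", "see"): 0.5, ("<s>", "sea"): 0.5,
    ("see", "you"): 0.6, ("see", "ewe"): 0.01,
    ("sea", "you"): 0.05, ("sea", "ewe"): 0.01,
    ("you", "tomorrow"): 0.3, ("ewe", "tomorrow"): 0.01,
}
FLOOR = 1e-6  # crude stand-in for smoothing of unseen bigrams


def decode(tokens):
    """Return the most probable standard-word sequence for a list of TL tokens."""
    # best maps each candidate word to (log-probability, best path ending in it).
    best = {"<s>": (0.0, [])}
    for tok in tokens:
        # Tokens missing from the channel table pass through unchanged.
        candidates = CHANNEL.get(tok, {tok: 1.0})
        new_best = {}
        for word, p_chan in candidates.items():
            scored = []
            for prev, (score, path) in best.items():
                p_lm = BIGRAM.get((prev, word), FLOOR)
                total = score + math.log(p_lm) + math.log(p_chan)
                scored.append((total, path + [word]))
            # Keep only the highest-scoring path that ends in this word.
            new_best[word] = max(scored, key=lambda s: s[0])
        best = new_best
    return max(best.values(), key=lambda s: s[0])[1]


print(decode(["c", "u", "2moro"]))
```

With these toy tables, the bigram context is what selects "see" over "sea" for the token "c"; a real system would estimate both tables from data such as the aligned corpus described above.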