A broad-coverage normalization system for social media language

Authors:
Fei Liu; Fuliang Weng; Xiao Jiang
Affiliations:
Research and Technology Center, Robert Bosch LLC; Research and Technology Center, Robert Bosch LLC; Research and Technology Center, Robert Bosch LLC
Venue:
ACL '12 Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1
Year:
2012

Abstract

Social media language contains a huge number and wide variety of nonstandard tokens, created both intentionally and unintentionally by users. It is crucial to normalize these noisy nonstandard tokens before applying other NLP techniques. A major challenge for this task is system coverage: for any user-created nonstandard term, the system should be able to restore the correct word within its top n output candidates. In this paper, we propose a cognitively driven normalization system that integrates different human perspectives on normalizing nonstandard tokens, including enhanced letter transformation, visual priming, and string/phonetic similarity. The system was evaluated at both the word level and the message level using four SMS and Twitter data sets. Results show that our system achieves over 90% word coverage across all data sets (a 10% absolute increase over the state of the art); this broad word coverage also translates into a message-level performance gain, yielding a 6% absolute increase over the best prior approach.
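
The coverage criterion in the abstract (the correct word should appear within the system's top n candidates) can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function name `coverage_at_n`, the data layout, and the toy tokens and candidate lists are all assumptions made up for illustration.

```python
from typing import Dict, List, Tuple


def coverage_at_n(
    candidates: Dict[str, List[str]],
    gold: List[Tuple[str, str]],
    n: int,
) -> float:
    """Fraction of (token, correct_word) pairs whose correct word
    appears among the first n ranked candidates for that token."""
    if not gold:
        return 0.0
    hits = sum(
        1
        for token, correct in gold
        if correct in candidates.get(token, [])[:n]
    )
    return hits / len(gold)


# Toy, invented data (hypothetical; not taken from the paper's data sets).
# Each nonstandard token maps to a ranked list of candidate standard words.
toy_candidates = {
    "2moro": ["tomorrow", "tomato", "moro"],
    "b4": ["be", "before", "bar"],
    "gr8": ["grate", "grab", "great"],
    "luv": ["live", "lava", "lug"],
}
toy_gold = [
    ("2moro", "tomorrow"),
    ("b4", "before"),
    ("gr8", "great"),
    ("luv", "love"),
]

top1 = coverage_at_n(toy_candidates, toy_gold, 1)
top3 = coverage_at_n(toy_candidates, toy_gold, 3)
print(f"top-1 coverage: {top1:.2f}, top-3 coverage: {top3:.2f}")
```

Under this reading, raising n can only keep coverage the same or increase it, which is why coverage is reported for a fixed candidate-list size.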