Social media language contains a huge amount and wide variety of nonstandard tokens, created both intentionally and unintentionally by users. Normalizing these noisy tokens is a crucial preprocessing step before other NLP techniques can be applied. A major challenge for this task is system coverage: for any user-created nonstandard term, the system should be able to restore the correct word within its top-n output candidates. In this paper, we propose a cognitively driven normalization system that integrates different human perspectives on normalizing nonstandard tokens, including enhanced letter transformation, visual priming, and string/phonetic similarity. The system was evaluated at both the word and message level on four SMS and Twitter data sets. Results show that our system achieves over 90% word coverage across all data sets (a 10% absolute increase over the state of the art); the broad word coverage also translates into message-level performance gains, yielding a 6% absolute increase over the best prior approach.
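To make the candidate-generation idea concrete, the following is a minimal sketch (not the authors' system) of normalizing a nonstandard token against a vocabulary by combining string similarity with a crude phonetic key. The lexicon, the toy phonetic key, and the equal weighting of the two similarity scores are all illustrative assumptions; a real system would use a full dictionary, a proper phonetic algorithm, learned transformations, and language-model context for message-level ranking.

```python
from difflib import SequenceMatcher

# Toy in-vocabulary lexicon (illustrative only); a real system would use
# a full dictionary plus contextual language-model scores.
LEXICON = ["tomorrow", "tonight", "together", "see", "you", "later", "great"]

def phonetic_key(word: str) -> str:
    """Crude phonetic key: keep the first character, then drop vowels
    and collapse repeated characters. A stand-in for a real phonetic
    algorithm such as Soundex or Double Metaphone."""
    word = word.lower()
    key = word[0]
    for ch in word[1:]:
        if ch in "aeiou":
            continue
        if key[-1] != ch:
            key += ch
    return key

def normalize(token: str, top_n: int = 3) -> list[str]:
    """Rank lexicon words for a nonstandard token by an equally weighted
    mix of surface-string and phonetic-key similarity, returning the
    top-n candidates (the 'coverage' set the abstract refers to)."""
    def score(word: str) -> float:
        string_sim = SequenceMatcher(None, token.lower(), word).ratio()
        phonetic_sim = SequenceMatcher(
            None, phonetic_key(token), phonetic_key(word)).ratio()
        return 0.5 * string_sim + 0.5 * phonetic_sim
    return sorted(LEXICON, key=score, reverse=True)[:top_n]
```

For example, `normalize("2moro")` ranks "tomorrow" first because its phonetic key ("tmrw") and surface string both overlap strongly with the token, even though the token begins with a digit.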