Lexical normalisation of short text messages: makn sens a #twitter

Authors:
Bo Han; Timothy Baldwin
Affiliations:
The University of Melbourne; The University of Melbourne
Venue:
HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1
Year:
2011

Abstract

Twitter provides access to large volumes of data in real time, but this data is notoriously noisy, which hampers its utility for NLP. In this paper, we target out-of-vocabulary words in short text messages and propose a method for identifying and normalising ill-formed words. Our method uses a classifier to detect ill-formed words, and generates correction candidates based on morphophonemic similarity. Both word similarity and context are then exploited to select the most probable correction candidate for the word. The proposed method does not require any annotations, and achieves state-of-the-art performance over an SMS corpus and a novel dataset based on Twitter.
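
The pipeline outlined in the abstract (flag an out-of-vocabulary token, generate in-vocabulary candidates that look or sound similar, then pick the best one) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The toy lexicon `VOCAB`, the crude `phonetic_key` function (a stand-in for a real phonetic algorithm), the edit-distance threshold, and all function names are assumptions made for illustration. The sketch also treats every out-of-vocabulary token as a candidate for normalisation rather than using a trained ill-formed word classifier, and it ranks candidates by edit distance alone, ignoring context.

```python
# Toy in-vocabulary lexicon (hypothetical; a real system would use a large dictionary).
VOCAB = {"making", "sense", "tomorrow", "today", "see", "you", "a", "of"}


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def phonetic_key(word: str) -> str:
    """Crude phonetic code (an assumption, not a published algorithm):
    keep the first letter, drop later vowels, collapse repeated letters."""
    word = word.lower()
    key = word[:1]
    for ch in word[1:]:
        if ch in "aeiou" or ch == key[-1]:
            continue
        key += ch
    return key


def generate_candidates(token: str, vocab: set, max_edits: int = 2) -> list:
    """Return in-vocabulary words that are close in spelling or share a phonetic key."""
    token = token.lower()
    key = phonetic_key(token)
    return sorted(
        w for w in vocab
        if edit_distance(token, w) <= max_edits or phonetic_key(w) == key
    )


def normalise(tokens: list, vocab: set = VOCAB) -> list:
    """Replace each out-of-vocabulary token with its closest candidate, if any."""
    result = []
    for tok in tokens:
        if tok.lower() in vocab:
            result.append(tok)
            continue
        candidates = generate_candidates(tok, vocab)
        if candidates:
            # Rank by spelling distance only; ties are broken alphabetically.
            result.append(min(candidates, key=lambda w: (edit_distance(tok.lower(), w), w)))
        else:
            result.append(tok)  # No plausible candidate: leave the token unchanged.
    return result


print(normalise(["makn", "sens", "a", "#twitter"]))
# prints ['making', 'sense', 'a', '#twitter']
```

In this sketch, "sens" has two candidates ("see" and "sense"), and the ranking step selects "sense" because it is one edit away rather than two. A token such as "tmrw" is more than two edits from "tomorrow" but shares the same phonetic key, which shows why a sound-based match can complement a spelling-based one.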