A hybrid rule/model-based finite-state framework for normalizing SMS messages

Authors:
Richard Beaufort;Sophie Roekhaut;Louise-Amélie Cougnon;Cédrick Fairon
Affiliations:
Université catholique de Louvain, Louvain-la-Neuve, Belgium;Université de Mons, Mons, Belgium;Université catholique de Louvain, Louvain-la-Neuve, Belgium;Université catholique de Louvain, Louvain-la-Neuve, Belgium
Venue:
ACL '10 Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics
Year:
2010

Abstract

In recent years, research in natural language processing has increasingly focused on normalizing SMS messages. Several well-defined approaches have been proposed, but the problem remains far from solved: the best systems achieve an 11% Word Error Rate. This paper presents a method that shares similarities with both spell checking and machine translation approaches. The normalization part of the system is entirely based on models trained from a corpus. Evaluated on French by 10-fold cross-validation, the system achieves a 9.3% Word Error Rate and a 0.83 BLEU score.
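
For readers unfamiliar with the Word Error Rate figures quoted above, the following is a minimal sketch of how the metric is commonly computed: the word-level edit distance between a reference normalization and a system output, divided by the number of reference words. The function name and the example sentences are invented for illustration; whitespace tokenization is an assumption, and nothing here is taken from the authors' evaluation setup.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: tokens are obtained by splitting on whitespace.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Invented example: one wrong word out of four reference words.
print(word_error_rate("je suis en retard", "je suis en retar"))
```

Under this definition a corpus-level score is usually obtained by summing edit distances and reference lengths over all messages before dividing, rather than by averaging per-message rates.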