Subword variation in text message classification

Authors:
Robert Munro;Christopher D. Manning
Affiliations:
Stanford University, Stanford, CA;Stanford University, Stanford, CA
Venue:
HLT '10 Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics
Year:
2010

Abstract

For millions of people in less-resourced regions of the world, text messages (SMS) provide the only regular contact with their doctor. Classifying messages by medical labels supports rapid responses to emergencies, the early identification of epidemics, and everyday administration, but challenges include text brevity, rich morphology, phonological variation, and limited training data. We present a novel system that addresses these challenges, working with a clinic in rural Malawi and texts in the Chichewa language. We show that modeling morphological and phonological variation leads to a substantial average gain of F=0.206 and an error reduction of up to 63.8% for specific labels, relative to a baseline system optimized over word sequences. By comparison, there is no significant gain when applying the same system to the English translations of the same texts and labels, emphasizing the need for subword modeling in many languages. Language-independent morphological models perform as accurately as language-specific models, indicating broad deployment potential.
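
To illustrate why subword features can help where word-level features fail, here is a minimal sketch. It is not the system described in the abstract. It assumes character trigrams as a generic stand-in for subword modeling, and the two spelling variants are hypothetical English strings, not data from the paper. The helper names char_ngrams and jaccard are likewise illustrative.

```python
def char_ngrams(token, n=3):
    """Return the set of character n-grams of a token.

    The token is padded with '#' so that prefixes and suffixes
    produce their own n-grams (an assumption of this sketch).
    """
    padded = "#" + token + "#"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a, b):
    """Jaccard overlap between two feature sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical spelling variants of the same word.
variant_a = "tomorrow"
variant_b = "tomorow"

# Word-level features: each variant is a distinct, unrelated feature.
word_overlap = jaccard({variant_a}, {variant_b})

# Subword features: the variants share most of their character trigrams.
subword_overlap = jaccard(char_ngrams(variant_a), char_ngrams(variant_b))

print(f"word-level overlap: {word_overlap:.3f}")
print(f"subword overlap:    {subword_overlap:.3f}")
```

At the word level the two variants share no features, so a classifier trained on one learns nothing about the other. With character n-grams most features are shared, which is one simple way a model can generalize across spelling variation when training data is limited.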