Medical case-driven classification of microblogs: characteristics and annotation

Authors:
Mustafa Sofean;Kerstin Denecke;Avaré Stewart;Matthew Smith
Affiliations:
Leibniz University of Hannover, Hannover, Germany;L3S Research Center, Hannover, Germany;L3S Research Center, Hannover, Germany;Leibniz University of Hannover, Hannover, Germany
Venue:
Proceedings of the 2nd ACM SIGHIT International Health Informatics Symposium
Year:
2012

Abstract

In this paper, we study the use of microblogs as a source of information for medical intelligence gathering. The large volume of irrelevant data in microblogs requires sophisticated filtering methods to identify only the relevant postings. Microblogs are characteristically sparse and noisy, which calls for additional care when selecting features for automatically classifying postings by their relevance to medical intelligence gathering. We analyze which features are well suited to this task. The objective of this work is threefold: 1) specifying annotation guidelines for creating a dataset for microblog classification, 2) studying the characteristics of tweets in order to decide on a well-suited feature set, and 3) using that feature set in an automatic classification system for relevance filtering of microblogs. The quality of the classifier is assessed in experiments with various feature sets. The evaluation shows that, despite the challenging characteristics of microblogs, the classifier achieves accuracy values of up to 89%. One main outcome of this work is a dataset of annotated Twitter data that can be used as a "gold standard" benchmark for further research in this domain.
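
The relevance-filtering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' system: the abstract does not name the classifier or the final feature set, so the sketch substitutes a small multinomial Naive Bayes model over lowercased unigrams plus two surface flags (URL and mention). The names `extract_features` and `NaiveBayes`, the flag tokens, and the four training tweets are all hypothetical and invented purely for illustration.

```python
import math
import re
from collections import Counter, defaultdict

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
TOKEN_RE = re.compile(r"[a-z']+")


def extract_features(text):
    """Return unigrams plus surface flags (a hypothetical feature set)."""
    lowered = text.lower()
    features = []
    if URL_RE.search(lowered):
        features.append("__HAS_URL__")
    if MENTION_RE.search(lowered):
        features.append("__HAS_MENTION__")
    # Remove URLs and mentions so they do not pollute the unigram features.
    cleaned = MENTION_RE.sub(" ", URL_RE.sub(" ", lowered))
    features.extend(TOKEN_RE.findall(cleaned))
    return features


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (an assumed stand-in)."""

    def __init__(self):
        self.label_counts = Counter()
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.label_counts[label] += 1
            for feature in extract_features(text):
                self.feature_counts[label][feature] += 1
                self.vocab.add(feature)
        return self

    def predict(self, text):
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.label_counts.items():
            score = math.log(doc_count / total_docs)
            denom = sum(self.feature_counts[label].values()) + len(self.vocab)
            for feature in extract_features(text):
                if feature in self.vocab:  # features unseen in training are skipped
                    count = self.feature_counts[label][feature]
                    score += math.log((count + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy tweets; a real experiment would use the annotated dataset.
train_texts = [
    "I have had a fever and a cough since Monday",
    "my son is sick with the flu and a high fever",
    "bieber fever at the concert tonight",
    "this new song is sick, love it",
]
train_labels = ["relevant", "relevant", "irrelevant", "irrelevant"]

clf = NaiveBayes().fit(train_texts, train_labels)
```

Calling `clf.predict("home with flu and a cough")` then assigns a new posting to one of the two classes. The ambiguous words "fever" and "sick" appear in both classes of the toy data, which is the kind of noise that makes feature selection matter for short, sparse postings.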