In this paper, we study the use of microblogs as a source of information for medical intelligence gathering. Because microblogs contain a huge amount of irrelevant data, sophisticated filtering methods are required to identify the relevant postings. Microblog posts are also characteristically sparse and noisy, which calls for careful selection of the features used to classify them automatically by relevance to medical intelligence gathering; we analyze which features are well suited. The objective of this work is threefold: 1) specifying annotation guidelines for creating a dataset for microblog classification, 2) studying the characteristics of tweets in order to decide on a well-suited feature set, and 3) using that feature set in an automatic classification system for relevance filtering of microblogs. The quality of the classifier is assessed in experiments with various feature sets. The evaluation shows that, despite the challenging characteristics of microblogs, the classifier achieves accuracy values of up to 89%. One main outcome of this work is a dataset of annotated Twitter data that can be used as a "gold standard" benchmark for further research in this domain.
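To make the idea of relevance filtering concrete, the following is a minimal sketch of a tweet classifier built on a simple bag-of-words feature set. It is an illustration under stated assumptions, not the system evaluated in this work: the abstract does not name the classifier or the final feature set, so a multinomial Naive Bayes model with Laplace smoothing is swapped in here, and the names `extract_features` and `NaiveBayesFilter`, the placeholder tokens `<url>` and `<user>`, and the four training tweets are all hypothetical.

```python
import math
import re
from collections import Counter


def extract_features(tweet):
    """Tokenize a tweet, mapping URLs and user mentions to placeholder tokens.

    Assumption: this normalization is one plausible way to reduce the
    sparsity and noise of tweets; it is not the paper's feature set.
    """
    text = tweet.lower()
    text = re.sub(r"https?://\S+", " <url> ", text)
    text = re.sub(r"@\w+", " <user> ", text)
    # Hashtags lose their '#' and are kept as ordinary word tokens.
    return re.findall(r"<url>|<user>|[a-z0-9']+", text)


class NaiveBayesFilter:
    """Multinomial Naive Bayes with Laplace smoothing (illustrative stand-in)."""

    def fit(self, tweets, labels):
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)
        self.token_counts = {c: Counter() for c in self.classes}
        for tweet, label in zip(tweets, labels):
            self.token_counts[label].update(extract_features(tweet))
        self.vocab = set().union(*self.token_counts.values())
        return self

    def predict(self, tweet):
        total_docs = sum(self.doc_counts.values())
        tokens = extract_features(tweet)
        scores = {}
        for c in self.classes:
            counts = self.token_counts[c]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(self.doc_counts[c] / total_docs)
            for tok in tokens:
                # Add-one smoothing so unseen tokens do not zero out a class.
                score += math.log((counts[tok] + 1) / denom)
            scores[c] = score
        return max(scores, key=scores.get)


# Hypothetical toy data, invented for illustration only.
train_tweets = [
    "Three new cases of swine flu confirmed at the local hospital",
    "Health officials report a measles outbreak in the region http://example.org/news",
    "I have bieber fever lol",
    "This monday is killing me, need coffee",
]
train_labels = ["relevant", "relevant", "irrelevant", "irrelevant"]

model = NaiveBayesFilter().fit(train_tweets, train_labels)
label = model.predict("New flu cases confirmed by health officials")
```

In this sketch, comparing feature sets amounts to swapping `extract_features` for a variant (for example, one that also emits character n-grams) and measuring accuracy against a held-out portion of an annotated dataset.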