Classifying Vietnamese disease outbreak reports with important sentences and rich features

Authors:
Son Doan;Nguyen Thi Ngoc Vinh;Tu Minh Phuong
Affiliations:
National Institute of Informatics, Tokyo, Japan;Posts & Telecom. Institute of Technology, Hanoi, Vietnam;Posts & Telecom. Institute of Technology, Hanoi, Vietnam
Venue:
Proceedings of the Third Symposium on Information and Communication Technology
Year:
2012

Abstract

Text classification has been an important field of research since the mid-1990s. It has many applications, one of which is Web-based biosurveillance systems that identify and summarize online disease outbreak reports. In this paper we focus on classifying Vietnamese disease outbreak reports. We investigate important properties of disease outbreak reports, e.g., sentences containing the names of outbreak diseases and locations. Evaluation by 10 runs of 10-fold cross-validation using the Support Vector Machine algorithm shows that using sentences containing outbreak disease names together with their preceding and following sentences, in combination with location features, achieves the best F-score of 86.67%, an improvement of 0.38% over using all raw text. Our results suggest that using important sentences and rich features can improve the performance of Vietnamese disease outbreak text classification.
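
The sentence-selection idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the gazetteers `DISEASE_NAMES` and `LOCATION_NAMES`, the one-sentence window, the punctuation-based sentence splitter, and the syllable-level tokens are all assumptions made for illustration. The sketch keeps each sentence that mentions a disease name plus its immediate neighbours, then adds one binary feature per location found in the kept text.

```python
import re
import unicodedata


def normalize(text):
    """Lowercase and NFC-normalize so Vietnamese diacritics compare reliably."""
    return unicodedata.normalize("NFC", text).lower()


# Hypothetical gazetteers; the lists used in the paper are not given in the abstract.
DISEASE_NAMES = {normalize(n) for n in ("sốt xuất huyết", "cúm gia cầm")}
LOCATION_NAMES = {normalize(n) for n in ("hà nội", "đà nẵng", "cần thơ")}


def split_sentences(text):
    """Naive splitter on ., ! and ? (an assumption; real text needs more care)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def select_important_sentences(sentences, window=1):
    """Keep sentences naming a disease, plus `window` neighbours on each side."""
    keep = set()
    for i, sentence in enumerate(sentences):
        low = normalize(sentence)
        if any(name in low for name in DISEASE_NAMES):
            start = max(0, i - window)
            stop = min(len(sentences), i + window + 1)
            keep.update(range(start, stop))
    return [sentences[j] for j in sorted(keep)]


def location_features(sentences):
    """One binary feature per gazetteer location found in the given sentences."""
    text = normalize(" ".join(sentences))
    return {f"LOC={loc}": 1 for loc in sorted(LOCATION_NAMES) if loc in text}


def build_features(text):
    """Token counts from the selected sentences, plus location features."""
    selected = select_important_sentences(split_sentences(text))
    feats = {}
    for sentence in selected:
        # Tokens are whitespace-level syllables, not segmented Vietnamese words.
        for tok in re.findall(r"\w+", normalize(sentence)):
            key = f"W={tok}"
            feats[key] = feats.get(key, 0) + 1
    feats.update(location_features(selected))
    return selected, feats


report = (
    "Sáng nay trời mưa. Bộ Y tế ghi nhận ca sốt xuất huyết mới. "
    "Bệnh nhân sống tại Hà Nội. Giá vàng tăng nhẹ. Đội bóng thắng trận."
)
selected, feats = build_features(report)
print(len(selected), sorted(k for k in feats if k.startswith("LOC=")))
```

The resulting feature dictionaries could then be vectorized and passed to any SVM implementation and scored with repeated 10-fold cross-validation. That step is omitted here because the abstract does not specify the toolkit or its settings.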