Classifying Vietnamese disease outbreak reports with important sentences and rich features
Proceedings of the Third Symposium on Information and Communication Technology
In this paper, we describe a topic-based Vietnamese news document filtering (VTDF) system in the BioCaster project, which automatically classifies news documents from a wide variety of sources into topics relevant to disease outbreak detection. Given the very large number of news reports that must be analyzed each day, VTDF is a crucial preprocessing step that reduces the burden of semantic annotation. We present two approaches to the Vietnamese document classification problem for use in the VTDF system: bag-of-words (BOW) and statistical n-gram language modeling (N-Gram). Evaluating these two widely used classification approaches on our task, we show that N-Gram achieves an average accuracy of 95% with an average filtering time of 79 minutes for about 14,000 documents (roughly 3 documents per second).
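
To make the N-Gram approach concrete, the following is a minimal sketch of classification with per-class statistical n-gram language models. It is an illustration under stated assumptions, not the system described in the paper: the abstract does not say whether the n-grams are over characters or words, which order is used, or how the models are smoothed. The sketch assumes character trigrams, add-one smoothing, and a uniform class prior. The names `char_ngrams` and `NGramLMClassifier`, the labels, and the toy training sentences are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of a padded, lowercased string."""
    padded = " " * (n - 1) + text.lower() + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NGramLMClassifier:
    """One add-one-smoothed character n-gram language model per class (assumed setup)."""

    def __init__(self, n=3):
        self.n = n
        self.ngrams = defaultdict(Counter)    # label -> n-gram counts
        self.contexts = defaultdict(Counter)  # label -> (n-1)-character context counts
        self.alphabet = set()                 # characters seen in the predicted position

    def fit(self, docs, labels):
        for text, label in zip(docs, labels):
            for gram in char_ngrams(text, self.n):
                self.ngrams[label][gram] += 1
                self.contexts[label][gram[:-1]] += 1
                self.alphabet.add(gram[-1])
        return self

    def log_score(self, text, label):
        """Log-probability of the text under the language model of one class."""
        v = len(self.alphabet) + 1  # +1 reserves probability mass for unseen characters
        return sum(
            math.log(
                (self.ngrams[label][gram] + 1)
                / (self.contexts[label][gram[:-1]] + v)
            )
            for gram in char_ngrams(text, self.n)
        )

    def predict(self, text):
        # Uniform class prior (an assumption): choose the best-scoring class model.
        return max(self.ngrams, key=lambda label: self.log_score(text, label))


# Toy training data, invented for illustration only.
train_docs = [
    "dịch cúm gia cầm bùng phát tại tỉnh",
    "ca nhiễm sốt xuất huyết tăng nhanh",
    "đội tuyển bóng đá thắng trận chung kết",
    "giá vàng tăng mạnh trên thị trường",
]
train_labels = ["disease", "disease", "other", "other"]

clf = NGramLMClassifier(n=3).fit(train_docs, train_labels)
prediction = clf.predict("dịch sốt xuất huyết bùng phát")
```

A bag-of-words classifier would instead represent each document as word counts and ignore order. Character n-grams have the practical advantage of not requiring a Vietnamese word segmenter, although whether the paper relies on that property is not stated in the abstract.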