Topic-Based Vietnamese News Document Filtering in the BioCaster Project

  • Authors:
  • Vu Hoang; Nguyen Nguyen; Dien Dinh; Nigel Collier

  • Venue:
  • ALPIT '07 Proceedings of the Sixth International Conference on Advanced Language Processing and Web Information Technology (ALPIT 2007)
  • Year:
  • 2007

Abstract

In this paper, we describe a topic-based Vietnamese news document filtering (VTDF) system in the BioCaster Project, which automatically classifies news documents from a wide variety of sources into topics relevant to disease outbreak detection. Given the very large number of news reports that must be analyzed each day, VTDF is a crucial preprocessing step that reduces the burden of semantic annotation. We present two approaches to Vietnamese document classification for use in the VTDF system: Bag of Words (BOW) and statistical N-gram language modeling (N-Gram). Evaluating these two widely used approaches on our task, we show that N-Gram achieved an average accuracy of 95% with an average filtering time of 79 minutes for about 14,000 documents (roughly 3 docs/sec).
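
To make the N-Gram approach named in the abstract concrete, the following is a minimal sketch of topic classification with one n-gram language model per topic. It is an illustration under stated assumptions, not the authors' implementation: the class name `NGramTopicClassifier`, the whitespace tokenization (which yields syllables rather than segmented words for Vietnamese), the add-one smoothing, and the toy English training documents are all hypothetical choices made here.

```python
import math
from collections import Counter, defaultdict


class NGramTopicClassifier:
    """Toy classifier: one add-one-smoothed n-gram model per topic (hypothetical)."""

    def __init__(self, n=2):
        self.n = n
        self.ngrams = defaultdict(Counter)    # topic -> n-gram counts
        self.contexts = defaultdict(Counter)  # topic -> (n-1)-gram context counts
        self.vocab = set()

    def _tokens(self, text):
        # Assumption: whitespace tokens; real Vietnamese text may need word segmentation.
        pad = ["<s>"] * (self.n - 1)
        return pad + text.lower().split() + ["</s>"]

    def _grams(self, text):
        toks = self._tokens(text)
        return [tuple(toks[i:i + self.n]) for i in range(len(toks) - self.n + 1)]

    def fit(self, docs):
        """docs: iterable of (text, topic) pairs."""
        for text, topic in docs:
            self.vocab.update(self._tokens(text))
            for g in self._grams(text):
                self.ngrams[topic][g] += 1
                self.contexts[topic][g[:-1]] += 1
        return self

    def log_prob(self, text, topic):
        # Add-one (Laplace) smoothing over the shared vocabulary.
        v = len(self.vocab)
        return sum(
            math.log((self.ngrams[topic][g] + 1) / (self.contexts[topic][g[:-1]] + v))
            for g in self._grams(text)
        )

    def predict(self, text):
        # Pick the topic whose language model assigns the text the highest likelihood.
        return max(self.ngrams, key=lambda topic: self.log_prob(text, topic))


# Hypothetical toy data, for illustration only.
train = [
    ("bird flu outbreak reported in province", "disease"),
    ("new flu cases confirmed by hospital", "disease"),
    ("football team wins national cup", "other"),
    ("stock market closes higher today", "other"),
]
clf = NGramTopicClassifier(n=2).fit(train)
label = clf.predict("flu outbreak confirmed in province")
```

Scoring in log space avoids numeric underflow on long documents. A Bag of Words classifier would instead discard word order and treat each document as a vector of term counts.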