Chinese Word Segmentation for Terrorism-Related Contents

  • Authors:
  • Daniel Zeng;Donghua Wei;Michael Chau;Feiyue Wang

  • Affiliations:
  • Institute of Automation, Chinese Academy of Sciences, China and The University of Arizona, Tucson, USA;Institute of Automation, Chinese Academy of Sciences, China;The University of Hong Kong, Hong Kong;Institute of Automation, Chinese Academy of Sciences, China and The University of Arizona, Tucson, USA

  • Venue:
  • PAISI, PACCF and SOCO '08 Proceedings of the IEEE ISI 2008 PAISI, PACCF, and SOCO international workshops on Intelligence and Security Informatics
  • Year:
  • 2008

Abstract

To analyze security- and terrorism-related content in Chinese, documents must first be segmented into words. Chinese word segmentation has been studied extensively, and the two major approaches are statistical and dictionary-based. Purely statistical methods have lower precision, while purely dictionary-based methods cannot handle new words and are limited by the coverage of the dictionary. In this paper, we propose a hybrid method that avoids the limitations of both approaches. By combining a suffix tree and mutual information (MI) with a dictionary, our segmenter, called IASeg, achieves high segmentation accuracy when domain training is available. It identifies new words through MI-based token merging and dictionary update. In addition, with the Improved Bigram method, it can also process N-grams. To evaluate the segmenter, we compare it with the Hylanda and ICTCLAS segmenters on a terrorism-related corpus. The experimental results show that IASeg outperforms both benchmarks in precision and recall.
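The idea of MI-based token merging with dictionary update can be illustrated with a minimal sketch. This is not the IASeg implementation: the suffix tree and the Improved Bigram method are not reproduced, greedy forward maximum matching stands in for the dictionary step, and pointwise mutual information over adjacent tokens stands in for the MI score. The function names, the toy corpus, and the `threshold` and `min_count` values are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def forward_max_match(text, dictionary, max_len=4):
    """Greedy longest-match segmentation (a stand-in for the dictionary step).

    Characters not covered by any dictionary entry become single-character tokens.
    """
    tokens, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if n == 1 or piece in dictionary:
                tokens.append(piece)
                i += n
                break
    return tokens


def adjacent_pmi(segmented):
    """Return (PMI per adjacent token pair, raw pair counts) for a segmented corpus."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in segmented:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    pmi = {
        (a, b): math.log2((c / n_bi) / ((unigrams[a] / n_uni) * (unigrams[b] / n_uni)))
        for (a, b), c in bigrams.items()
    }
    return pmi, bigrams


def update_dictionary(corpus, dictionary, threshold=3.0, min_count=2):
    """Add strongly associated adjacent token pairs to the dictionary as new words.

    threshold and min_count are hypothetical values tuned to the toy corpus below.
    """
    segmented = [forward_max_match(s, dictionary) for s in corpus]
    pmi, counts = adjacent_pmi(segmented)
    new_words = {
        a + b
        for (a, b), score in pmi.items()
        if score >= threshold and counts[(a, b)] >= min_count
    }
    return dictionary | new_words


# Toy domain corpus (hypothetical); "基地" is missing from the seed dictionary.
corpus = ["基地组织发动袭击", "基地组织策划袭击", "恐怖组织发动袭击"]
seed = {"恐怖", "组织", "袭击", "发动", "策划"}

if __name__ == "__main__":
    updated = update_dictionary(corpus, seed)
    print(forward_max_match(corpus[0], seed))     # before the dictionary update
    print(forward_max_match(corpus[0], updated))  # after the dictionary update
```

On this toy corpus the seed dictionary splits the unknown word 基地 into two single characters. Their pair is the only one that clears both the PMI threshold and the minimum count, so it is added to the dictionary and the next segmentation pass keeps it whole. A real system would need far more domain text for the MI estimates to be reliable.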