A Hybrid Approach of Text Segmentation Based on Sensitive Word Concept for NLP

  • Authors:
  • Fuji Ren


  • Venue:
  • CICLing '01 Proceedings of the Second International Conference on Computational Linguistics and Intelligent Text Processing
  • Year:
  • 2001


Abstract

Natural language processing tasks, such as text checking and correction, machine translation, and information retrieval, usually start from words. Identifying words in Indo-European languages is a trivial task. However, this problem, known as text segmentation, has been, and still is, a bottleneck for various Asian languages, such as Chinese. There have been two main groups of approaches to Chinese segmentation: dictionary-based approaches and statistical approaches. However, both approaches have difficulty dealing with some Chinese text. To address these difficulties, we propose a hybrid approach to Chinese text segmentation based on the concept of sensitive words. Sensitive words are compound words whose syntactic category differs from those of their components; depending on how the text is segmented, a sensitive word may play different roles, leading to significantly different syntactic structures. In this paper, we first explain the concept of sensitive words and their efficacy in text segmentation, and then describe the hybrid approach, which combines a rule-based method and a probability-based method using the concept of sensitive words. Our experimental results show that the presented approach addresses the text segmentation problem effectively.
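To make the dictionary-based baseline concrete, here is a minimal sketch of forward maximum matching, a classic dictionary-based Chinese segmenter (an illustration of the general approach, not the paper's own algorithm; the dictionary and example string are hypothetical). It greedily takes the longest dictionary word starting at each position, which is exactly where an ambiguous compound can force a different split depending on the lexicon.

```python
# Forward maximum matching: a simple dictionary-based segmenter.
# Hypothetical illustration of the dictionary-based family of approaches
# discussed in the abstract, not the hybrid method proposed by the paper.

def fmm_segment(text, dictionary):
    """Segment `text` greedily, preferring the longest dictionary match."""
    max_len = max(map(len, dictionary))  # longest word in the lexicon
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a match is found;
        # a single character is always accepted as a fallback token.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += length
                break
    return tokens

# Toy lexicon: whether the 5-character compound is listed changes the split.
with_compound = {"发展", "中国", "国家", "发展中国家"}
without_compound = {"发展", "中国", "国家"}
print(fmm_segment("发展中国家", with_compound))     # → ['发展中国家']
print(fmm_segment("发展中国家", without_compound))  # → ['发展', '中国', '家']
```

The two runs above illustrate the sensitivity the abstract describes: the same string segments differently depending on the lexicon, yielding different syntactic structures downstream.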