Natural language processing tasks, such as text checking and correction, machine translation, and information retrieval, usually start from words. Identifying words in Indo-European languages is a trivial task. However, this problem, known as text segmentation, has been and remains a bottleneck for various Asian languages, such as Chinese. There have been two main groups of approaches to Chinese segmentation: dictionary-based approaches and statistical approaches. Both, however, have difficulty dealing with certain Chinese text. To address these difficulties, we propose a hybrid approach to Chinese text segmentation based on the concept of sensitive words. Sensitive words are compound words whose syntactic category differs from those of their components. Depending on the segmentation, a sensitive word may play different roles, leading to significantly different syntactic structures. In this paper, we first explain the concept of sensitive words and their efficacy in text segmentation, and then describe the hybrid approach that combines a rule-based method and a probability-based method using the concept of sensitive words. Our experimental results show that the proposed approach addresses these text segmentation problems effectively.
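To make the dictionary-based family of approaches mentioned above concrete, the sketch below implements forward maximum matching, a classic greedy dictionary-based segmenter. It is an illustrative baseline only, not the paper's hybrid method; the dictionary and inputs are invented for the example.

```python
def fmm_segment(text, dictionary, max_len=4):
    """Segment text by greedily matching the longest dictionary word
    at each position (forward maximum matching).

    Illustrative only: a real segmenter would use a large dictionary
    and, as the paper argues, still mis-handles ambiguous spans such
    as those involving sensitive words.
    """
    result = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking to a single character.
        for length in range(min(max_len, len(text) - i), 0, -1):
            word = text[i:i + length]
            if length == 1 or word in dictionary:
                # Single characters are always accepted as a fallback.
                result.append(word)
                i += length
                break
    return result
```

Because matching is greedy, a longer dictionary entry always wins over a shorter prefix, which is precisely where segmentation ambiguities arise: a different split of the same characters can yield a different syntactic structure, the situation the paper's sensitive-word concept targets.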