The authors propose a heuristic, statistically based method for automatic Chinese text segmentation. The method relies on statistical information about the association between adjacent characters in Chinese text: the mutual information of bi-grams and the significance estimation of tri-grams. A heuristic with six rules then determines the segmentation points in a Chinese sentence. No dictionary is required. Chinese text segmentation is important for Chinese text indexing and thus greatly affects the performance of Chinese information retrieval. Because Chinese text lacks word delimiters, segmenting it is more difficult than segmenting English text. In addition, segmentation ambiguities and out-of-vocabulary words (i.e., unknown words) are the major challenges in Chinese segmentation. Many studies of word segmentation have focused on resolving segmentation ambiguities, while the problem of unknown word identification has drawn little attention. The experimental results show that the proposed heuristic method is promising for segmenting unknown words as well as known words. The authors further investigated the distribution of errors of commission and errors of omission produced by the proposed method and benchmarked it against a previously proposed technique, boundary detection. The heuristic method outperformed the boundary detection method.
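
To make the bi-gram idea concrete, the following is a minimal sketch of dictionary-free segmentation driven by mutual information between adjacent characters. It is not the authors' method: the six heuristic rules and the tri-gram estimate are not given in the abstract and are not reproduced here. The function names (`train_counts`, `mutual_information`, `segment`), the single-threshold split rule, and the threshold value are all assumptions made for illustration.

```python
import math
from collections import Counter


def train_counts(corpus):
    """Count single characters and adjacent character pairs in a list of sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        unigrams.update(sentence)
        bigrams.update(sentence[i:i + 2] for i in range(len(sentence) - 1))
    return unigrams, bigrams


def mutual_information(a, b, unigrams, bigrams):
    """Pointwise mutual information log2(P(ab) / (P(a) * P(b))).

    Returns -inf when the pair (or either character) was never observed.
    """
    pair = bigrams[a + b]
    if pair == 0 or unigrams[a] == 0 or unigrams[b] == 0:
        return float("-inf")
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    p_ab = pair / n_bi
    p_a = unigrams[a] / n_uni
    p_b = unigrams[b] / n_uni
    return math.log2(p_ab / (p_a * p_b))


def segment(sentence, unigrams, bigrams, threshold=2.0):
    """Place a boundary between adjacent characters whose MI is below `threshold`.

    Assumption: a single fixed threshold stands in for the paper's rule set;
    the default value is arbitrary and would need tuning on real data.
    """
    if not sentence:
        return []
    words, current = [], sentence[0]
    for a, b in zip(sentence, sentence[1:]):
        if mutual_information(a, b, unigrams, bigrams) >= threshold:
            current += b          # strongly associated: keep in the same word
        else:
            words.append(current)  # weakly associated: cut here
            current = b
    words.append(current)
    return words
```

Because the counts come only from raw text, the same functions work on Chinese strings without a word list: character pairs that co-occur far more often than chance are kept together, and every other position becomes a boundary. Pairs never seen in the training text are always split under this sketch, which is one reason a practical system would need additional rules.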