Unlisted word identification is a hot topic in Chinese information processing research. String frequency statistics is a simple and effective method for extracting unlisted words, but existing algorithms cannot meet the speed requirements of large-scale text processing systems. Using the strategies of increasing string length and level-wise scanning, this paper presents a fast algorithm for extracting frequent strings and improves the string frequency statistical method. The approach requires neither a thesaurus nor word segmentation; instead, it uses average mutual information to decide whether each frequent string is a word. Experiments show that, compared with previous approaches, the algorithm is faster and achieves an accuracy of 91% or higher.
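The general idea can be illustrated with a minimal Python sketch. It is not the paper's algorithm, whose details the abstract does not give. The sketch assumes an Apriori-style level-wise scan in which a string of length k is counted only if both of its length k-1 substrings were frequent at the previous level, and it assumes that "average mutual information" means the pointwise mutual information of a string averaged over all of its two-part splits. The function names, the probability estimate (count divided by text length), and the parameters `min_count`, `max_len`, and `threshold` are all hypothetical.

```python
from collections import Counter
from math import log2


def frequent_strings(text, min_count=2, max_len=4):
    """Level-wise scan with increasing string length (assumed Apriori-style).

    A string of length k is counted only if both its prefix and its suffix
    of length k-1 were frequent at the previous level.
    """
    frequent = {}
    prev = None  # frequent strings found at the previous level
    for k in range(1, max_len + 1):
        counts = Counter()
        for i in range(len(text) - k + 1):
            s = text[i:i + k]
            if prev is None or (s[:-1] in prev and s[1:] in prev):
                counts[s] += 1
        level = {s: c for s, c in counts.items() if c >= min_count}
        if not level:
            break  # no frequent string of length k, so none are longer
        frequent.update(level)
        prev = level
    return frequent


def average_mutual_information(s, frequent, total):
    """Average of log2(p(s) / (p(left) * p(right))) over all splits of s.

    Probabilities are estimated as count / total (an assumption).
    """
    p_s = frequent[s] / total
    scores = []
    for i in range(1, len(s)):
        p_left = frequent[s[:i]] / total
        p_right = frequent[s[i:]] / total
        scores.append(log2(p_s / (p_left * p_right)))
    return sum(scores) / len(scores)


def extract_candidate_words(text, min_count=2, max_len=4, threshold=1.0):
    """Return frequent multi-character strings whose score meets the threshold."""
    frequent = frequent_strings(text, min_count, max_len)
    total = len(text)
    result = {}
    for s in frequent:
        if len(s) < 2:
            continue
        score = average_mutual_information(s, frequent, total)
        if score >= threshold:
            result[s] = score
    return result
```

Because every substring of a frequent string is itself frequent, the counts needed for the mutual information score are already available from the earlier levels, so no dictionary or prior segmentation is consulted. In practice the threshold would have to be tuned on real text.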