An improved fast algorithm of frequent string extracting with no thesaurus

Authors:
Yumeng Zhang;Chuanhan Liu
Affiliations:
Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China and School of Business, Ningbo University, Ningbo, China;Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China
Venue:
MICAI'07 Proceedings of the 6th Mexican International Conference on Advances in Artificial Intelligence
Year:
2007

Abstract

Unlisted word identification is a hot topic in Chinese information processing research. String frequency statistics is a simple and effective method for extracting unlisted words, but existing algorithms cannot meet the speed requirements of large-scale text processing systems. Using the strategies of increasing string length and level-wise scanning, this paper presents a fast algorithm for extracting frequent strings that improves on the string frequency statistics method. The approach requires neither a thesaurus nor word segmentation; instead, it uses average mutual information to decide whether each frequent string is a word. Experiments show that, compared with previous approaches, the algorithm is faster and achieves an accuracy of 91% or higher.
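
The abstract does not give the algorithm's details, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes an Apriori-style pruning rule (a string of length k is counted only if its length k-1 prefix and suffix are both frequent) and assumes that "average mutual information" means pointwise mutual information averaged over every two-way split of the string. The function names, the `min_count`, `max_len`, and `threshold` parameters, and the use of the text length as the probability normalizer are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def frequent_strings(text, min_count=2, max_len=4):
    """Level-wise scan: strings grow by one character per pass.

    Assumed pruning rule: a length-k string is counted only when its
    length-(k-1) prefix and suffix were frequent in the previous pass.
    """
    # Pass 1: single characters.
    frequent = {s: c for s, c in Counter(text).items() if c >= min_count}
    result = dict(frequent)
    for k in range(2, max_len + 1):
        level = Counter()
        for i in range(len(text) - k + 1):
            s = text[i:i + k]
            if s[:-1] in frequent and s[1:] in frequent:
                level[s] += 1
        frequent = {s: c for s, c in level.items() if c >= min_count}
        if not frequent:
            break  # no longer string can be frequent
        result.update(frequent)
    return result


def avg_mutual_information(s, counts, total):
    """Pointwise mutual information averaged over all two-way splits of s.

    Assumption: probabilities are estimated as count / total.
    Requires len(s) >= 2 and every substring of s present in counts.
    """
    p = counts[s] / total
    scores = []
    for i in range(1, len(s)):
        p_left = counts[s[:i]] / total
        p_right = counts[s[i:]] / total
        scores.append(math.log2(p / (p_left * p_right)))
    return sum(scores) / len(scores)


def extract_words(text, min_count=2, max_len=4, threshold=1.0):
    """Keep frequent strings of length >= 2 whose score meets the threshold."""
    counts = frequent_strings(text, min_count, max_len)
    return {
        s: counts[s]
        for s in counts
        if len(s) >= 2
        and avg_mutual_information(s, counts, len(text)) >= threshold
    }
```

Because every substring of a frequent string is itself frequent, the pruning step discards no frequent string, and the mutual information lookup always finds the counts it needs. The threshold would have to be tuned on real data; the value here is a placeholder.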