Word frequency approximation for chinese without using manually-annotated corpus

Authors:
Maosong Sun;Zhengcao Zhang;Benjamin Ka-Yin T’sou;Huaming Lu
Affiliations:
The State Key Laboratory of Intelligent Technology and Systems, Department of Computer Science and Technology, Tsinghua University, Beijing, China;The State Key Laboratory of Intelligent Technology and Systems, Department of Computer Science and Technology, Tsinghua University, Beijing, China;Language Information Sciences Research Center, City University of Hong Kong;School of Business, Beijing Institute of Machinery, Beijing, China
Venue:
CICLing'06 Proceedings of the 7th international conference on Computational Linguistics and Intelligent Text Processing
Year:
2006

Citing 2
Cited 1

Chinese word segmentation without using lexicon and hand-crafted training data

COLING '98 Proceedings of the 17th international conference on Computational linguistics - Volume 2
The first international Chinese word segmentation Bakeoff

SIGHAN '03 Proceedings of the second SIGHAN workshop on Chinese language processing - Volume 17

Word frequency approximation for chinese using raw, MM-Segmented and manually segmented corpora

ICCPOL'06 Proceedings of the 21st international conference on Computer Processing of Oriental Languages: beyond the orient: the research challenges ahead

Quantified Score

Hi-index	0.00

Visualization

Abstract

Word frequencies play important roles in a variety of NLP-related applications. Word frequency estimation for Chinese is a big challenge due to characteristics of Chinese, in particular word-formation and word segmentation. This paper concerns the issue of word frequency estimation in the condition that we only have a Chinese wordlist and a raw Chinese corpus with arbitrarily large size, and do not perform any manual annotation to the corpus. Several realistic schemes for approximating word frequencies under the framework of STR (frequency of string of characters as an approximation of word frequency) and MM (Maximal matching) are presented. Large-scale experiments indicate that the proposed scheme, MinMaxMM, can significantly benefit the estimation of word frequencies, though its performance is still not very satisfactory in some cases.