Word frequency approximation for Chinese without using manually-annotated corpus

  • Authors:
  • Maosong Sun;Zhengcao Zhang;Benjamin Ka-Yin T’sou;Huaming Lu

  • Affiliations:
  • The State Key Laboratory of Intelligent Technology and Systems, Department of Computer Science and Technology, Tsinghua University, Beijing, China;The State Key Laboratory of Intelligent Technology and Systems, Department of Computer Science and Technology, Tsinghua University, Beijing, China;Language Information Sciences Research Center, City University of Hong Kong;School of Business, Beijing Institute of Machinery, Beijing, China

  • Venue:
  • CICLing'06 Proceedings of the 7th international conference on Computational Linguistics and Intelligent Text Processing
  • Year:
  • 2006

Abstract

Word frequencies play an important role in a variety of NLP applications. Estimating word frequencies for Chinese is challenging because of characteristics of the language, in particular word formation and word segmentation. This paper addresses word frequency estimation under the condition that only a Chinese wordlist and a raw Chinese corpus of arbitrarily large size are available, with no manual annotation of the corpus. Several realistic schemes for approximating word frequencies are presented under the frameworks of STR (the frequency of a character string as an approximation of word frequency) and MM (maximal matching). Large-scale experiments indicate that the proposed scheme, MinMaxMM, can significantly benefit the estimation of word frequencies, though its performance is still not very satisfactory in some cases.
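
To make the two frameworks named in the abstract concrete, the following is a minimal Python sketch of generic STR counting and of forward and backward maximal matching. It is an illustration under stated assumptions, not the authors' implementation: the function names (`str_counts`, `forward_mm`, `backward_mm`, `mm_counts`), the toy wordlist, and the fallback to single characters for unmatched text are all hypothetical choices. The abstract does not define MinMaxMM, so that scheme is not reproduced here.

```python
from collections import Counter


def str_counts(corpus, wordlist):
    """STR: count each word as a raw character string, overlaps included."""
    counts = Counter()
    for word in wordlist:
        start = corpus.find(word)
        while start != -1:
            counts[word] += 1
            start = corpus.find(word, start + 1)
    return counts


def forward_mm(text, wordlist, max_len):
    """Forward maximal matching: scan left to right, take the longest match."""
    tokens, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            # Assumption: an unmatched single character becomes its own token.
            if length == 1 or piece in wordlist:
                tokens.append(piece)
                i += length
                break
    return tokens


def backward_mm(text, wordlist, max_len):
    """Backward maximal matching: scan right to left, take the longest match."""
    tokens, j = [], len(text)
    while j > 0:
        for length in range(min(max_len, j), 0, -1):
            piece = text[j - length:j]
            if length == 1 or piece in wordlist:
                tokens.append(piece)
                j -= length
                break
    tokens.reverse()
    return tokens


def mm_counts(tokens, wordlist):
    """Keep only the tokens that are wordlist entries and count them."""
    return Counter(t for t in tokens if t in wordlist)


# Toy data (hypothetical): a classic segmentation ambiguity.
wordlist = {"研究", "研究生", "生命", "起源"}
text = "研究生命起源"
max_len = max(len(w) for w in wordlist)

print(str_counts(text, wordlist))
print(mm_counts(forward_mm(text, wordlist, max_len), wordlist))
print(mm_counts(backward_mm(text, wordlist, max_len), wordlist))
```

On this toy string, STR counts every wordlist entry once, including both 研究 and 研究生 even though they overlap, while the two matching directions segment the string differently (研究生 / 命 / 起源 versus 研究 / 生命 / 起源). This kind of disagreement is why neither raw string frequency nor a single matching pass yields reliable word frequencies by itself.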