A Google-based statistical acquisition model of Chinese lexical concepts

  • Authors:
  • Jiayu Zhou;Shi Wang;Cungen Cao

  • Affiliations:
  • Department of Computer Information and Technology, Beijing Jiaotong University, Beijing, China;Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China and Graduate School of Chinese Academy of Sciences, Beijing, China;Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China

  • Venue:
  • KSEM'07: Proceedings of the 2nd International Conference on Knowledge Science, Engineering and Management
  • Year:
  • 2007

Abstract

In this paper we propose a statistical model of Chinese lexical concepts based on the Google search engine and use it to distinguish concepts from ordinary chunks. First, we use Google to learn concept boundary words, which can be regarded as a statistical signature of concepts in large-scale corpora. The underlying linguistic hypothesis is that if a chunk is a lexical concept, certain "intimate" words tend to abut its "head" and its "tail". Second, we construct a classifier based on the concept boundary words and use it to distinguish concepts from chunks. We consider the conditional probability, the frequency, and the entropy of the concept boundary words, and propose three attribute models for building the classifier. Experiments comparing the three resulting classifiers show that the best method validates concepts with an accuracy of 90.661%.
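
The feature set described in the abstract (conditional probability, frequency, and entropy of boundary words) can be illustrated with a minimal sketch. The function names and toy contexts below are hypothetical; the paper's actual Google querying, boundary-word learning, and classifier construction are not reproduced here, and this is only one plausible way to turn boundary-word observations into attributes.

```python
import math
from collections import Counter

def boundary_entropy(context_words):
    """Shannon entropy of the empirical distribution of boundary words.

    `context_words` is a list of words observed immediately before (or after)
    a candidate chunk in web-search snippets; the retrieval step is not shown.
    """
    counts = Counter(context_words)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def boundary_features(left_context, right_context):
    """Frequency and entropy attributes for a candidate chunk's boundaries.

    Mirrors the abstract's hypothesis: a true lexical concept tends to have a
    small set of "intimate" words abutting its head and tail, which shows up
    as a high peak boundary-word probability and low boundary entropy.
    """
    features = {}
    for side, ctx in (("left", left_context), ("right", right_context)):
        counts = Counter(ctx)
        total = sum(counts.values())
        _, top_count = counts.most_common(1)[0]
        # Conditional probability of the most frequent boundary word given the chunk.
        features[f"{side}_peak_prob"] = top_count / total
        # Raw number of boundary-word observations on this side.
        features[f"{side}_freq"] = total
        # Dispersion of the boundary-word distribution.
        features[f"{side}_entropy"] = boundary_entropy(ctx)
    return features

if __name__ == "__main__":
    # Toy contexts for one candidate chunk; in the paper these would come
    # from Google result snippets containing the chunk.
    left = ["的", "一个", "的", "的", "这个"]
    right = ["是", "是", "和", "是", "中"]
    print(boundary_features(left, right))
```

Such per-chunk attribute vectors could then be fed to any standard classifier to separate lexical concepts from non-concept chunks, in the spirit of the three attribute models the paper compares.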