A Google-based statistical acquisition model of Chinese lexical concepts

  • Authors:
  • Jiayu Zhou;Shi Wang;Cungen Cao

  • Affiliations:
  • Department of Computer Information and Technology, Beijing Jiaotong University, Beijing, China;Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China and Graduate School of Chinese Academy of Sciences, Beijing, China;Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China

  • Venue:
  • KSEM'07: Proceedings of the 2nd International Conference on Knowledge Science, Engineering and Management
  • Year:
  • 2007

Abstract

In this paper we propose a statistical model of Chinese lexical concepts based on the Google search engine and use it to distinguish concepts from ordinary chunks. First, we use Google to learn concept boundary words, which can be regarded as a statistical signature of concepts in large-scale corpora. The underlying linguistic hypothesis is that if a chunk is a lexical concept, certain "intimate" words tend to abut its "head" and its "tail". Second, we construct a classifier based on the concept boundary words and use it to distinguish concepts from chunks. We consider the conditional probability, the frequency, and the entropy of the concept boundary words, and propose three attribute models for building the classifier. Experiments comparing the three resulting classifiers show that the best method validates concepts with an accuracy of 90.661%.
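
The feature set described in the abstract (conditional probability, frequency, and entropy of boundary words) can be illustrated with a minimal sketch. The function names and toy contexts below are hypothetical; the paper's actual Google querying, boundary-word learning, and classifier construction are not reproduced here, and this is only one plausible way to turn boundary-word observations into attributes.

```python
import math
from collections import Counter

def boundary_entropy(context_words):
    """Shannon entropy of the empirical distribution of boundary words.

    `context_words` is a list of words observed immediately before (or after)
    a candidate chunk in web-search snippets; the retrieval step is not shown.
    """
    counts = Counter(context_words)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def boundary_features(left_context, right_context):
    """Frequency and entropy attributes for a candidate chunk's boundaries.

    Mirrors the abstract's hypothesis: a true lexical concept tends to have a
    small set of "intimate" words abutting its head and tail, which shows up
    as a high peak boundary-word probability and low boundary entropy.
    """
    features = {}
    for side, ctx in (("left", left_context), ("right", right_context)):
        counts = Counter(ctx)
        total = sum(counts.values())
        _, top_count = counts.most_common(1)[0]
        # Conditional probability of the most frequent boundary word given the chunk.
        features[f"{side}_peak_prob"] = top_count / total
        # Raw number of boundary-word observations on this side.
        features[f"{side}_freq"] = total
        # Dispersion of the boundary-word distribution.
        features[f"{side}_entropy"] = boundary_entropy(ctx)
    return features

if __name__ == "__main__":
    # Toy contexts for one candidate chunk; in the paper these would come
    # from Google result snippets containing the chunk.
    left = ["的", "一个", "的", "的", "这个"]
    right = ["是", "是", "和", "是", "中"]
    print(boundary_features(left, right))
```

Such per-chunk attribute vectors could then be fed to any standard classifier to separate lexical concepts from non-concept chunks, in the spirit of the three attribute models the paper compares.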