In this paper we propose a statistical model of Chinese lexical concepts based on the Google search engine and use it to distinguish concepts from chunks. First, we use Google to learn concept boundary words, which can be seen as a statistical feature of concepts in large-scale corpora. Our intuitive linguistic hypothesis is that if a chunk is a lexical concept, certain "intimate" words must occur adjacent to its "head" and its "tail". Second, we construct a classifier from the concept boundary words and use it to distinguish concepts from chunks. We consider the conditional probability, the frequency, and the entropy of the concept boundary words and propose three attribute models for building the classifier. Experiments compare the three resulting classifiers and show that the best method validates concepts with an accuracy of 90.661%.
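
The three attributes named above (frequency, conditional probability, and entropy of boundary words) can be illustrated with a small sketch. This is not the paper's model: the function name `boundary_features`, the relative-frequency estimate of the conditional probability, and the toy neighbor list are all assumptions made for illustration, and a hard-coded list stands in for counts that would come from a search engine.

```python
import math
from collections import Counter


def boundary_features(neighbors):
    """Summarize the words observed next to one side of a candidate chunk.

    `neighbors` is a list of words seen immediately before the chunk's
    "head" (or after its "tail"). Hypothetical helper, for illustration only.
    Returns (frequency table, conditional probabilities, entropy in bits).
    """
    counts = Counter(neighbors)
    total = sum(counts.values())
    if total == 0:
        # No observations: nothing to estimate.
        return counts, {}, 0.0
    # Assumption: P(word | boundary position) is estimated by relative frequency.
    probs = {word: c / total for word, c in counts.items()}
    # Shannon entropy of the neighbor distribution.
    entropy = -sum(p * math.log2(p) for p in probs.values())
    return counts, probs, entropy


# Toy data: words observed just before a candidate chunk (made up).
left_neighbors = ["的", "的", "是", "在"]
counts, probs, entropy = boundary_features(left_neighbors)
print(counts["的"], probs["的"], entropy)
```

In a setup like this, the three statistics would be computed for both the head and the tail of each chunk and passed to a classifier as attributes; how the paper actually combines them is not specified in the abstract.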