A Study on Multi-word Extraction from Chinese Documents

Authors:
Wen Zhang;Taketoshi Yoshida;Xijin Tang
Affiliations:
School of Knowledge Science, Japan Advanced Institute of Science and Technology, Ishikawa, Japan 923-1292;School of Knowledge Science, Japan Advanced Institute of Science and Technology, Ishikawa, Japan 923-1292;Institute of Systems Science, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, P.R. China 100080
Venue:
Advanced Web and Network Technologies, and Applications
Year:
2008

Abstract

A multi-word is a sequence of two or more consecutive words whose meaning draws on the contextual semantics of its individual words. Multi-words have attracted much attention in statistical linguistics and are widely applied in text mining. In this paper, we carry out a series of studies on multi-word extraction from Chinese documents. First, we propose a new statistical method, augmented mutual information (AMI), for measuring the dependency between words. Experimental results show that the AMI method achieves an average recall of 80%, with a precision of about 20%-30%. Second, we attempt to use the variance of the occurrence frequencies of the individual words in a multi-word candidate to deal with the rare occurrence problem, but the experimental results do not validate the effectiveness of variance. Third, we develop a syntactic method based on the lexical regularities of Chinese multi-words to extract multi-words from Chinese documents. Experimental results show that this syntactic method achieves a higher average precision (0.5521) than the AMI method, but it cannot produce a comparable recall. Finally, we discuss the prospects for a breakthrough by combining statistical and syntactic methods.
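
The abstract does not give the AMI formula, so the sketch below does not attempt to reproduce it. As a rough illustration of the general idea of a statistical word-dependency score, it computes plain pointwise mutual information (PMI) for adjacent word pairs and evaluates an extracted list against a reference list with precision and recall. The function names `pmi_scores` and `precision_recall`, the `min_count` threshold, and the assumption that the text has already been segmented into words are all hypothetical choices made for this example, not details taken from the paper.

```python
import math
from collections import Counter


def pmi_scores(tokens, min_count=2):
    """Score adjacent word pairs by plain PMI (not the paper's AMI).

    `tokens` is assumed to be an already segmented list of words.
    Pairs seen fewer than `min_count` times are skipped.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        # PMI = log2( P(x, y) / (P(x) * P(y)) )
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores


def precision_recall(extracted, gold):
    """Precision and recall of extracted candidates against a reference set."""
    extracted, gold = set(extracted), set(gold)
    hits = len(extracted & gold)
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall
```

A candidate list could then be formed by keeping the pairs whose score exceeds some threshold and passing them, together with a manually compiled reference list, to `precision_recall`. Lowering the threshold would typically raise recall at the cost of precision, which is the kind of trade-off the reported figures reflect.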