A Study on Multi-word Extraction from Chinese Documents

  • Authors:
  • Wen Zhang;Taketoshi Yoshida;Xijin Tang

  • Affiliations:
  • School of Knowledge Science, Japan Advanced Institute of Science and Technology, Ishikawa, Japan 923-1292;School of Knowledge Science, Japan Advanced Institute of Science and Technology, Ishikawa, Japan 923-1292;Institute of Systems Science, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, P.R. China 100080

  • Venue:
  • Advanced Web and Network Technologies, and Applications
  • Year:
  • 2008

Abstract

As a sequence of two or more consecutive words that carries the contextual semantics of its individual words, the multi-word attracts much attention in statistical linguistics and has extensive applications in text mining. In this paper, we carry out a series of studies on multi-word extraction from Chinese documents. First, we propose a new statistical method, augmented mutual information (AMI), to measure the dependency between words. Experimental results show that the AMI method achieves an average recall of 80%, with a precision of about 20%-30%. Second, we attempt to use the variance of the occurrence frequencies of the individual words in a multi-word candidate to deal with the rare-occurrence problem, but the experimental results do not validate the effectiveness of variance. Third, we develop a syntactic method based on the lexical regularities of Chinese multi-words to extract multi-words from Chinese documents. Experimental results show that this syntactic method achieves a higher average precision (0.5521) than the AMI method, but it cannot achieve a comparable recall. Finally, we discuss a possible breakthrough from combining statistical and syntactic methods.
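For readers unfamiliar with this kind of pipeline, the following is a minimal sketch of scoring adjacent word pairs by statistical dependency and then evaluating the extracted candidates by precision and recall. It uses standard pointwise mutual information as a stand-in, since the abstract does not give the AMI formula; it is not the authors' method. The function names `pmi_scores` and `precision_recall`, the toy token list, and the gold set are all assumptions made for illustration, and real Chinese input would first need word segmentation.

```python
import math
from collections import Counter


def pmi_scores(tokens):
    """Score each adjacent word pair by standard pointwise mutual information.

    NOTE: this is plain PMI, used as a stand-in; it is NOT the AMI measure
    from the paper, whose formula is not given in the abstract.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


def precision_recall(extracted, gold):
    """Precision and recall of extracted candidates against a gold set."""
    extracted, gold = set(extracted), set(gold)
    hits = len(extracted & gold)
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical toy input: already-segmented tokens and a hand-made gold set.
tokens = ["text", "mining", "uses", "text", "mining", "tools"]
scores = pmi_scores(tokens)
candidates = [pair for pair, score in scores.items() if score > 1.0]
gold = [("text", "mining")]
precision, recall = precision_recall(candidates, gold)
```

Lowering the score threshold admits more candidates, which tends to raise recall at the cost of precision; that is the same trade-off the abstract reports between the statistical and syntactic methods.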