Improving effectiveness of mutual information for substantival multiword expression extraction

  • Authors:
  • Wen Zhang; Taketoshi Yoshida; Xijin Tang; Tu-Bao Ho

  • Affiliations:
  • School of Knowledge Science, Japan Advanced Institute of Science and Technology, 1-1 Asahidai, Tatsunokuchi, Ishikawa 923-1292, Japan and Lab for Internet Software Technologies, Institute of Softw ...; School of Knowledge Science, Japan Advanced Institute of Science and Technology, 1-1 Asahidai, Tatsunokuchi, Ishikawa 923-1292, Japan; Institute of Systems Science, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100080, PR China; School of Knowledge Science, Japan Advanced Institute of Science and Technology, 1-1 Asahidai, Tatsunokuchi, Ishikawa 923-1292, Japan

  • Venue:
  • Expert Systems with Applications: An International Journal
  • Year:
  • 2009

Quantified Score

Hi-index 12.05

Abstract

One deficiency of mutual information is its poor capacity to measure the association of words with asymmetrical co-occurrence, which accounts for a large share of multiword expressions in text. Moreover, threshold setting, which is decisive for the practical success of mutual information in multiword extraction, requires many parameters to be predefined manually when extracting multiword expressions with different numbers of individual words. In this paper, we propose a new method, EMICO (Enhanced Mutual Information and Collocation Optimization), to extract substantival multiword expressions from text. Specifically, enhanced mutual information is proposed to measure the association of words, and collocation optimization is proposed to automatically determine the number of individual words contained in a multiword expression when the multiword expression occurs in a candidate set. Our experiments showed that EMICO significantly improves the performance of substantival multiword expression extraction in comparison with a classic extraction method based on mutual information.
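
For readers unfamiliar with the baseline, the following is a minimal sketch of classic pointwise mutual information for adjacent word pairs. It is not the EMICO method, whose formulas are not given in this abstract; the function names `pmi` and `bigram_pmi`, the bigram-based probability estimates, and the example counts are all assumptions made for illustration. The two calls at the end show one plausible reading of the asymmetry problem: the pair occurs equally often in both cases, but the score falls sharply when one of the two words also appears frequently elsewhere.

```python
import math
from collections import Counter


def pmi(pair_count, x_count, y_count, total):
    """Pointwise mutual information: log2(P(x, y) / (P(x) * P(y))).

    All probabilities are estimated as count / total, so the ratio
    simplifies to pair_count * total / (x_count * y_count).
    """
    return math.log2(pair_count * total / (x_count * y_count))


def bigram_pmi(tokens):
    """Score every adjacent word pair in a token list by PMI (illustrative helper)."""
    pairs = list(zip(tokens, tokens[1:]))
    pair_counts = Counter(pairs)
    left_counts = Counter(x for x, _ in pairs)   # word as first element of a pair
    right_counts = Counter(y for _, y in pairs)  # word as second element of a pair
    n = len(pairs)
    return {
        (x, y): pmi(c, left_counts[x], right_counts[y], n)
        for (x, y), c in pair_counts.items()
    }


# Hypothetical counts: the pair occurs 10 times out of 1000 in both cases.
symmetric = pmi(pair_count=10, x_count=10, y_count=10, total=1000)
# Here the second word also occurs 490 times without the first word.
asymmetric = pmi(pair_count=10, x_count=10, y_count=500, total=1000)
```

Under these assumed counts the symmetric pair scores log2(100) while the asymmetric pair scores only 1 bit, so a single fixed threshold could accept the first and reject the second even though both pairs are equally frequent.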