Two-Word Collocation Extraction Using Monolingual Word Alignment Method

  • Authors:
  • Zhanyi Liu; Haifeng Wang; Hua Wu; Sheng Li

  • Affiliations:
  • Harbin Institute of Technology and Baidu; Baidu; Baidu; Harbin Institute of Technology

  • Venue:
  • ACM Transactions on Intelligent Systems and Technology (TIST)
  • Year:
  • 2011

Abstract

Statistical bilingual word alignment has been well studied in machine translation. This article adapts the bilingual word alignment algorithm to a monolingual scenario in order to extract collocations from a monolingual corpus, based on the observation that the words in a collocation tend to co-occur in similar contexts, as aligned words do in bilingual word alignment. First, the monolingual corpus is replicated to generate a parallel corpus in which each sentence pair consists of two identical sentences. Next, the monolingual word alignment algorithm is employed to align potentially collocated words. Finally, the aligned word pairs are ranked by their alignment scores, and the candidates with higher scores are extracted as collocations. We conducted experiments on both Chinese and English corpora. Compared to previous approaches that use association measures to extract collocations from co-occurring word pairs within a given window, our method achieves higher precision and recall. According to human evaluation, our method achieves a precision of 62% on a Chinese corpus and 64% on an English corpus. In particular, it can extract collocations with longer spans, achieving a higher precision of 83% on the long-span ( 6 words) Chinese collocations.
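
The three steps in the abstract (replicate the corpus, align each sentence with itself, rank the aligned pairs) can be illustrated with a small Python sketch. This is not the authors' implementation. It assumes a simplified IBM Model 1 style EM procedure in which a word position is forbidden from aligning to itself, and it assumes a ranking score equal to the average of the two directional probabilities. The function names `train_monolingual_model1` and `rank_pairs` and the toy corpus are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def train_monolingual_model1(sentences, iterations=10):
    """Estimate t(tgt | src) by EM over sentences paired with themselves.

    Assumption: a simplified IBM Model 1 variant in which position j may
    align to any position i of the same sentence except i == j.
    """
    # Start from a uniform (unnormalised) table over co-occurring pairs.
    t = {}
    for sent in sentences:
        for i, src in enumerate(sent):
            for j, tgt in enumerate(sent):
                if i != j:
                    t[(tgt, src)] = 1.0

    for _ in range(iterations):
        counts = defaultdict(float)
        totals = defaultdict(float)
        for sent in sentences:
            for j, tgt in enumerate(sent):
                # E-step: share one unit of count among the other positions.
                norm = sum(
                    t.get((tgt, sent[i]), 0.0)
                    for i in range(len(sent))
                    if i != j
                )
                if norm == 0.0:
                    continue  # e.g. a one-word sentence
                for i, src in enumerate(sent):
                    if i == j:
                        continue
                    frac = t.get((tgt, src), 0.0) / norm
                    counts[(tgt, src)] += frac
                    totals[src] += frac
        # M-step: renormalise so that sum over tgt of t(tgt | src) is 1.
        t = {pair: c / totals[pair[1]] for pair, c in counts.items()}
    return t


def rank_pairs(t):
    """Rank unordered word pairs by the mean of t(a | b) and t(b | a).

    Assumption: this symmetric average stands in for the alignment score.
    """
    scores = {}
    for (a, b) in t:
        if a == b:
            continue
        key = tuple(sorted((a, b)))
        scores[key] = (t.get((a, b), 0.0) + t.get((b, a), 0.0)) / 2.0
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy corpus; each sentence is implicitly paired with an identical copy.
corpus = [
    ["he", "drinks", "strong", "tea"],
    ["she", "likes", "strong", "tea"],
    ["heavy", "rain", "fell"],
    ["they", "expect", "heavy", "rain"],
]
ranked = rank_pairs(train_monolingual_model1(corpus))
for pair, score in ranked[:5]:
    print(pair, round(score, 3))
```

Because the self-alignment constraint works on positions rather than word types, the sketch needs no co-occurrence window: any two positions in a sentence can be linked, which is consistent with the abstract's point about long-span collocations. A real system would train on a large corpus and would likely use richer alignment models than this one.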