Two-Word Collocation Extraction Using Monolingual Word Alignment Method

Authors:
Zhanyi Liu; Haifeng Wang; Hua Wu; Sheng Li
Affiliations:
Harbin Institute of Technology and Baidu; Baidu; Baidu; Harbin Institute of Technology
Venue:
ACM Transactions on Intelligent Systems and Technology (TIST)
Year:
2011

Abstract

Statistical bilingual word alignment has been well studied in machine translation. This article adapts the bilingual word alignment algorithm to a monolingual scenario in order to extract collocations from a monolingual corpus, based on the observation that the words in a collocation tend to co-occur in similar contexts, much as aligned words do in bilingual word alignment. First, the monolingual corpus is replicated to generate a parallel corpus in which each sentence pair consists of two identical sentences. Next, the monolingual word alignment algorithm is employed to align potentially collocated words. Finally, the aligned word pairs are ranked by their alignment scores, and the candidates with higher scores are extracted as collocations. We conducted experiments on Chinese and English corpora. Compared with previous approaches, which apply association measures to word pairs that co-occur within a given window, our method achieves higher precision and recall. According to human evaluation, it achieves a precision of 62% on a Chinese corpus and 64% on an English corpus. In particular, it can extract collocations with longer spans, achieving a higher precision of 83% on long-span (> 6 words) Chinese collocations.
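
The three steps described in the abstract (replicate the corpus, align, rank) can be illustrated with a small Python sketch. This is not the authors' implementation: it assumes a simplified IBM Model 1-style EM procedure as the alignment model, assumes that a word may not align to its own position (otherwise two identical sentences would align trivially), and assumes the ranking score is the average of the two directional alignment probabilities. The function names `train_alignment_probs` and `rank_pairs` and the toy corpus are hypothetical.

```python
from collections import defaultdict


def train_alignment_probs(sentences, iterations=10):
    """Estimate t[(w, v)] ~ p(w | v) with IBM Model 1-style EM.

    Each sentence is paired with a copy of itself (the "replicated"
    parallel corpus). Assumption of this sketch: position i may align
    to any position j != i of the same sentence, never to itself.
    """
    t = defaultdict(lambda: 1.0)  # uniform (unnormalised) starting point
    for _ in range(iterations):
        count = defaultdict(float)  # expected counts for (w, v)
        total = defaultdict(float)  # expected counts for v
        for sent in sentences:
            for i, w in enumerate(sent):
                # Candidate partners: every other position in the sentence.
                partners = [v for j, v in enumerate(sent) if j != i]
                norm = sum(t[(w, v)] for v in partners)
                if norm == 0.0:  # one-word sentence or vanished mass
                    continue
                for v in partners:
                    frac = t[(w, v)] / norm  # E-step: posterior of the link
                    count[(w, v)] += frac
                    total[v] += frac
        # M-step: renormalise the expected counts per source word v.
        t = defaultdict(
            float, {(w, v): c / total[v] for (w, v), c in count.items()}
        )
    return t


def rank_pairs(t):
    """Rank unordered word pairs by the mean of both directional scores."""
    scores = {}
    for (w, v) in list(t):
        if w < v:  # visit each unordered pair once, skip (w, w)
            scores[(w, v)] = (t[(w, v)] + t.get((v, w), 0.0)) / 2
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    # Toy corpus, invented for illustration only.
    corpus = [["heavy", "rain"], ["heavy", "rain", "today"]]
    for pair, score in rank_pairs(train_alignment_probs(corpus)):
        print(pair, round(score, 3))
```

On the toy corpus the pair ("heavy", "rain") is ranked first, because the two words explain each other in every sentence, while "today" has to share its probability mass between both of them. A real system would use a full alignment toolkit and a large corpus; the sketch only shows how alignment probabilities learned from identical sentence pairs can serve as a collocation score.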