Extracting terminologically relevant collocations in the translation of Chinese monograph

Authors:
Byeong-Kwu Kang;Bao-Bao Chang;Yi-Rong Chen;Shi-Wen Yu
Affiliations:
The Institute of Computational Linguistics, Peking University, Beijing, China;The Institute of Computational Linguistics, Peking University, Beijing, China;The Institute of Computational Linguistics, Peking University, Beijing, China;The Institute of Computational Linguistics, Peking University, Beijing, China
Venue:
IJCNLP'05 Proceedings of the Second international joint conference on Natural Language Processing
Year:
2005

Abstract

This paper proposes a methodology for extracting terminologically relevant collocations for translation purposes. The basic idea is a hybrid method that combines a statistical method with linguistic rules. The extraction system used in our work operates in three steps: (1) tokenization and POS tagging of the corpus; (2) extraction of multi-word units using a statistical measure; (3) linguistic filtering using syntactic patterns and a stop-word list. The hybrid method with linguistic filters proved suitable for selecting terminological collocations: it considerably improved the precision of the extraction, which was much higher than that of the purely statistical method. In our test, the hybrid method combining the “log-likelihood ratio” and “linguistic rules” performed best. We believe that terminological collocations and phrases extracted in this way could be used effectively either to supplement existing terminological collections or alongside traditional reference works.
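As an illustration only, the three-step pipeline described in the abstract might be sketched in Python as follows. The sketch assumes the corpus has already been tokenized and POS-tagged (step 1), scores adjacent word pairs with Dunning's log-likelihood ratio (step 2), and then applies a POS-pattern filter and a stop-word list (step 3). The function names, tag labels, allowed patterns, stop words, and the `min_count` threshold are all hypothetical; the abstract does not state the authors' actual settings, and the toy tokens are English placeholders rather than Chinese data.

```python
import math
from collections import Counter


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table.

    k11: count of (w1, w2); k12: w1 followed by another word;
    k21: another word followed by w2; k22: all remaining bigrams.
    """
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    g2 = 0.0
    for obs, r, c in ((k11, 0, 0), (k12, 0, 1), (k21, 1, 0), (k22, 1, 1)):
        if obs > 0:  # a zero cell contributes nothing to the sum
            expected = rows[r] * cols[c] / n
            g2 += obs * math.log(obs / expected)
    return 2.0 * g2


def extract_collocations(tagged, allowed_patterns, stop_words, min_count=2):
    """Rank adjacent word pairs from a list of (token, pos) tuples.

    allowed_patterns, stop_words and min_count are illustrative
    assumptions, not values taken from the paper.
    """
    bigrams = Counter(zip(tagged, tagged[1:]))
    first, second = Counter(), Counter()
    for (t1, t2), count in bigrams.items():
        first[t1] += count   # how often t1 opens a bigram
        second[t2] += count  # how often t2 closes a bigram
    n = sum(bigrams.values())

    results = []
    for ((w1, p1), (w2, p2)), k11 in bigrams.items():
        if k11 < min_count:
            continue
        # Step 3: linguistic filtering by POS pattern and stop-word list.
        if (p1, p2) not in allowed_patterns:
            continue
        if w1 in stop_words or w2 in stop_words:
            continue
        # Step 2: statistical score from the contingency table.
        k12 = first[(w1, p1)] - k11
        k21 = second[(w2, p2)] - k11
        k22 = n - k11 - k12 - k21
        results.append(((w1, w2), llr(k11, k12, k21, k22)))
    return sorted(results, key=lambda item: item[1], reverse=True)


# Toy input with hypothetical tags: "n" noun, "p" preposition, "d" determiner.
toy_corpus = [
    ("machine", "n"), ("translation", "n"), ("of", "p"), ("the", "d"),
    ("machine", "n"), ("translation", "n"), ("system", "n"),
]
candidates = extract_collocations(
    toy_corpus, allowed_patterns={("n", "n")}, stop_words={"of", "the"}
)
```

Here the statistical score only ranks the candidates, while the pattern and stop-word checks decide which candidates survive. That mirrors the hybrid idea in the abstract without claiming to reproduce the authors' system.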