Statistical Extraction and Comparison of Pivot Words for Bilingual Lexicon Extension

Authors:
Daniel Andrade;Takuya Matsuzaki;Jun’ichi Tsujii
Affiliations:
University of Tokyo;University of Tokyo;University of Tokyo
Venue:
ACM Transactions on Asian Language Information Processing (TALIP)
Year:
2012

Citing 22
Cited 0

Foundations of statistical natural language processing

Foundations of statistical natural language processing
Multilingual Information Access

ESSIR '00 Proceedings of the Third European Summer-School on Lectures on Information Retrieval-Revised Lectures
A Statistical View on Bilingual Lexicon Extraction: From Parallel Corpora to Non-parallel Corpora

AMTA '98 Proceedings of the Third Conference of the Association for Machine Translation in the Americas on Machine Translation and the Information Soup
Accurate methods for the statistics of surprise and coincidence

Computational Linguistics - Special issue on using large corpora: I
Automatic identification of word translations from unrelated English and German corpora

ACL '99 Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics
Effect of cross-language IR in bilingual lexicon acquisition from comparable corpora

EACL '03 Proceedings of the tenth conference on European chapter of the Association for Computational Linguistics - Volume 1
Looking for candidate translational equivalents in specialized, comparable corpora

COLING '02 Proceedings of the 19th international conference on Computational linguistics - Volume 2
An approach based on multilingual thesauri and model combination for bilingual lexicon extraction

COLING '02 Proceedings of the 19th international conference on Computational linguistics - Volume 1
Learning a translation lexicon from monolingual corpora

ULA '02 Proceedings of the ACL-02 workshop on Unsupervised lexical acquisition - Volume 9
A geometric view on bilingual lexicon extraction from comparable corpora

ACL '04 Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics
Online large-margin training of dependency parsers

ACL '05 Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics
A discriminative framework for bilingual word alignment

HLT '05 Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing
Finding translations for low-frequency words in comparable corpora

Machine Translation
Improving translation lexicon induction from monolingual corpora via dependency contexts and part-of-speech equivalences

CoNLL '09 Proceedings of the Thirteenth Conference on Computational Natural Language Learning
A discriminative candidate generator for string transformations

EMNLP '08 Proceedings of the Conference on Empirical Methods in Natural Language Processing
Learning Spanish-Galician translation equivalents using a comparable corpus and a bilingual dictionary

CICLing'08 Proceedings of the 9th international conference on Computational linguistics and intelligent text processing
Robust measurement and comparison of context similarity for finding translation pairs

COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics
Revisiting context-based projection methods for term-translation spotting in comparable corpora

COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics
Bilingual lexicon extraction from comparable corpora using in-domain terms

COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics: Posters
A linguistically grounded graph model for bilingual lexicon extraction

COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics: Posters
Effective use of dependency structure for bilingual lexicon creation

CICLing'11 Proceedings of the 12th international conference on Computational linguistics and intelligent text processing - Volume Part II
Developing a robust part-of-speech tagger for biomedical text

PCI'05 Proceedings of the 10th Panhellenic conference on Advances in Informatics

Quantified Score

Hi-index	0.00

Visualization

Abstract

Bilingual dictionaries can be automatically extended by new translations using comparable corpora. The general idea is based on the assumption that similar words have similar contexts across languages. However, previous studies have mainly focused on Indo-European languages, or use only a bag-of-words model to describe the context. Furthermore, we argue that it is helpful to extract only the statistically significant context, instead of using all context. The present approach addresses these issues in the following manner. First, based on the context of a word with an unknown translation (query word), we extract salient pivot words. Pivot words are words for which a translation is already available in a bilingual dictionary. For the extraction of salient pivot words, we use a Bayesian estimation of the point-wise mutual information to measure statistical significance. In the second step, we match these pivot words across languages to identify translation candidates for the query word. We therefore calculate a similarity score between the query word and a translation candidate using the probability that the same pivots will be extracted for both the query word and the translation candidate. The proposed method uses several context positions, namely, a bag-of-words of one sentence, and the successors, predecessors, and siblings with respect to the dependency parse tree of the sentence. In order to make these context positions comparable across Japanese and English, which are unrelated languages, we use several heuristics to adjust the dependency trees appropriately. We demonstrate that the proposed method significantly increases the accuracy of word translations, as compared to previous methods.