Bilingual lexicon generation using non-aligned signatures

Authors:
Daphna Shezaf; Ari Rappoport
Affiliations:
Hebrew University of Jerusalem; Hebrew University of Jerusalem
Venue:
ACL '10 Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics
Year:
2010

Abstract

Bilingual lexicons are fundamental resources. Modern automated lexicon generation methods usually require parallel corpora, which are not available for most language pairs. Lexicons can be generated using non-parallel corpora or a pivot language, but such lexicons are noisy. We present an algorithm for generating a high-quality lexicon from a noisy one, requiring only an independent corpus for each language. Our algorithm introduces non-aligned signatures (NAS), a cross-lingual word context similarity score that avoids the over-constrained and inefficient nature of alignment-based methods. We use NAS to eliminate incorrect translations from the generated lexicon. We evaluate our method by improving the quality of noisy Spanish-Hebrew lexicons generated from two pivot English lexicons. Our algorithm substantially outperforms other lexicon generation methods.
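The pipeline described in the abstract can be illustrated with a minimal Python sketch. The abstract does not define the NAS score, so everything concrete here is an assumption made for illustration: a word's signature is taken to be its top co-occurring words ranked by pointwise mutual information, and the score of a candidate pair is taken to be the number of source signature words that have at least one noisy-lexicon translation in the target word's signature. The function names (`pivot_lexicon`, `signatures`, `nas_score`, `filter_lexicon`) and the transliterated toy words are hypothetical and do not come from the paper.

```python
import math
from collections import Counter, defaultdict


def pivot_lexicon(src_to_pivot, pivot_to_tgt):
    """Compose two lexicons through a pivot language; the result is noisy."""
    lexicon = defaultdict(set)
    for src, pivots in src_to_pivot.items():
        for pivot in pivots:
            lexicon[src].update(pivot_to_tgt.get(pivot, ()))
    return dict(lexicon)


def signatures(sentences, size):
    """Map each word to its `size` sentence-level co-occurrents with highest PMI.

    Assumption: PMI over sentence co-occurrence stands in for whatever
    association measure the original work uses.
    """
    word_count, pair_count = Counter(), Counter()
    for sentence in sentences:
        types = set(sentence)
        word_count.update(types)
        pair_count.update((a, b) for a in types for b in types if a != b)
    total = len(sentences)
    scored = defaultdict(list)
    for (a, b), count in pair_count.items():
        pmi = math.log(count * total / (word_count[a] * word_count[b]))
        scored[a].append((pmi, b))
    return {w: {b for _, b in sorted(c, reverse=True)[:size]}
            for w, c in scored.items()}


def nas_score(src_word, tgt_word, src_sig, tgt_sig, lexicon):
    """Count source signature words with a translation in the target signature.

    No one-to-one alignment between the two signatures is computed.
    """
    tgt_context = tgt_sig.get(tgt_word, set())
    return sum(1 for c in src_sig.get(src_word, set())
               if lexicon.get(c, set()) & tgt_context)


def filter_lexicon(lexicon, src_sig, tgt_sig):
    """Keep, for each source word, only its highest-scoring candidates."""
    cleaned = {}
    for src, candidates in lexicon.items():
        if not candidates:
            continue
        scores = {t: nas_score(src, t, src_sig, tgt_sig, lexicon)
                  for t in candidates}
        best = max(scores.values())
        cleaned[src] = {t for t, s in scores.items() if s == best}
    return cleaned


# Toy Spanish -> English -> Hebrew (transliterated) example, invented data.
noisy = pivot_lexicon(
    {"banco": {"bank"}, "dinero": {"money"}, "cuenta": {"account"}},
    {"bank": {"bank_he", "gada"}, "money": {"kesef"}, "account": {"heshbon"}},
)
src_sig = {"banco": {"dinero", "cuenta"}}
tgt_sig = {"bank_he": {"kesef", "heshbon"}, "gada": {"nahar", "mayim"}}
cleaned = filter_lexicon(noisy, src_sig, tgt_sig)
```

In this toy data the pivot word "bank" yields both a financial and a riverbank sense for "banco"; only the financial candidate shares translated context words with the source signature, so the other is dropped. In practice the two signature tables would be built by calling `signatures` on each monolingual corpus rather than written by hand.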