Rare word translation extraction from aligned comparable documents

Authors:
Emmanuel Prochasson; Pascale Fung
Affiliations:
Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong; Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong
Venue:
HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1
Year:
2011

Abstract

We present the first known result of high-precision bilingual extraction of rare words from comparable corpora, using aligned comparable documents and supervised classification. We combine two features in a machine learning approach: a context-vector similarity and a co-occurrence model between words in aligned documents. We test our hypothesis on several language pairs and corpora. We obtain F-measures between 80% and 98% for recognizing and extracting correct translations of rare terms (1 to 5 occurrences). Moreover, we show that our system can be trained on one language pair and tested on a different one, obtaining an F-measure of 77% for the classification of Chinese-English translations using a Spanish-French training corpus. Our method is therefore potentially applicable even to low-resource languages without training data.
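
To make the two features concrete, the following is a minimal sketch of how a candidate word pair might be scored before being passed to a classifier. The abstract does not give the exact formulas, so everything here is an assumption for illustration: cosine similarity over window-based context vectors projected through a small seed dictionary stands in for the context-vector similarity, and a Jaccard overlap of the aligned document pairs in which each word occurs stands in for the co-occurrence model. The function names, the window size, and the toy French-English data are hypothetical, and the classification step itself is not shown.

```python
import math
from collections import Counter


def context_vector(word, docs, window=2):
    """Count the tokens that appear within `window` positions of `word`."""
    vec = Counter()
    for doc in docs:
        for i, tok in enumerate(doc):
            if tok == word:
                lo = max(0, i - window)
                for ctx in doc[lo:i] + doc[i + 1:i + 1 + window]:
                    vec[ctx] += 1
    return vec


def translate_vector(vec, seed):
    """Project a source-language vector into the target language.

    Context words missing from the seed dictionary are dropped.
    """
    out = Counter()
    for w, count in vec.items():
        if w in seed:
            out[seed[w]] += count
    return out


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(c * v.get(k, 0) for k, c in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cooccurrence(src_word, tgt_word, aligned_pairs):
    """Jaccard overlap of the aligned pairs containing each word (assumed measure)."""
    src_ids = {i for i, (s, _) in enumerate(aligned_pairs) if src_word in s}
    tgt_ids = {i for i, (_, t) in enumerate(aligned_pairs) if tgt_word in t}
    union = src_ids | tgt_ids
    return len(src_ids & tgt_ids) / len(union) if union else 0.0


def features(src_word, tgt_word, aligned_pairs, seed):
    """Return (context similarity, co-occurrence score) for one candidate pair."""
    src_docs = [s for s, _ in aligned_pairs]
    tgt_docs = [t for _, t in aligned_pairs]
    src_vec = translate_vector(context_vector(src_word, src_docs), seed)
    tgt_vec = context_vector(tgt_word, tgt_docs)
    return cosine(src_vec, tgt_vec), cooccurrence(src_word, tgt_word, aligned_pairs)


# Hypothetical toy data: two aligned document pairs and a tiny seed dictionary.
pairs = [
    (["le", "chat", "mange", "la", "souris"], ["the", "cat", "eats", "the", "mouse"]),
    (["le", "chien", "dort"], ["the", "dog", "sleeps"]),
]
seed = {"le": "the", "la": "the", "mange": "eats", "dort": "sleeps"}

good = features("chat", "cat", pairs, seed)
bad = features("chat", "dog", pairs, seed)
```

On this toy data the correct pair ("chat", "cat") scores higher than the incorrect pair ("chat", "dog") on both features. In a setup like the one the abstract describes, such two-dimensional feature vectors, labeled as correct or incorrect translations, would be the input to a supervised classifier.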