Methods for extracting and classifying pairs of cognates and false friends

Authors:
Ruslan Mitkov; Viktor Pekar; Dimitar Blagoev; Andrea Mulloni
Affiliations:
Research Institute for Information and Language Processing, University of Wolverhampton, Wolverhampton, UK WV1 1SB; Research Institute for Information and Language Processing, University of Wolverhampton, Wolverhampton, UK WV1 1SB; Mathematics and Informatics Department, University of Plovdiv, Plovdiv, Bulgaria 4003; Research Institute for Information and Language Processing, University of Wolverhampton, Wolverhampton, UK WV1 1SB
Venue:
Machine Translation
Year:
2007

Abstract

Cognate identification has received considerable attention in Natural Language Processing, but the identification of false friends remains under-researched. This paper proposes novel methods for automatically identifying both cognates and false friends in comparable bilingual corpora. The methods do not depend on parallel texts: they use only monolingual corpora and a bilingual dictionary, which is needed to map co-occurrence data across languages. Nor do they require the newly discovered cognates or false friends to be present in the dictionary, so they can operate on out-of-vocabulary expressions. The methods are evaluated on English, French, German and Spanish corpora to identify English-French, English-German, English-Spanish and French-Spanish pairs of cognates or false friends. The experiments were performed in two settings: (i) an 'ideal' extraction setting, in which the evaluation data contains only cognates and false friends, and (ii) a real-world extraction scenario, in which cognates and false friends must first be identified among the words found in two comparable corpora in different languages. The evaluation shows that the methods identify cognates and false friends with very satisfactory recall and precision, and that methods which incorporate background semantic knowledge, in addition to co-occurrence data obtained from the corpora, deliver the best results.
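
The general idea can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the similarity measures used, so the longest-common-subsequence ratio, the cosine measure, both thresholds and all function names below are assumptions chosen for illustration. The sketch first filters word pairs by orthographic similarity, then maps the source word's co-occurrence counts into the target language through a seed dictionary and compares the two context vectors. Only the context words need dictionary entries, so the candidate words themselves may be out of vocabulary.

```python
from collections import Counter
from math import sqrt
from typing import Mapping, Sequence


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of two strings."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def orthographic_similarity(a: str, b: str) -> float:
    """LCS ratio (assumed measure): LCS length over the longer word's length."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))


def map_contexts(contexts: Mapping[str, int],
                 dictionary: Mapping[str, Sequence[str]]) -> Counter:
    """Translate a co-occurrence vector into the target language.

    Context words missing from the dictionary are dropped.
    """
    mapped = Counter()
    for word, count in contexts.items():
        for translation in dictionary.get(word, ()):
            mapped[translation] += count
    return mapped


def cosine(u: Mapping[str, int], v: Mapping[str, int]) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = (sqrt(sum(x * x for x in u.values()))
            * sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


def classify_pair(src: str, tgt: str,
                  src_contexts: Mapping[str, int],
                  tgt_contexts: Mapping[str, int],
                  dictionary: Mapping[str, Sequence[str]],
                  ortho_threshold: float = 0.6,   # hypothetical value
                  sem_threshold: float = 0.5) -> str:  # hypothetical value
    """Label a word pair as 'unrelated', 'cognate' or 'false friend'."""
    if orthographic_similarity(src, tgt) < ortho_threshold:
        return "unrelated"
    mapped = map_contexts(src_contexts, dictionary)
    if cosine(mapped, tgt_contexts) >= sem_threshold:
        return "cognate"
    return "false friend"
```

With toy co-occurrence counts, a pair such as English "library" and French "librairie" passes the orthographic filter but has little context overlap after mapping, so the sketch labels it a false friend. In practice the thresholds would have to be tuned on held-out data, and the methods in the paper that add background semantic knowledge go beyond what this sketch covers.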