Methods for extracting and classifying pairs of cognates and false friends

  • Authors:
  • Ruslan Mitkov; Viktor Pekar; Dimitar Blagoev; Andrea Mulloni

  • Affiliations:
  • Research Institute for Information and Language Processing, University of Wolverhampton, Wolverhampton, UK WV1 1SB; Research Institute for Information and Language Processing, University of Wolverhampton, Wolverhampton, UK WV1 1SB; Mathematics and Informatics Department, University of Plovdiv, Plovdiv, Bulgaria 4003; Research Institute for Information and Language Processing, University of Wolverhampton, Wolverhampton, UK WV1 1SB

  • Venue:
  • Machine Translation
  • Year:
  • 2007

Abstract

The identification of cognates has attracted the attention of researchers working in Natural Language Processing, but the identification of false friends remains under-researched. This paper proposes novel methods for the automatic identification of both cognates and false friends from comparable bilingual corpora. The methods do not depend on the existence of parallel texts; they use only monolingual corpora and a bilingual dictionary, which is needed to map co-occurrence data across languages. In addition, the methods do not require that the newly discovered cognates or false friends be present in the dictionary and hence are capable of operating on out-of-vocabulary expressions. The methods are evaluated on English, French, German and Spanish corpora in order to identify English-French, English-German, English-Spanish and French-Spanish pairs of cognates or false friends. The experiments were performed in two settings: (i) assuming 'ideal' extraction of cognates and false friends from plain-text corpora, i.e. when the evaluation data contains only cognates and false friends, and (ii) a real-world extraction scenario in which cognates and false friends first have to be identified among words found in two comparable corpora in different languages. The evaluation results show that the developed methods identify cognates and false friends with very satisfactory recall and precision, and that methods incorporating background semantic knowledge, in addition to co-occurrence data obtained from the corpora, deliver the best results.
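
The abstract does not spell out the algorithms, so the following is only an illustrative sketch of the general idea it names: mapping co-occurrence data across languages through a bilingual dictionary and comparing the result. The toy corpora, the window size, the cosine measure and all function names are assumptions made for illustration, not the authors' implementation. Note that the candidate words themselves ("nation", "library", "librairie") do not need to appear in the dictionary; only their context words do.

```python
from collections import Counter
from math import sqrt


def context_vector(tokens, target, window=2):
    """Count words occurring within +/- `window` tokens of `target`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            counts.update(t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i)
    return counts


def translate_vector(vec, dictionary):
    """Map a context vector into the other language; unknown words are dropped."""
    mapped = Counter()
    for word, n in vec.items():
        for translation in dictionary.get(word, ()):
            mapped[translation] += n
    return mapped


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy "comparable corpora" (hypothetical data, far too small for real use).
en_tokens = ["citizens", "vote", "nation", "president",
             "students", "borrow", "library", "books"]
fr_tokens = ["citoyens", "votent", "nation", "président",
             "clients", "achètent", "librairie", "livres"]

# Toy English-to-French dictionary covering context words only.
en_fr = {
    "citizens": ["citoyens"], "vote": ["votent"], "president": ["président"],
    "students": ["étudiants"], "borrow": ["empruntent"], "books": ["livres"],
}


def cross_lingual_similarity(en_word, fr_word):
    en_vec = translate_vector(context_vector(en_tokens, en_word), en_fr)
    fr_vec = context_vector(fr_tokens, fr_word)
    return cosine(en_vec, fr_vec)


sim_nation = cross_lingual_similarity("nation", "nation")        # true cognate
sim_library = cross_lingual_similarity("library", "librairie")   # false friend
print(sim_nation, sim_library)
```

In this toy setting the orthographically similar pair with similar translated contexts scores higher than the pair whose contexts diverge, which is the kind of signal a classifier could use to separate cognates from false friends. How the score is thresholded or combined with background semantic knowledge is left open here, since the abstract gives no details.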