Measuring spelling similarity for cognate identification

Authors:
Luís Gomes;José Gabriel Pereira Lopes
Affiliations:
Centro de Informática e Tecnologias da Informação, Universidade Nova de Lisboa, Caparica, Portugal;Centro de Informática e Tecnologias da Informação, Universidade Nova de Lisboa, Caparica, Portugal
Venue:
EPIA'11 Proceedings of the 15th Portuguese conference on Progress in artificial intelligence
Year:
2011

Abstract

The most commonly used measures of string similarity, such as the Longest Common Subsequence Ratio (LCSR) and those based on Edit Distance, take into account only the number of matched and mismatched characters. However, we observe that cognates in a given pair of languages exhibit recurrent spelling differences, such as "ph" and "f" in the English-Portuguese cognates "phase" and "fase". These differences are attributable to the evolution of the spelling rules of each language over time, so if word similarity is used as an indicator of cognaticity, they should not be penalized in the same way as the arbitrary differences found in non-cognate words. This paper describes SpSim, a new spelling similarity measure for cognate identification that tolerates characteristic spelling differences automatically extracted from a set of cognates known a priori. Compared to LCSR and EdSim (Edit Distance-based similarity), SpSim yields an F-measure 10% higher when used for cognate identification on five different language pairs.
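
The two baseline measures named in the abstract can be illustrated with a minimal Python sketch. It assumes the commonly used definitions, LCSR as the length of the longest common subsequence divided by the length of the longer word, and EdSim as one minus the Levenshtein distance divided by the length of the longer word; the abstract does not spell these out, and the function names are illustrative. The sketch does not implement SpSim, whose definition is not given here.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # delete ch_a
                            curr[j - 1] + 1,      # insert ch_b
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def lcsr(a: str, b: str) -> float:
    """Assumed definition: LCS length over the length of the longer word."""
    longest = max(len(a), len(b))
    return lcs_length(a, b) / longest if longest else 1.0


def edsim(a: str, b: str) -> float:
    """Assumed definition: 1 - edit distance over the length of the longer word."""
    longest = max(len(a), len(b))
    return 1.0 - edit_distance(a, b) / longest if longest else 1.0


if __name__ == "__main__":
    print(lcsr("phase", "fase"), edsim("phase", "fase"))
```

Under these assumed definitions, "phase" and "fase" share the subsequence "ase" and differ by two edits, so both measures score the pair at about 0.6. The regular "ph"/"f" correspondence is penalized exactly like any arbitrary mismatch, which is the limitation the abstract describes.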