Comparison of semantic similarity for different languages using the Google n-gram corpus and second-order co-occurrence measures

Authors:
Colette Joubarne;Diana Inkpen
Affiliations:
School of Information Technology and Engineering, University of Ottawa, ON, Canada;School of Information Technology and Engineering, University of Ottawa, ON, Canada
Venue:
Canadian AI'11 Proceedings of the 24th Canadian conference on Advances in artificial intelligence
Year:
2011

References:

Contextual correlates of synonymy
Communications of the ACM

Semantic similarity for detecting recognition errors in automatic speech transcripts
HLT '05 Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing

Computing semantic relatedness using Wikipedia-based explicit semantic analysis
IJCAI'07 Proceedings of the 20th international joint conference on Artificial intelligence

How well do semantic relatedness measures perform?: a meta-study
STEP '08 Proceedings of the 2008 Conference on Semantics in Text Processing

Cross-lingual semantic relatedness using encyclopedic knowledge
EMNLP '09 Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 3

Abstract

Despite the growth in digitization of data, many languages still lack corpora large enough to yield valid measures of semantic similarity. If it could be shown that manually assigned similarity scores from one language can be transferred to another language, then semantic similarity values could be used for languages with fewer resources. We test an automatic word similarity measure based on second-order co-occurrences in the Google n-gram corpus for English, German, and French. We show that the scores manually assigned in Rubenstein and Goodenough's experiments for 65 English word pairs can be transferred directly to German and French. We do this by conducting human evaluation experiments for French word pairs (and by using similarly produced scores for German). We show that the correlation between the automatically assigned semantic similarity scores and the scores assigned by human evaluators is not very different when using the Rubenstein and Goodenough scores across languages, compared to using the language-specific scores.
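
The evaluation idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' measure: it assumes a plain cosine over co-occurrence vectors as a stand-in for a second-order co-occurrence measure, and the n-gram counts, the human scores, and the names NGRAMS, HUMAN_SCORES, context_vectors, cosine, and pearson are all hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical toy 3-gram counts standing in for a real n-gram corpus.
NGRAMS = {
    ("drive", "the", "car"): 40,
    ("drive", "the", "automobile"): 25,
    ("park", "the", "car"): 30,
    ("park", "the", "automobile"): 20,
    ("eat", "the", "fruit"): 35,
    ("peel", "the", "fruit"): 15,
    ("park", "the", "fruit"): 1,
}

# Hypothetical human similarity scores, NOT the Rubenstein and Goodenough data.
HUMAN_SCORES = [
    ("car", "automobile", 3.9),
    ("car", "fruit", 0.3),
    ("automobile", "fruit", 0.2),
]


def context_vectors(ngrams):
    """Map each word to the counts of words it co-occurs with in an n-gram."""
    vectors = defaultdict(lambda: defaultdict(int))
    for tokens, count in ngrams.items():
        for i, word in enumerate(tokens):
            for j, other in enumerate(tokens):
                if i != j:
                    vectors[word][other] += count
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse count vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


vectors = context_vectors(NGRAMS)
# Words are compared through the contexts they share, not direct co-occurrence.
automatic = [cosine(vectors[a], vectors[b]) for a, b, _ in HUMAN_SCORES]
human = [score for _, _, score in HUMAN_SCORES]
correlation = pearson(automatic, human)
print(f"correlation with human scores: {correlation:.3f}")
```

Under these assumptions, the cross-language comparison would amount to computing the correlation twice for a target language: once against scores collected in that language and once against the English scores applied to the translated word pairs.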