Multilingual word sense discrimination: a comparative cross-linguistic study

  • Authors:
  • Alla Rozovskaya; Richard Sproat

  • Affiliations:
  • Univ. of Illinois at Urbana-Champaign, Urbana, IL; Univ. of Illinois at Urbana-Champaign, Urbana, IL

  • Venue:
  • ACL '07 Proceedings of the Workshop on Balto-Slavonic Natural Language Processing: Information Extraction and Enabling Technologies
  • Year:
  • 2007

Abstract

We describe a study that evaluates an approach to Word Sense Discrimination on three languages with different linguistic structures: English, Hebrew, and Russian. The goal of the study is to determine whether performance differs significantly across the languages and to identify language-specific problems. The algorithm is tested on semantically ambiguous words using data from Wikipedia, an online encyclopedia. We evaluate the induced clusters against manually created sense clusters. The results suggest a correlation between the algorithm's performance and the morphological complexity of the language. In particular, we obtain F-scores of 0.68, 0.66, and 0.61 for English, Hebrew, and Russian, respectively. We also perform an experiment on Russian in which the context terms are lemmatized. The lemma-based approach significantly improves on the word-based approach, increasing the F-score by 16%. This result demonstrates the importance of morphological analysis for this task in morphologically rich languages such as Russian.
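
To illustrate how induced clusters can be scored against manually created sense clusters, here is a minimal Python sketch. The abstract does not define the F-score it reports, so the sketch assumes one common clustering F-measure: each gold sense is matched to the induced cluster that gives it the highest F-score, and the per-sense scores are averaged with weights proportional to sense size. The function name `clustering_fscore` and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import Counter


def clustering_fscore(gold, induced):
    """Weighted best-match F-score of induced clusters against gold senses.

    ASSUMPTION: this is one common clustering F-measure, not necessarily
    the exact measure used in the paper.
    gold[i] is the manual sense label of instance i; induced[i] is the
    cluster the algorithm assigned to the same instance.
    """
    if len(gold) != len(induced):
        raise ValueError("gold and induced must have the same length")
    total = len(gold)
    if total == 0:
        return 0.0

    sense_sizes = Counter(gold)
    cluster_sizes = Counter(induced)
    overlap = Counter(zip(gold, induced))  # instances shared by (sense, cluster)

    # For each gold sense, keep the best F-score over all induced clusters.
    best = {sense: 0.0 for sense in sense_sizes}
    for (sense, cluster), shared in overlap.items():
        precision = shared / cluster_sizes[cluster]
        recall = shared / sense_sizes[sense]
        f = 2 * precision * recall / (precision + recall)
        best[sense] = max(best[sense], f)

    # Weight each sense by its share of the instances.
    return sum(sense_sizes[s] / total * best[s] for s in sense_sizes)


# Toy data: four occurrences of an ambiguous word with two manual senses.
gold_senses = ["river", "river", "money", "money"]
perfect = clustering_fscore(gold_senses, [0, 0, 1, 1])
partial = clustering_fscore(gold_senses, [0, 0, 0, 1])
```

With these toy labels, a clustering that reproduces the manual senses scores 1.0, while moving one "money" occurrence into the "river" cluster lowers the score. Weighting by sense size keeps frequent senses from being swamped by rare ones, but other weightings are equally defensible.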