Multilingual word sense discrimination: a comparative cross-linguistic study

Authors:
Alla Rozovskaya;Richard Sproat
Affiliations:
Univ. of Illinois at Urbana-Champaign, Urbana, IL;Univ. of Illinois at Urbana-Champaign, Urbana, IL
Venue:
ACL '07 Proceedings of the Workshop on Balto-Slavonic Natural Language Processing: Information Extraction and Enabling Technologies
Year:
2007

Abstract

We describe a study that evaluates an approach to Word Sense Discrimination on three languages with different linguistic structures: English, Hebrew, and Russian. The goal is to determine whether performance differs significantly across the languages and to identify language-specific problems. The algorithm is tested on semantically ambiguous words using data from Wikipedia, an online encyclopedia, and the induced clusters are evaluated against manually created sense clusters. The results suggest a correlation between the algorithm's performance and the morphological complexity of the language: we obtain FScores of 0.68, 0.66, and 0.61 for English, Hebrew, and Russian, respectively. We also perform an experiment on Russian in which the context terms are lemmatized. The lemma-based approach significantly improves on the word-based approach, increasing the FScore by 16%. This result demonstrates the importance of morphological analysis for this task in morphologically rich languages like Russian.
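
The abstract reports FScores for induced clusters measured against manually created sense clusters but does not give the formula. The sketch below illustrates one common clustering FScore, in which each gold sense is matched to its best-scoring induced cluster and the per-sense scores are weighted by sense size. That this is the exact measure used in the study is an assumption; the function name `clustering_fscore` and the toy labels are hypothetical.

```python
from collections import Counter


def clustering_fscore(gold, induced):
    """Size-weighted best-match F-score of induced clusters against gold senses.

    gold and induced are equal-length sequences giving, for each occurrence
    of an ambiguous word, its manual sense label and its induced cluster id.
    Assumption: this weighted best-match definition is illustrative only.
    """
    if len(gold) != len(induced):
        raise ValueError("gold and induced must have the same length")
    n = len(gold)
    sense_sizes = Counter(gold)
    cluster_sizes = Counter(induced)
    overlap = Counter(zip(gold, induced))  # co-occurrence counts

    total = 0.0
    for sense, sense_size in sense_sizes.items():
        best = 0.0
        for cluster, cluster_size in cluster_sizes.items():
            common = overlap[(sense, cluster)]
            if common == 0:
                continue
            precision = common / cluster_size
            recall = common / sense_size
            best = max(best, 2 * precision * recall / (precision + recall))
        # Weight each sense by its share of all instances.
        total += (sense_size / n) * best
    return total


# Hypothetical toy data: four occurrences of an ambiguous word, two senses.
gold_senses = ["river", "river", "finance", "finance"]
perfect_clusters = [0, 0, 1, 1]
print(clustering_fscore(gold_senses, perfect_clusters))
```

Under this definition, a clustering that reproduces the manual senses exactly scores 1.0, while merging all occurrences into a single cluster is penalized through lower precision.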