Word Sense Disambiguation of Farsi Homographs Using Thesaurus and Corpus

Authors:
Raheleh Makki;Mohammad Mehdi Homayounpour
Affiliations:
Laboratory for Intelligent Sound and Speech Processing, Department of Computer Engineering and Information Technology, Amirkabir University of Technology, Tehran, Iran;Laboratory for Intelligent Sound and Speech Processing, Department of Computer Engineering and Information Technology, Amirkabir University of Technology, Tehran, Iran
Venue:
GoTAL '08 Proceedings of the 6th international conference on Advances in Natural Language Processing
Year:
2008

References (6)

Introduction to the special issue on word sense disambiguation: the state of the art. Computational Linguistics - Special issue on word sense disambiguation
Decision lists for lexical ambiguity resolution: application to accent restoration in Spanish and French. ACL '94 Proceedings of the 32nd annual meeting on Association for Computational Linguistics
Integrating multiple knowledge sources to disambiguate word sense: an exemplar-based approach. ACL '96 Proceedings of the 34th annual meeting on Association for Computational Linguistics
Estimating upper and lower bounds on the performance of word-sense disambiguation programs. ACL '92 Proceedings of the 30th annual meeting on Association for Computational Linguistics
Word-sense disambiguation using statistical models of Roget's categories trained on large corpora. COLING '92 Proceedings of the 14th conference on Computational linguistics - Volume 2
Introduction to Information Retrieval

Cited by (2)

Cross-lingual word sense disambiguation for languages with scarce resources. Canadian AI'11 Proceedings of the 24th Canadian conference on Advances in artificial intelligence
Towards automatic acquisition of a fully sense tagged corpus for Persian. ISMIS'11 Proceedings of the 19th international conference on Foundations of intelligent systems

Abstract

This paper describes the disambiguation of Farsi homographs in unrestricted text using a thesaurus and a corpus. The proposed method is based on [1], with two differences. First, collocational information is used to avoid collecting spurious contexts caused by polysemous words in thesaurus categories. Second, all words in the test-data context, even those that did not appear in the collected contexts, contribute to the calculation of the conceptual classes' scores. Using a Farsi corpus and a Farsi thesaurus, the method correctly disambiguated 91.46% of the instances of 15 Farsi homographs. It was compared with three supervised corpus-based methods: Naïve Bayes, exemplar-based, and decision list. Unlike the supervised methods, it needs no training data and performs well on uncommon words. In addition, it can be used to remove some kinds of morphological ambiguity.
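
The class-scoring idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a generic scheme in which each conceptual class is scored by summing log-ratios of a word's smoothed probability within the class to its smoothed overall probability, and it assumes additive smoothing as one possible way to let words absent from the collected contexts still contribute. The collocation-based filtering of contexts is not shown. The function names, the `alpha` parameter, and the English toy data are all hypothetical.

```python
import math
from collections import Counter


def train_class_models(class_contexts):
    """Count context words per conceptual class and over all classes.

    class_contexts maps a class name to the list of words collected
    around members of that class in an untagged corpus (toy data here).
    """
    class_counts = {c: Counter(words) for c, words in class_contexts.items()}
    global_counts = Counter()
    for counts in class_counts.values():
        global_counts.update(counts)
    return class_counts, global_counts


def score_classes(context, class_counts, global_counts, alpha=1.0):
    """Score each class as a sum of smoothed log-likelihood ratios.

    Additive smoothing (alpha) is an assumption of this sketch: it gives
    every context word, seen or unseen, a nonzero probability so that it
    contributes a term to every class score.
    """
    vocab_size = len(global_counts) + 1  # one extra slot for unseen words
    global_total = sum(global_counts.values())
    scores = {}
    for c, counts in class_counts.items():
        class_total = sum(counts.values())
        score = 0.0
        for w in context:
            p_w_class = (counts[w] + alpha) / (class_total + alpha * vocab_size)
            p_w = (global_counts[w] + alpha) / (global_total + alpha * vocab_size)
            score += math.log(p_w_class / p_w)
        scores[c] = score
    return scores


def disambiguate(context, class_counts, global_counts):
    """Return the class with the highest score for the given context."""
    scores = score_classes(context, class_counts, global_counts)
    return max(scores, key=scores.get)


# Hypothetical toy contexts for two senses of the English homograph "bass".
toy_contexts = {
    "MUSIC": ["guitar", "play", "sound", "band", "play"],
    "ANIMAL": ["fish", "river", "catch", "water", "fish"],
}
class_counts, global_counts = train_class_models(toy_contexts)
chosen = disambiguate(["catch", "river", "yesterday"], class_counts, global_counts)
```

In this toy setup the word "yesterday" never occurs in the collected contexts, yet it still adds a term to both class scores. The choice of smoothing and of the probability ratio are design assumptions of the sketch, not details taken from the paper.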