Optimal stem identification in presence of suffix list

Authors:
N. Vasudevan;Pushpak Bhattacharyya
Affiliations:
Computer Science and Engg Department, IIT Bombay, Mumbai, India;Computer Science and Engg Department, IIT Bombay, Mumbai, India
Venue:
CICLing'12 Proceedings of the 13th international conference on Computational Linguistics and Intelligent Text Processing - Volume Part I
Year:
2012

References cited: 10
Cited by: 0

Unsupervised learning of the morphology of a natural language
Computational Linguistics

Memory-based morphological analysis
ACL '99 Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics

Unsupervised learning of morphology for English and Inuktitut
NAACL-Short '03 Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: companion volume of the Proceedings of HLT-NAACL 2003--short papers - Volume 2

Unsupervised learning of morphology using a novel directed search algorithm: taking the first step
MPL '02 Proceedings of the ACL-02 workshop on Morphological and phonological learning - Volume 6

Unsupervised models for morpheme segmentation and morphology learning
ACM Transactions on Speech and Language Processing (TSLP)

A naive theory of affixation and an algorithm for extraction
SIGPHON '06 Proceedings of the Eighth Meeting of the ACL Special Interest Group on Computational Phonology and Morphology

Graphical models over multiple strings
EMNLP '09 Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1 - Volume 1

ParaMor and Morpho challenge 2008
CLEF'08 Proceedings of the 9th Cross-language evaluation forum conference on Evaluating systems for multilingual and multimodal information access

Unsupervised learning of morphology
Computational Linguistics

Poor man’s stemming: unsupervised recognition of same-stem words
AIRS'06 Proceedings of the Third Asia conference on Information Retrieval Technology

Abstract

Stemming is considered crucial in many NLP and IR applications. In the absence of any linguistic information, stemming is a challenging task; stemming with the suffixes of a language available as linguistic information is, in comparison, an easier problem. In this work we treat stemming as the process of obtaining a lexicon with the minimum number of entries from an unannotated corpus by using a suffix set. We prove that the exact lexicon reduction problem is NP-hard and give a polynomial-time approximation. We also propose a probabilistic model for stemming that minimizes the entropy of the stem distribution. The performance of these models is analyzed using an unannotated corpus and a suffix set of Malayalam, a morphologically rich language of India belonging to the Dravidian family.
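
To make the lexicon reduction idea concrete, the following is a minimal sketch and not the authors' algorithm. The abstract does not describe the polynomial-time approximation, so a standard greedy set-cover heuristic is substituted here as an assumption. The function names (`candidate_stems`, `greedy_lexicon`, `stem_entropy`), the tie-breaking rule, and the use of base-2 logarithms are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def candidate_stems(word, suffixes):
    """Return every stem reachable by stripping one listed suffix.

    The word itself is always a candidate (empty suffix), so every
    word can be covered by at least one stem.
    """
    stems = {word}
    for suffix in suffixes:
        if suffix and word.endswith(suffix) and len(word) > len(suffix):
            stems.add(word[: -len(suffix)])
    return stems


def greedy_lexicon(words, suffixes):
    """Assign each distinct word a stem while keeping the stem set small.

    Assumption: this is a generic greedy set-cover heuristic, used here
    only to illustrate the problem; it is not the paper's method.
    """
    candidates = {w: candidate_stems(w, suffixes) for w in set(words)}
    uncovered = set(candidates)
    assignment = {}
    while uncovered:
        # Count how many still-uncovered words each candidate stem explains.
        coverage = Counter(s for w in uncovered for s in candidates[w])
        # Prefer higher coverage, then longer stems, then alphabetical
        # order so the result is deterministic.
        best = min(coverage, key=lambda s: (-coverage[s], -len(s), s))
        for w in [w for w in uncovered if best in candidates[w]]:
            assignment[w] = best
            uncovered.discard(w)
    return assignment


def stem_entropy(words, assignment):
    """Shannon entropy (in bits) of the stem distribution over word tokens."""
    counts = Counter(assignment[w] for w in words)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

For example, with the illustrative English words `walk`, `walks`, `walked`, `walking`, `talk`, `talks`, `talked` and the suffix set `s`, `ed`, `ing`, the sketch reduces seven word forms to the two stems `walk` and `talk`. The entropy function only measures the stem distribution of a given assignment; it does not perform the minimization that the proposed probabilistic model carries out.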