Modern Information Retrieval
Unsupervised learning of the morphology of a natural language
Computational Linguistics
MARSYAS: a framework for audio analysis
Organised Sound
Hi-index | 0.00 |
Extracting morphemes from words is a nontrivial task. Rule based stemming approaches such as Porter's algorithm have encountered some success, however they are restricted by their ability to identify a limited number of affixes and are language dependent. When dealing with languages with many affixes, rule based approaches generally require many more rules to deal with all the possible word forms. Deriving these rules requires a larger effort on the part of linguists and in some instances can be simply impractical. We propose an unsupervised ngram based approach, named Swordfish. Using ngram probabilities in the corpus, possible morphemes are identified. We look at two possible methods for identifying candidate morphemes, one using joint probabilities between two ngrams, and the second based on log odds between prefix probabilities. Initial results indicate the joint probability approach to be better for English while the prefix ratio approach is better for Finnish and Turkish.