Swordfish: an unsupervised Ngram based approach to morphological analysis

Authors:
Chris Jordan;John Healy;Vlado Keselj
Affiliations:
Dalhousie University, Halifax, NS, Canada;Dalhousie University, Halifax, NS, Canada;Dalhousie University, Halifax, NS, Canada
Venue:
SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Year:
2006

Citing 3
Cited 0

Modern Information Retrieval

Modern Information Retrieval
Unsupervised learning of the morphology of a natural language

Computational Linguistics
MARSYAS: a framework for audio analysis

Organised Sound

Quantified Score

Hi-index	0.00

Visualization

Abstract

Extracting morphemes from words is a nontrivial task. Rule based stemming approaches such as Porter's algorithm have encountered some success, however they are restricted by their ability to identify a limited number of affixes and are language dependent. When dealing with languages with many affixes, rule based approaches generally require many more rules to deal with all the possible word forms. Deriving these rules requires a larger effort on the part of linguists and in some instances can be simply impractical. We propose an unsupervised ngram based approach, named Swordfish. Using ngram probabilities in the corpus, possible morphemes are identified. We look at two possible methods for identifying candidate morphemes, one using joint probabilities between two ngrams, and the second based on log odds between prefix probabilities. Initial results indicate the joint probability approach to be better for English while the prefix ratio approach is better for Finnish and Turkish.