Poor man’s stemming: unsupervised recognition of same-stem words

Authors:
Harald Hammarström
Affiliations:
Chalmers University, Gothenburg, Sweden
Venue:
AIRS'06 Proceedings of the Third Asia conference on Information Retrieval Technology
Year:
2006



Abstract

We present a new, fully unsupervised stemming algorithm that requires no human intervention and applies to an open class of languages. Because it relies on no large existing data collections and on no linguistic resources other than raw text, it is especially attractive for low-density languages. The stemming problem is formulated as a decision on whether two given words are variants of the same stem; if they are, the two must stand in a concatenative relation. The underlying theory makes no assumptions about whether the language uses a lot of morphology, whether it is prefixing or suffixing, or whether its affixes are long or short. It does, however, assume that (1) salient affixes are frequent, (2) words are essentially variable-length sequences of random characters, and (3) a heuristic for what constitutes a systematic affix alteration is valid. Tested on four typologically distant languages, the stemmer shows very promising results in an evaluation against a human-made gold standard.
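
The problem formulation in the abstract (deciding whether two words are concatenative variants of one stem, with frequent affixes treated as salient) can be illustrated with a toy sketch. This is not the paper's algorithm: it handles suffixes only, it ignores the random-character model and the affix-alteration heuristic, and the names `suffix_counts` and `same_stem` as well as the thresholds `max_len`, `min_count`, and `min_stem` are hypothetical choices made for illustration.

```python
from collections import Counter


def suffix_counts(words, max_len=4):
    """Count how many distinct words end in each candidate suffix.

    The empty suffix is counted once per word so that a bare stem
    ("walk") can be paired with a suffixed form ("walks").
    max_len is an assumed cap on suffix length, not a value from the paper.
    """
    counts = Counter()
    for w in set(words):
        counts[""] += 1
        # Leave at least one character for the stem.
        for k in range(1, min(max_len, len(w) - 1) + 1):
            counts[w[-k:]] += 1
    return counts


def same_stem(a, b, counts, min_count=2, min_stem=3):
    """Return True if a = stem + x and b = stem + y for frequent suffixes x, y.

    min_count and min_stem are illustrative thresholds (assumptions).
    """
    # Length of the longest common prefix of a and b.
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    # Try each shared prefix of sufficient length as the candidate stem.
    for stem_len in range(n, min_stem - 1, -1):
        x, y = a[stem_len:], b[stem_len:]
        if counts[x] >= min_count and counts[y] >= min_count:
            return True
    return False
```

Given counts built from a small word list such as `["walk", "walks", "walked", "talk", "talks", "talked"]`, calling `same_stem("walks", "walked", counts)` accepts the pair because both remainders (`s` and `ed`) recur across the list, while `same_stem("walk", "talk", counts)` rejects it because the two words share no prefix to serve as a stem.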