We describe an entirely statistics-based, unsupervised, and language-independent approach to multilingual information retrieval, which we call Latent Morpho-Semantic Analysis (LMSA). LMSA overcomes some of the shortcomings of related previous approaches such as Latent Semantic Analysis (LSA). LMSA has an important theoretical advantage over LSA: it combines well-known techniques in a novel way to break the terms of LSA down into units which correspond more closely to morphemes. Thus, it has a particular appeal for use with morphologically complex languages such as Arabic. We show through empirical results that the theoretical advantages of LMSA can translate into significant gains in precision in multilingual information retrieval tests. These gains are not matched either when a standard stemmer is used with LSA, or when terms are indiscriminately broken down into n-grams.
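The abstract describes LMSA as breaking the terms of LSA into morpheme-like sub-word units before the usual latent-space projection. A minimal sketch of that pipeline follows, with two loudly labeled assumptions: word segmentation is approximated here by overlapping character n-grams (the paper's actual segmentation is statistics-based and unsupervised, not this crude stand-in), and the latent projection is a plain truncated SVD of the unit-by-document count matrix, the standard LSA step. All names (`segment`, `lmsa_like_space`) are hypothetical illustrations, not the authors' code.

```python
import numpy as np
from collections import Counter

def segment(word, n=3):
    """Split a word into overlapping character n-grams, padded with '#'.
    This is only a stand-in for the morpheme-like units LMSA derives."""
    w = f"#{word}#"
    if len(w) <= n:
        return [w]
    return [w[i:i + n] for i in range(len(w) - n + 1)]

def lmsa_like_space(docs, k=2, n=3):
    """Build a sub-word-unit-by-document count matrix, then project the
    documents into a k-dimensional latent space via truncated SVD
    (the LSA step applied to sub-word units instead of whole terms)."""
    unit_counts = [
        Counter(u for word in doc.lower().split() for u in segment(word, n))
        for doc in docs
    ]
    vocab = sorted(set().union(*unit_counts))
    # Rows are sub-word units, columns are documents.
    X = np.array([[c[u] for c in unit_counts] for u in vocab], dtype=float)
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Document coordinates: top-k right singular vectors, scaled.
    return Vt[:k].T * s[:k]

def cos(a, b):
    """Cosine similarity between two latent-space vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

docs = [
    "retrieval of documents",
    "document retrieval systems",
    "morphology of natural language",
]
coords = lmsa_like_space(docs)
# Documents sharing sub-word units (here, 'retrieval'/'document' variants)
# end up closer in the latent space than unrelated ones.
```

The point of the sketch is the same as in the abstract: because the matrix rows are sub-word units rather than whole terms, morphological variants such as "document" and "documents" share most of their rows, which matters most for morphologically complex languages like Arabic.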