Information Retrieval and Text Classification both need to match words in the user's input against words in the documents of a text collection. This matching is not trivial, because words have grammatical (inflectional and derivational) variations. There are two main approaches to matching inflected words: stemming (removing word suffixes chosen ad hoc) and lemmatizing (replacing the inflected form with the base form of the word). However, both approaches normalize word variations only on their rightmost side. We claim that it is beneficial to additionally normalize words on their left side, by removing prefixes. In this report, we present the architecture and functioning of a software tool that can serve as the first stage of a Statistical Prefix Processor, a system that could effectively remove prefixes from words and act as a preprocessing stage for text analysis applications. The tool consists of two stages (subtools). In the first stage, possible prefixes of words within a collection of texts are identified. In the second stage, a number of users (native speakers) process the text collection, automatically locate the words that contain each stem, and characterize the prefixes used with each stemmed word. Once all users have processed the text collection, statistical conclusions can be drawn for each stemmed word and its associated prefixes.
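The two stages described above can be illustrated with a minimal sketch. It is not the tool's actual implementation, which the abstract does not specify. The sketch assumes that a leading string counts as a possible prefix when the remainder of the word also occurs as a word in the collection, and that each user judgment is a (stem, prefix, label) triple. The function names, the length thresholds, and the label values are all hypothetical.

```python
import re
from collections import Counter, defaultdict


def candidate_prefixes(texts, min_stem_len=3, max_prefix_len=5):
    """Stage 1 (assumed heuristic): map each possible prefix to its stems.

    A leading string is treated as a possible prefix when the rest of the
    word is itself a word of the collection. Thresholds are hypothetical.
    """
    # Build the vocabulary: lowercased runs of letters, any alphabet.
    vocab = set()
    for text in texts:
        vocab.update(re.findall(r"[^\W\d_]+", text.lower()))

    prefixes = defaultdict(set)
    for word in vocab:
        # Try every split that leaves a remainder of at least min_stem_len.
        last_split = min(max_prefix_len, len(word) - min_stem_len)
        for i in range(1, last_split + 1):
            prefix, rest = word[:i], word[i:]
            if rest in vocab:
                prefixes[prefix].add(rest)
    return prefixes


def tally_judgments(judgments):
    """Stage 2 (assumed data model): count labels per (stem, prefix) pair.

    `judgments` is an iterable of (stem, prefix, label) triples, one per
    user decision. The label vocabulary is left to the annotation scheme.
    """
    counts = defaultdict(Counter)
    for stem, prefix, label in judgments:
        counts[(stem, prefix)][label] += 1
    return counts


if __name__ == "__main__":
    found = candidate_prefixes(["write rewrite", "happy unhappy",
                                "load reload unload"])
    for prefix in sorted(found):
        print(prefix, sorted(found[prefix]))
```

The vocabulary test is only a filter for proposing candidates. Deciding whether a candidate is a true prefix in a given word is exactly what the second stage leaves to native speakers, whose per-pair label counts can then be summarized statistically.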