A software tool for building a statistical prefix processor

Authors:
Nikitas Karanikolas; Michael Vassilakopoulos; Nektarios Giannoulis
Affiliations:
TEI of Athens, Aigaleo, Greece; University of Central Greece, Lamia, Greece; TEI of Athens, Aigaleo, Greece
Venue:
Proceedings of the Fifth Balkan Conference in Informatics
Year:
2012

Abstract

Information Retrieval and Text Classification both need to match words in the user's input against words in the documents of a text collection. Matching words is not a trivial process, since words have grammatical (inflectional and derivational) variations. There are two main approaches to matching inflected words: stemming (removing word suffixes according to an ad hoc selection of suffixes) and lemmatizing (replacing the inflected form with the base form of the word). However, both approaches normalize word variations only at the right end of the word. We claim that it is beneficial to additionally normalize words at the left end, by removing word prefixes. In this report, we present the architecture and functioning of a software tool that can be used as the first stage of a Statistical Prefix Processor, a system that could effectively remove prefixes from words and act as a preprocessing stage for text analysis applications. The tool consists of two stages (subtools). During the first stage, possible prefixes of words within a collection of texts are identified. During the second stage, a number of users (native speakers) process the text collection, automatically locating the words that contain each stem and characterizing the prefixes used with each stemmed word. After the text collection has been processed by all users, statistical conclusions can be drawn for each stemmed word and its associated prefixes.
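The two stages described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how candidate prefixes are detected or how user judgments are recorded, so the heuristic below (a word counts as prefix + stem when the remainder is itself a word of the collection), the function names `candidate_prefixes` and `tally_judgments`, the length thresholds, and the judgment labels are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def candidate_prefixes(words, min_stem_len=3, max_prefix_len=4):
    """Stage 1 (assumed heuristic): count leading substrings p such that
    word == p + s and s is itself a word of the collection."""
    vocab = {w.lower() for w in words}
    counts = Counter()
    for word in vocab:
        for k in range(1, max_prefix_len + 1):
            prefix, rest = word[:k], word[k:]
            # Keep the split only if the remainder is long enough
            # and occurs as a word in its own right.
            if len(rest) >= min_stem_len and rest in vocab:
                counts[prefix] += 1
    return counts


def tally_judgments(judgments):
    """Stage 2 aggregation (assumed data shape): judgments is an iterable of
    (user, stem, prefix, label) tuples; returns votes per (stem, prefix)."""
    tally = defaultdict(Counter)
    for _user, stem, prefix, label in judgments:
        tally[(stem, prefix)][label] += 1
    return dict(tally)


# Toy English vocabulary, used only to exercise the sketch.
words = ["write", "rewrite", "tie", "untie", "read", "reread",
         "lock", "unlock", "under"]
prefix_counts = candidate_prefixes(words)

# Hypothetical judgments from three native speakers.
judgments = [
    ("user1", "write", "re", "true prefix"),
    ("user2", "write", "re", "true prefix"),
    ("user3", "write", "re", "not a prefix"),
]
votes = tally_judgments(judgments)
```

On this toy vocabulary, `prefix_counts` holds `re` and `un` with two supporting words each, while `under` contributes nothing because `der` is not in the vocabulary. The per-pair vote counts in `votes` are the kind of raw material from which statistical conclusions about each stemmed word and its prefixes could be drawn.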