A software tool for building a statistical prefix processor

  • Authors:
  • Nikitas Karanikolas; Michael Vassilakopoulos; Nektarios Giannoulis

  • Affiliations:
  • TEI of Athens, Aigaleo, Greece; University of Central Greece, Lamia, Greece; TEI of Athens, Aigaleo, Greece

  • Venue:
  • Proceedings of the Fifth Balkan Conference in Informatics
  • Year:
  • 2012

Abstract

Information Retrieval and Text Classification both need to match words in the user's input against words in the documents of a text collection. Matching words is not trivial, because words have grammatical (inflectional and derivational) variations. There are two main approaches to matching inflected words: stemming (removing word suffixes drawn from an ad hoc list of suffixes) and lemmatizing (replacing the inflected form with the base form of the word). However, these approaches normalize word variations at the right end of the word. We claim that it is beneficial to additionally normalize words at the left end, by removing prefixes. In this report, we present the architecture and functioning of a software tool that can be used as the first stage of a Statistical Prefix Processor, a system that could effectively remove prefixes from words and act as a preprocessing stage for text analysis applications. The tool consists of two stages (subtools). During the first stage, possible prefixes of words within a collection of texts are identified. During the second stage, a number of users (native speakers) process the text collection, automatically locate the words that contain each stem, and characterize the prefixes used with each stemmed word. After all users have processed the text collection, statistical conclusions can be drawn for each stemmed word and its associated prefixes.
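
The two stages described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state how possible prefixes are identified, so the heuristic used here (a leading substring counts as a possible prefix when the remainder of the word also occurs in the collection) is an assumption, as are the function names `candidate_prefixes` and `aggregate_judgments`, the length thresholds, and the judgment labels.

```python
from collections import Counter, defaultdict


def candidate_prefixes(words, min_prefix=2, min_stem=3):
    """Stage 1 (assumed heuristic): count leading substrings whose
    removal leaves another word of the collection's vocabulary."""
    vocab = {w.lower() for w in words}
    counts = Counter()
    for word in vocab:
        # Try every split point that respects both minimum lengths.
        for cut in range(min_prefix, len(word) - min_stem + 1):
            prefix, rest = word[:cut], word[cut:]
            if rest in vocab:
                counts[prefix] += 1
    return counts


def aggregate_judgments(judgments):
    """Stage 2 (assumed data model): tally the labels that users
    assigned to each (stem, prefix) pair.

    `judgments` is an iterable of (user, stem, prefix, label) tuples;
    the label values are hypothetical.
    """
    stats = defaultdict(Counter)
    for _user, stem, prefix, label in judgments:
        stats[(stem, prefix)][label] += 1
    return stats
```

With the English toy vocabulary `["write", "rewrite", "view", "review", "preview", "happy", "unhappy"]`, `candidate_prefixes` counts `re` twice and `pre` and `un` once each. The per-pair label counts returned by `aggregate_judgments` are the kind of raw material from which statistical conclusions about each stemmed word and its prefixes could be drawn.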