Indexing and retrieval of a Greek corpus

Authors:
Georgios Paltoglou;Michail Salampasis;Fotis Lazarinis
Affiliations:
University of Macedonia, Thessaloniki, Greece;Alexander Technological Educational Institute of Thessaloniki, Thessaloniki, Greece;Technological Educational Institute of Mesolonghi, Mesolonghi, Greece
Venue:
Proceedings of the 2nd ACM Workshop on Improving Non English Web Searching
Year:
2008

Abstract

Greek is one of the most difficult languages to handle in Web Information Retrieval (IR) tasks, because it is grammatically, morphologically and orthographically more complex than English, the lingua franca of IR. In this paper, we address a significant number of issues that arise from the Greek language. We use several techniques to determine the correct encoding of web pages written in Greek. We test the effect of using a Greek stopword list in a realistic and controlled Web environment. We employ a character mapping scheme to overcome the diversity of diacritics used in the language, such as accents and the diaeresis. We use word distance and fuzzy similarity metrics to compensate for the different forms in which nouns, verbs and articles appear because of conjugation and inflection, and additionally to handle queries in Greeklish, a transliterated form of Greek written in Latin characters. The conducted experiments show some effective ways to increase accuracy in Greek IR tasks.
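As an illustration only, the sketch below shows one way a character mapping for diacritics and an edit-distance similarity could be combined. It is not the authors' implementation: the function names (`strip_diacritics`, `edit_distance`, `similarity`), the use of Unicode decomposition as the mapping, the folding of final sigma, and the length-normalised score are all assumptions made for this example.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Lowercase, drop combining marks (accent, diaeresis), fold final sigma.

    Assumption: Unicode NFD decomposition stands in for the paper's
    character mapping scheme, whose exact table is not given here.
    """
    decomposed = unicodedata.normalize("NFD", text.lower())
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.replace("ς", "σ")


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Hypothetical fuzzy score in [0, 1]: 1 minus the normalised distance."""
    a, b = strip_diacritics(a), strip_diacritics(b)
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest
```

Under these assumptions, two inflected forms of the same noun, such as "άνθρωπος" and "ανθρώπου", differ only in their final letter after normalisation, so a threshold on `similarity` could treat them as matching the same query term. Choosing that threshold is left open here.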