Searching strategies for the Hungarian language

Authors:
Jacques Savoy
Affiliations:
Computer Science Department, University of Neuchâtel, Rue Emile Argand 11, 2009 Neuchâtel, Switzerland
Venue:
Information Processing and Management: an International Journal
Year:
2008

Abstract

This paper reports on the underlying information retrieval (IR) problems encountered when dealing with the complex morphology and compound constructions found in the Hungarian language. It describes evaluations carried out on two general stemming strategies for this language, and also demonstrates that a light stemming approach could be quite effective. Based on searches done on the CLEF test collection, we find that a more aggressive suffix-stripping approach may produce better mean average precision (MAP). The differences relative to an IR scheme without stemming, or one based on only a light stemmer, are statistically significant. Comparing probabilistic, vector-space, and language models, we find that the Okapi model yields the best retrieval effectiveness. The resulting MAP is about 35% better than that of the classical tf-idf approach, particularly for very short requests. Finally, we demonstrate that applying an automatic decompounding procedure to both queries and documents significantly improves IR performance (+10%) compared to word-based indexing strategies.
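
Since the comparisons in the abstract are stated in terms of MAP and relative gains, a minimal sketch of how such figures are conventionally computed may help. It uses the standard textbook definition of mean average precision, not the paper's own evaluation code; the function names and the toy rankings and scores are hypothetical and chosen only for illustration.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision for one query: the mean of precision@k over
    every rank k at which a relevant document is retrieved."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    # Relevant documents never retrieved contribute zero precision.
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """MAP over a list of (ranked_ids, relevant_ids) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


def relative_improvement(new_score, baseline_score):
    """Relative gain in percent, the sense in which '35% better' is read here."""
    return (new_score - baseline_score) / baseline_score * 100


# Toy data (made up for illustration, not taken from the paper).
toy_runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),  # AP = (1/1 + 2/3) / 2
    (["d5", "d6"], {"d6"}),                    # AP = (1/2) / 1
]
toy_map = mean_average_precision(toy_runs)
```

Read this way, a percentage gain is relative to the baseline score rather than an absolute difference in MAP: a hypothetical baseline of 0.20 and a new score of 0.27 would correspond to a gain of about 35%.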