Word normalization and decompounding in mono- and bilingual IR

Authors:
Eija Airio
Affiliations:
Department of Information Studies, University of Tampere, Finland 33014
Venue:
Information Retrieval
Year:
2006

Abstract

This study examines the impact of decompounding and of two word normalization methods, stemming and lemmatization, on monolingual and bilingual retrieval. The monolingual runs cover English, Finnish, German and Swedish. In the bilingual runs, the source language is English and the target languages are Finnish, German and Swedish. In the monolingual runs, retrieval in a lemmatized compound index performs almost as well as retrieval in a decompounded index. The bilingual runs show a difference: retrieval in a lemmatized decompounded index performs better than retrieval in a lemmatized compound index. Indexes without decompounding perform worse in bilingual retrieval because the source and target languages differ in how they express multiword concepts: English uses phrases, whereas Finnish, German and Swedish use compounds. No notable performance differences were found between stemming and lemmatization.
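The phrase-versus-compound mismatch described in the abstract can be illustrated with a small sketch. This is not the method used in the study; it is a toy example under stated assumptions: a hand-made two-word German lexicon, a simple longest-prefix splitter, and the hypothetical helper names `decompound` and `index_terms`. It assumes the English query phrase "world war" has already been translated word by word into "welt" and "krieg", and it ignores linking elements and inflection.

```python
def decompound(word, lexicon, min_len=3):
    """Split `word` into lexicon components; return [word] if no full split exists.

    Toy splitter for illustration only (assumption: no linking elements,
    no inflection, lexicon holds base forms in lower case).
    """
    word = word.lower()

    def split(rest):
        # An empty remainder means the whole word has been covered.
        if not rest:
            return []
        # Try longer prefixes first so long known words stay whole.
        for end in range(len(rest), min_len - 1, -1):
            head = rest[:end]
            if head in lexicon:
                tail = split(rest[end:])
                if tail is not None:
                    return [head] + tail
        return None

    parts = split(word)
    return parts if parts else [word]


def index_terms(tokens, lexicon, split_compounds):
    """Build a set of index terms, optionally splitting compounds."""
    terms = set()
    for tok in tokens:
        tok = tok.lower()
        if split_compounds:
            terms.update(decompound(tok, lexicon))
        else:
            terms.add(tok)
    return terms


# Hypothetical toy data: one German document term and a translated query.
LEXICON = {"welt", "krieg"}
doc = ["Weltkrieg"]
query = ["welt", "krieg"]  # word-by-word translation of English "world war"

compound_index = index_terms(doc, LEXICON, split_compounds=False)
decompounded_index = index_terms(doc, LEXICON, split_compounds=True)

matches_compound = [t for t in query if t in compound_index]
matches_decompounded = [t for t in query if t in decompounded_index]
```

In this sketch the compound index holds only the single term "weltkrieg", so neither translated query word matches it, while the decompounded index holds both components and matches both. A monolingual query that already contains the compound would match the compound index directly, which is consistent with the smaller monolingual difference reported in the abstract.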