The Importance of Length Normalization for XML Retrieval

Authors:
Jaap Kamps;Maarten De Rijke;Börkur Sigurbjörnsson
Affiliations:
Aff1 Aff2;Informatics Institute, University of Amsterdam, Amsterdam;Informatics Institute, University of Amsterdam, Amsterdam
Venue:
Information Retrieval
Year:
2005

Abstract

XML retrieval departs from standard document retrieval in that each individual XML element, ranging from italicized words or phrases to full-blown articles, is a retrievable unit. The distribution of XML element lengths is unlike what we usually observe in standard document collections, prompting us to revisit the issue of document length normalization. We perform a comparative analysis of arbitrary elements versus relevant elements, and show the importance of element length as a parameter for XML retrieval. Within the language modeling framework, we investigate a range of techniques that deal with length either directly or indirectly. We observe a length bias introduced by the amount of smoothing, and show the importance of an extreme length bias for XML retrieval. We also show that simply removing shorter elements from the index (by introducing a cut-off value) does not create an appropriate element length normalization. Even after restricting the minimal size of XML elements occurring in the index, the importance of an extreme explicit length bias remains.
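
The abstract names three length-related levers: the amount of smoothing, an explicit length bias, and an index cut-off on element size. The sketch below shows one way they could interact in a language-model scorer. It is an illustration under stated assumptions and not the authors' implementation: the abstract gives no formulas, so the linear-interpolation smoothing, the prior proportional to `length ** beta`, and every name here (`score_element`, `lam`, `beta`, `min_length`) are hypothetical.

```python
import math
from collections import Counter


def score_element(query_terms, element_tokens, collection_counts,
                  collection_size, lam=0.15, beta=1.0, min_length=0):
    """Log query likelihood of one XML element (illustrative sketch only).

    Assumptions, none of them taken from the abstract:
      - linear-interpolation smoothing with weight `lam` on the element model
      - an explicit length prior proportional to length ** beta
      - `min_length` plays the role of the index cut-off
    """
    length = len(element_tokens)
    if length == 0 or length < min_length:
        return None  # element would not be in the (hypothetical) index

    tf = Counter(element_tokens)
    log_score = beta * math.log(length)  # explicit length bias
    for term in query_terms:
        p_elem = tf[term] / length
        p_coll = collection_counts.get(term, 0) / collection_size
        p = lam * p_elem + (1.0 - lam) * p_coll
        if p == 0.0:
            return float("-inf")  # term unseen in element and collection
        log_score += math.log(p)
    return log_score


# Toy data: a one-word element (e.g. an italicized word) and a longer one.
query = ["xml", "retrieval"]
short_element = ["xml"]
long_element = ["xml", "retrieval", "xml", "a", "b", "c", "d", "e", "f", "g"]
counts = {"xml": 3, "retrieval": 2}

for beta in (0.0, 1.0):
    s = score_element(query, short_element, counts, 20, beta=beta)
    l = score_element(query, long_element, counts, 20, beta=beta)
    print(f"beta={beta}: short={s:.3f} long={l:.3f}")
```

On this toy data the one-word element outranks the longer one when `beta=0.0`, and the order flips when `beta=1.0`. Setting `min_length=2` removes the one-word element altogether but leaves the relative scores of the remaining elements unchanged, which is why a cut-off and a length prior are not interchangeable in this sketch.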