Length normalization in XML retrieval

Authors:
Jaap Kamps;Maarten de Rijke;Börkur Sigurbjörnsson
Affiliations:
University of Amsterdam, Amsterdam, The Netherlands;University of Amsterdam, Amsterdam, The Netherlands;University of Amsterdam, Amsterdam, The Netherlands
Venue:
Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
Year:
2004

Abstract

XML retrieval departs from standard document retrieval in that each individual XML element, ranging from italicized words or phrases to full-blown articles, is a potentially retrievable unit. The distribution of XML element lengths is unlike what we usually observe in standard document collections, prompting us to revisit the issue of document length normalization. We perform a comparative analysis of arbitrary elements versus relevant elements, and show the importance of length as a parameter for XML retrieval. Within the language modeling framework, we investigate a range of techniques that deal with length either directly or indirectly. We observe a length bias introduced by the amount of smoothing, and show the importance of extreme length priors for XML retrieval. We also show that simply removing shorter elements from the index (by introducing a cut-off value) does not create an appropriate document length normalization. Even after increasing the minimal size of XML elements occurring in the index, the importance of an extreme length bias remains.
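
To make the interplay of smoothing and length priors concrete, the following is a minimal sketch of query-likelihood scoring for XML elements. The abstract does not give the exact formulas, so everything here is an assumption for illustration: linear interpolation (Jelinek-Mercer style) smoothing with weight `lam` on the element model, and a length prior proportional to element length raised to `prior_exponent`. The function and parameter names are hypothetical.

```python
import math
from collections import Counter


def score_element(query, element_tokens, collection_counts, collection_size,
                  lam=0.15, prior_exponent=1.0):
    """Log-score of one element for a query (illustrative sketch only).

    Assumed model, not taken from the paper:
        log P(e) + sum over query terms of
        log(lam * P(t|e) + (1 - lam) * P(t|C)),
    where P(e) is proportional to len(e) ** prior_exponent.
    """
    length = len(element_tokens)
    if length == 0:
        return float("-inf")

    term_freq = Counter(element_tokens)
    # Length prior in log space; prior_exponent = 0 means no length prior.
    log_prior = prior_exponent * math.log(length)

    log_likelihood = 0.0
    for term in query:
        p_element = term_freq[term] / length
        p_collection = collection_counts.get(term, 0) / collection_size
        p = lam * p_element + (1.0 - lam) * p_collection
        if p == 0.0:
            # Term unseen in both the element and the collection.
            return float("-inf")
        log_likelihood += math.log(p)

    return log_prior + log_likelihood
```

Under these assumptions, a one-word element that matches the query outranks a longer element when `prior_exponent` is 0, while a larger exponent shifts the ranking toward the longer element. This is the kind of length bias the abstract describes, though the actual priors and smoothing settings studied in the paper may differ.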