The Utrecht blend: basic ingredients for an XML retrieval system

Authors:
Roelof van Zwol;Frans Wiering;Virginia Dignum
Affiliations:
Centre for Content and Knowledge Engineering, Utrecht University, Utrecht, The Netherlands;Centre for Content and Knowledge Engineering, Utrecht University, Utrecht, The Netherlands;Centre for Content and Knowledge Engineering, Utrecht University, Utrecht, The Netherlands
Venue:
INEX'04 Proceedings of the Third international conference on Initiative for the Evaluation of XML Retrieval
Year:
2004

Abstract

Exploiting the structure of a document allows for more powerful information retrieval techniques. This article discusses a basic approach to the retrieval of XML document fragments. Starting from a vector-space model for text retrieval, we investigate three strategies that influence the retrieval performance of an XML-based IR system. The first extension uses a schema-based approach, which assumes that authors tag their text to emphasise particular pieces of content that are of importance. From the schema used by the document collection, the system can easily derive the children of mixed-content nodes. Our hypothesis is that those child nodes are more important than other nodes. The second approach is based on a horizontal fragmentation of the inverse document frequencies used by the vector-space model. The underlying assumption is that the distribution of terms is related to the semantic structure of the document. However, we observed that the IEEE collection is not a good example of semantic tagging. The third approach investigates how the performance of the retrieval system can be improved for the 'Content Only' task by using a set of a priori defined cut-off nodes that define 'logical' document fragments that are of interest to a user.
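
To make the second approach more concrete, the following is a minimal sketch of one possible reading of "horizontal fragmentation" of the inverse document frequencies: instead of one collection-wide idf per term, a separate idf is kept per element type. The abstract does not specify the fragmentation criterion, so fragmenting by tag name is an assumption, as are the function names `element_terms` and `fragmented_idf`, the whitespace tokenisation, and the toy documents.

```python
import math
import xml.etree.ElementTree as ET
from collections import defaultdict


def element_terms(elem):
    """Set of lower-cased, whitespace-separated tokens in the element's full text."""
    return set(" ".join(elem.itertext()).lower().split())


def fragmented_idf(docs):
    """Return {tag: {term: idf}}, with idf computed separately per element tag.

    Assumption: fragments are defined by tag name; the paper may use another criterion.
    """
    counts = defaultdict(int)                    # tag -> number of elements with that tag
    df = defaultdict(lambda: defaultdict(int))   # tag -> term -> elements containing term
    for xml_text in docs:
        for elem in ET.fromstring(xml_text).iter():
            counts[elem.tag] += 1
            for term in element_terms(elem):
                df[elem.tag][term] += 1
    return {
        tag: {term: math.log(counts[tag] / n) for term, n in terms.items()}
        for tag, terms in df.items()
    }


# Toy collection (hypothetical data, for illustration only).
docs = [
    "<article><title>xml retrieval</title><p>retrieval of xml fragments</p></article>",
    "<article><title>vector space</title><p>retrieval with a vector space model</p></article>",
]
idf = fragmented_idf(docs)
print(idf["title"]["retrieval"], idf["p"]["retrieval"])
```

In this toy collection the term "retrieval" occurs in both `p` elements but in only one `title` element, so it receives a lower weight in paragraphs than in titles. That is the kind of per-fragment difference the approach relies on when term distribution follows the semantic structure of the document.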