Preparing heterogeneous XML for full-text search

  • Author: Miro Lehtonen
  • Affiliation: University of Helsinki, Finland
  • Venue: ACM Transactions on Information Systems (TOIS)
  • Year: 2006

Abstract

XML retrieval faces new challenges when applied to heterogeneous XML documents, where next to nothing about the document structure can be taken for granted. We have developed solutions that address some of these heterogeneity issues. Our fragment selection algorithm selectively divides a heterogeneous document collection into equi-sized fragments with full-text content. Content that is considered too data-oriented is not accepted. The algorithm needs no information about element names. In addition, three techniques for fragment expansion are presented, all of which yield a 13–17% average improvement in average precision. These techniques and algorithms are among the first steps toward document-type-independent indexing methods for the full text in heterogeneous XML collections.
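
To make the idea of name-independent fragment selection more concrete, the following is a minimal sketch in Python, not the algorithm from the paper. It assumes a top-down traversal that accepts an element once its text size falls within a given range, and it uses the ratio of text nodes to element nodes as a stand-in test for data-oriented content. The function names, the size bounds, and the `min_ratio` threshold are all hypothetical choices made for illustration.

```python
import xml.etree.ElementTree as ET


def text_size(elem):
    """Length of the whitespace-normalized text in the subtree of elem."""
    return len(" ".join("".join(elem.itertext()).split()))


def text_element_ratio(elem):
    """Ratio of non-empty text nodes to element nodes in the subtree.

    Assumption for this sketch: a low ratio signals data-oriented content.
    """
    texts = 0
    elements = 0
    for node in elem.iter():
        elements += 1
        if node.text and node.text.strip():
            texts += 1
        # The tail of the subtree root lies outside the subtree, so skip it.
        if node is not elem and node.tail and node.tail.strip():
            texts += 1
    return texts / elements


def select_fragments(elem, min_size, max_size, min_ratio=1.0):
    """Collect disjoint subtrees of suitable size that look like full text.

    Element names are never inspected; only sizes and node counts are used.
    """
    size = text_size(elem)
    if size < min_size:
        return []  # too small to be a useful fragment
    if size <= max_size and text_element_ratio(elem) >= min_ratio:
        return [elem]
    # Too large, or too data-oriented as a whole: try the children instead.
    fragments = []
    for child in elem:
        fragments.extend(select_fragments(child, min_size, max_size, min_ratio))
    return fragments
```

Under these assumptions, a call such as `select_fragments(root, min_size=50, max_size=300)` returns the elements chosen as fragments. Subtrees that consist mostly of short field-like values are skipped because they are either too small or have too few text nodes per element.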