Preparing heterogeneous XML for full-text search

Authors: Miro Lehtonen
Affiliations: University of Helsinki, Finland
Venue: ACM Transactions on Information Systems (TOIS)
Year: 2006

Abstract

XML retrieval is facing new challenges when applied to heterogeneous XML documents, where next to nothing about the document structure can be taken for granted. We have developed solutions where some of the heterogeneity issues are addressed. Our fragment selection algorithm selectively divides a heterogeneous document collection into equi-sized fragments with full-text content. If the content is considered too data-oriented, it is not accepted. The algorithm needs no information about element names. In addition, three techniques for fragment expansion are presented, all of which yield a 13–17% average improvement in average precision. These techniques and algorithms are among the first steps in developing document-type-independent indexing methods for the full text in heterogeneous XML collections.
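
The fragment selection idea in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: the abstract gives no size bounds or test for data-oriented content, so the names `select_fragments`, `min_chars`, `max_chars`, and `min_ratio`, the text-to-element ratio used as a full-text indicator, and the choice to drop a rejected subtree without descending further are all assumptions made for illustration. The one property taken directly from the abstract is that element names are never consulted.

```python
import xml.etree.ElementTree as ET


def text_length(elem):
    """Number of characters of text under elem, ignoring surrounding whitespace."""
    return sum(len(t.strip()) for t in elem.itertext())


def text_element_ratio(elem):
    """Assumed full-text indicator: non-empty text nodes per element in the subtree."""
    elements = sum(1 for _ in elem.iter())  # includes elem itself
    texts = sum(1 for t in elem.itertext() if t.strip())
    return texts / elements


def select_fragments(elem, min_chars, max_chars, min_ratio=1.0):
    """Top-down selection of similarly sized, text-rich subtrees.

    Element names (elem.tag) are never inspected; only size and the
    assumed text-to-element ratio decide whether a subtree is kept.
    """
    size = text_length(elem)
    if size < min_chars:
        return []  # too small to stand alone as a fragment
    if size <= max_chars:
        # Size fits: keep only if the content looks like full text.
        # Assumption: a rejected subtree is dropped without descending.
        return [elem] if text_element_ratio(elem) >= min_ratio else []
    # Too large: look for suitably sized fragments among the children.
    fragments = []
    for child in elem:
        fragments.extend(select_fragments(child, min_chars, max_chars, min_ratio))
    return fragments


# Hypothetical document: one prose-like subtree and one data-like subtree.
doc = ET.fromstring(
    "<r>"
    "<x><p>aaaaaaaaaaaaaaaaaaaa <i>bbbbb</i> cccccccccccccccccccc</p></x>"
    "<y><k>1111111111</k><k/><k/><k>2222222222</k><k/></y>"
    "</r>"
)
fragments = select_fragments(doc, min_chars=15, max_chars=50)
print([text_length(f) for f in fragments])
```

In this example the root is too large and is split; the first child is accepted because it fits the size range and has one text node per element, while the second child fits the size range but is rejected as data-oriented because most of its elements carry no text.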