Fragment-based approximate retrieval in highly heterogeneous XML collections

Authors:
I. Sanz;M. Mesiti;G. Guerrini;R. Berlanga
Affiliations:
Department of Computer Science and Engineering, Universitat Jaume I, Avg. de Vicent Sos Baynat, s/n E-12071 Castelló, Spain;Dipartimento di Informatica e Comunicazione, Università degli Studi di Milano, Via Comelico, 39/41 I-20135 Milano, Italy;Dipartimento di Informatica e Scienze dell'Informazione, Università degli Studi di Genova, Via Dodecaneso, 35 I-16146 Genova, Italy;Department of Computer Science and Engineering, Universitat Jaume I, Avg. de Vicent Sos Baynat, s/n E-12071 Castelló, Spain
Venue:
Data & Knowledge Engineering
Year:
2008


Abstract

Because XML data for Internet applications is heterogeneous, exact matching of queries is often inadequate. The need arises to quickly identify subtrees of XML documents in a collection that are similar to a given pattern. Similarity involves both tags, which are not required to coincide, and structure, in which not all the relationships among nodes in the tree are strictly preserved. In this paper we present an efficient approach to identifying similar subtrees that relies on ad hoc indexing structures. The approach quickly detects, in a heterogeneous document collection, the minimal portions that exhibit some similarity with the pattern. These candidate portions are then ranked according to their actual similarity. The approach supports different notions of similarity, so it can be customized to different application domains. Three similarity measures are proposed and compared. The approach is experimentally validated, and the experimental results are discussed extensively.
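
The two phases summarized in the abstract (index-based detection of candidate portions, then ranking by similarity) can be illustrated with a minimal sketch. It is not the authors' algorithm or index layout: the tuple-based tree encoding, the `SYNONYMS` table, the function names, and the scoring formula (fraction of pattern tags matched) are all assumptions made for illustration, and each candidate is reduced to a pre-order position range inside one document rather than a true minimal subtree.

```python
from collections import defaultdict

# Hypothetical tag-similarity table (an assumption standing in for a thesaurus).
SYNONYMS = {
    "book": {"book", "volume"},
    "title": {"title", "name"},
    "author": {"author", "writer"},
}


def build_index(docs):
    """Map each tag to the (doc_id, pre-order position) of every node carrying it."""
    index = defaultdict(list)
    for doc_id, root in docs.items():
        position = 0
        stack = [root]
        while stack:
            tag, children = stack.pop()
            index[tag].append((doc_id, position))
            position += 1
            # Push children in reverse so they are visited left to right.
            stack.extend(reversed(children))
    return index


def find_candidates(index, pattern_tags):
    """Collect, per document, the positions of nodes whose tag is similar to a pattern tag."""
    hits = defaultdict(dict)  # doc_id -> {pattern_tag: [positions]}
    for pattern_tag in pattern_tags:
        for tag in SYNONYMS.get(pattern_tag, {pattern_tag}):
            for doc_id, position in index.get(tag, []):
                hits[doc_id].setdefault(pattern_tag, []).append(position)
    return hits


def rank(hits, pattern_tags):
    """Score each candidate by the fraction of pattern tags it matches, best first."""
    ranked = []
    for doc_id, found in hits.items():
        positions = [p for plist in found.values() for p in plist]
        region = (min(positions), max(positions))  # pre-order span of the matches
        score = len(found) / len(pattern_tags)
        ranked.append((score, doc_id, region))
    return sorted(ranked, key=lambda item: (-item[0], item[1]))


# Toy collection: each node is (tag, [children]).
DOCS = {
    "d1": ("book", [("title", []), ("writer", [])]),
    "d2": ("article", [("name", [])]),
    "d3": ("misc", []),
}
PATTERN = ["book", "title", "author"]

INDEX = build_index(DOCS)
HITS = find_candidates(INDEX, PATTERN)
RANKED = rank(HITS, PATTERN)
```

Only the index is consulted to find candidates, so documents with no similar tag (here `d3`) are never visited during ranking. A structural similarity measure would replace the simple coverage score in `rank`.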