Integrating XML data sources using approximate joins

Authors:
Sudipto Guha;H. V. Jagadish;Nick Koudas;Divesh Srivastava;Ting Yu
Affiliations:
University of Pennsylvania;University of Michigan;University of Toronto;AT&T Labs--Research;North Carolina State University
Venue:
ACM Transactions on Database Systems (TODS)
Year:
2006

Citing 30
Cited 10

Computational geometry: an introduction

Computational geometry: an introduction
The R*-tree: an efficient and robust access method for points and rectangles

SIGMOD '90 Proceedings of the 1990 ACM SIGMOD international conference on Management of data
Efficient processing of spatial joins using R-trees

SIGMOD '93 Proceedings of the 1993 ACM SIGMOD international conference on Management of data
Fast subsequence matching in time-series databases

SIGMOD '94 Proceedings of the 1994 ACM SIGMOD international conference on Management of data
BIRCH: an efficient data clustering method for very large databases

SIGMOD '96 Proceedings of the 1996 ACM SIGMOD international conference on Management of data
Change detection in hierarchically structured information

SIGMOD '96 Proceedings of the 1996 ACM SIGMOD international conference on Management of data
Meaningful change detection in structured data

SIGMOD '97 Proceedings of the 1997 ACM SIGMOD international conference on Management of data
Algorithms on strings, trees, and sequences: computer science and computational biology

Algorithms on strings, trees, and sequences: computer science and computational biology
Pattern matching algorithms

Pattern matching algorithms
Tree pattern matching

Pattern matching algorithms
CURE: an efficient clustering algorithm for large databases

SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
Multidimensional access methods

ACM Computing Surveys (CSUR)
A guided tour to approximate string matching

ACM Computing Surveys (CSUR)
Searching in metric spaces

ACM Computing Surveys (CSUR)
Accelerating XPath location steps

Proceedings of the 2002 ACM SIGMOD international conference on Management of data
Approximate XML joins

Proceedings of the 2002 ACM SIGMOD international conference on Management of data
Holistic twig joins: optimal XML pattern matching

Proceedings of the 2002 ACM SIGMOD international conference on Management of data
R-trees: a dynamic index structure for spatial searching

SIGMOD '84 Proceedings of the 1984 ACM SIGMOD international conference on Management of data
High-Dimensional Similarity Joins

ICDE '97 Proceedings of the Thirteenth International Conference on Data Engineering
High Dimensional Similarity Joins: Algorithms and Performance Evaluation

ICDE '98 Proceedings of the Fourteenth International Conference on Data Engineering
An Implementation and Performance Analysis of Spatial Data Access Methods

Proceedings of the Fifth International Conference on Data Engineering
Efficient Computation of Spatial Joins

Proceedings of the Ninth International Conference on Data Engineering
Spatial Joins Using R-trees: Breadth-First Traversal with Global Optimizations

VLDB '97 Proceedings of the 23rd International Conference on Very Large Data Bases
M-tree: An Efficient Access Method for Similarity Search in Metric Spaces

VLDB '97 Proceedings of the 23rd International Conference on Very Large Data Bases
On Optimal Node Splitting for R-trees

VLDB '98 Proceedings of the 24th International Conference on Very Large Data Bases
Approximate String Joins in a Database (Almost) for Free

Proceedings of the 27th International Conference on Very Large Data Bases
Change-Centric Management of Versions in an XML Warehouse

Proceedings of the 27th International Conference on Very Large Data Bases
Hilbert R-tree: An Improved R-tree using Fractals

VLDB '94 Proceedings of the 20th International Conference on Very Large Data Bases
Detecting Changes in XML Documents

ICDE '02 Proceedings of the 18th International Conference on Data Engineering
Structural Joins: A Primitive for Efficient XML Query Pattern Matching

ICDE '02 Proceedings of the 18th International Conference on Data Engineering

DB&IR: both sides now

Proceedings of the 2007 ACM SIGMOD international conference on Management of data
Matching XML documents in highly dynamic applications

Proceedings of the eighth ACM symposium on Document engineering
Evaluating Performance and Quality of XML-Based Similarity Joins

ADBIS '08 Proceedings of the 12th East European conference on Advances in Databases and Information Systems
Retrieving XML data from heterogeneous sources through vague querying

ACM Transactions on Internet Technology (TOIT)
A cluster-based approach to XML similarity joins

IDEAS '09 Proceedings of the 2009 International Database Engineering & Applications Symposium
XML: some papers in a haystack

ACM SIGMOD Record
Generalizing prefix filtering to improve set similarity joins

Information Systems
Ingredients for accurate, fast, and robust XML similarity joins

DEXA'11 Proceedings of the 22nd international conference on Database and expert systems applications - Volume Part II
Similarity join on XML based on k-generation set distance

WAIM'11 Proceedings of the 2011 international conference on Web-Age Information Management
Leveraging the storage layer to support XML similarity joins in XDBMSs

ADBIS'12 Proceedings of the 16th East European conference on Advances in Databases and Information Systems

Abstract

XML is widely recognized as the data interchange standard of tomorrow because of its ability to represent data from a variety of sources. Hence, XML is likely to be the format through which data from multiple sources is integrated. In this article, we study the problem of integrating XML data sources through correlations realized as join operations. A challenging aspect of this operation is the XML document structure. Two documents might convey approximately or exactly the same information but may be quite different in structure. Consequently, an approximate match in structure, in addition to content, has to be folded into the join operation. We quantify an approximate match in structure and content for pairs of XML documents using well-defined notions of distance. We show how notions of distance that have metric properties can be incorporated in a framework for joins between XML data sources and introduce the idea of reference sets to facilitate this operation. Intuitively, a reference set consists of data elements used to project the data space. We characterize what constitutes a good choice of a reference set, and we propose sampling-based algorithms to identify such sets. We then instantiate our join framework using the tree edit distance between a pair of trees. We next turn our attention to utilizing well-known index structures to improve the performance of approximate XML join operations. We present a methodology enabling adaptation of index structures for this problem, and we instantiate it in terms of the R-tree. We demonstrate the practical utility of our solutions using large collections of real and synthetic XML data sets, varying parameters of interest, and highlighting the performance benefits of our approach.
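
The role a reference set can play in a metric-space join may be illustrated with a minimal sketch. This is not the article's algorithm: it assumes a plain string edit distance as a stand-in for the tree edit distance, and the names `edit_distance`, `project`, `approximate_join`, `refs`, and `tau` are hypothetical. Each item is projected onto its distances to the reference elements, and the triangle inequality then gives a lower bound, |d(x, r) - d(y, r)| <= d(x, y), that lets a pair be discarded without computing the exact distance.

```python
from itertools import product


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance. Assumption: a simple metric used here
    in place of the tree edit distance named in the abstract."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete from a
                            curr[j - 1] + 1,            # insert into a
                            prev[j - 1] + (ca != cb)))  # substitute
        prev = curr
    return prev[-1]


def project(items, refs, dist):
    """Map each item to its vector of distances to the reference elements."""
    return [tuple(dist(x, r) for r in refs) for x in items]


def approximate_join(left, right, refs, tau, dist=edit_distance):
    """Return pairs (x, y) with dist(x, y) <= tau, plus the number of
    exact distance computations that survived the pruning step."""
    left_vecs = project(left, refs, dist)
    right_vecs = project(right, refs, dist)
    results, exact_calls = [], 0
    for (x, vx), (y, vy) in product(zip(left, left_vecs),
                                    zip(right, right_vecs)):
        # Triangle inequality: every |d(x,r) - d(y,r)| is a lower bound
        # on d(x,y), so the largest one is the tightest bound available.
        lower_bound = max((abs(a - b) for a, b in zip(vx, vy)), default=0)
        if lower_bound > tau:
            continue  # pruned: the pair cannot be within tau
        exact_calls += 1
        if dist(x, y) <= tau:
            results.append((x, y))
    return results, exact_calls


if __name__ == "__main__":
    # Hypothetical flattened element paths standing in for XML documents.
    left = ["book/title/author", "article/title"]
    right = ["book/title/authors", "paper/abstract/keywords/body"]
    pairs, calls = approximate_join(left, right, refs=["book/title"], tau=2)
    print(pairs, calls)
```

Because the bound never exceeds the true distance, pruning cannot drop a qualifying pair; how many exact computations it saves depends on how well the chosen reference elements separate the data, which is the question the abstract raises about what makes a good reference set.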