Approximate XML joins

Authors:
Sudipto Guha;H. V. Jagadish;Nick Koudas;Divesh Srivastava;Ting Yu
Affiliations:
University of Pennsylvania;University of Michigan;AT&T Labs-Research;AT&T Labs-Research;University of Illinois
Venue:
Proceedings of the 2002 ACM SIGMOD international conference on Management of data
Year:
2002

Abstract

XML is widely recognized as the data interchange standard for tomorrow, because of its ability to represent data from a wide variety of sources. Hence, XML is likely to be the format through which data from multiple sources is integrated.

In this paper we study the problem of integrating XML data sources through correlations realized as join operations. A challenging aspect of this operation is the XML document structure. Two documents might convey approximately or exactly the same information but may be quite different in structure. Consequently, approximate match in structure, in addition to content, has to be folded into the join operation. We quantify approximate match in structure and content using well-defined notions of distance. For structure, we propose computationally inexpensive lower and upper bounds for the tree edit distance metric between two trees. We then show how the tree edit distance, and other metrics that quantify distance between trees, can be incorporated in a join framework. We introduce the notion of reference sets to facilitate this operation. Intuitively, a reference set consists of data elements used to project the data space. We characterize what constitutes a good choice of a reference set and we propose sampling-based algorithms to identify them. This gives rise to a variety of algorithmic approaches for the problem, which we formulate and analyze. We demonstrate the practical utility of our solutions using large collections of real and synthetic XML data sets.
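
The abstract does not spell out how a reference set is used inside the join, so the following is only an illustrative sketch of one common reading, not the paper's algorithm. It assumes the distance is a metric, so that for any reference element r the triangle inequality gives |d(x, r) - d(y, r)| <= d(x, y) <= d(x, r) + d(y, r). To stay short it uses string edit distance as a stand-in for tree edit distance; the names `edit_distance`, `approximate_join`, `refs`, and `tau` are hypothetical and chosen for this example only.

```python
from itertools import product


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance. Stand-in metric, NOT the tree edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def approximate_join(left, right, refs, dist, tau):
    """Return (pairs, exact_calls): all (x, y) with dist(x, y) <= tau.

    Assumes dist is a metric and items are hashable and distinct.
    Each item is first mapped to its vector of distances to refs.
    """
    lvec = {x: [dist(x, r) for r in refs] for x in left}
    rvec = {y: [dist(y, r) for r in refs] for y in right}
    pairs, exact_calls = [], 0
    for x, y in product(left, right):
        lower = max(abs(a - b) for a, b in zip(lvec[x], rvec[y]))
        upper = min(a + b for a, b in zip(lvec[x], rvec[y]))
        if lower > tau:
            continue                  # cannot match: pruned without dist(x, y)
        if upper <= tau:
            pairs.append((x, y))      # must match: accepted without dist(x, y)
            continue
        exact_calls += 1              # bounds are inconclusive: compute exactly
        if dist(x, y) <= tau:
            pairs.append((x, y))
    return pairs, exact_calls


left = ["abc", "xyz"]
right = ["abd", "xyzz", "qqqqqq"]
pairs, exact_calls = approximate_join(left, right, ["abc"], edit_distance, tau=1)
print(pairs, exact_calls)
```

In this toy run only one of the six candidate pairs needs an exact distance computation; the rest are settled by the bounds. How much is saved in practice depends on which elements are chosen as references, which is the question the abstract's sampling-based selection addresses.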