Integrating XML data sources using approximate joins
ACM Transactions on Database Systems (TODS)
XML is widely recognized as the data interchange standard for tomorrow because of its ability to represent data from a wide variety of sources. Hence, XML is likely to be the format through which data from multiple sources is integrated.

In this paper we study the problem of integrating XML data sources through correlations realized as join operations. A challenging aspect of this operation is the XML document structure: two documents might convey approximately or exactly the same information yet differ considerably in structure. Consequently, approximate match in structure, in addition to content, has to be folded into the join operation. We quantify approximate match in structure and content using well-defined notions of distance. For structure, we propose computationally inexpensive lower and upper bounds for the tree edit distance metric between two trees. We then show how the tree edit distance, and other metrics that quantify distance between trees, can be incorporated in a join framework. We introduce the notion of reference sets to facilitate this operation. Intuitively, a reference set consists of data elements used to project the data space. We characterize what constitutes a good choice of reference set, and we propose sampling-based algorithms to identify such sets. This gives rise to a variety of algorithmic approaches to the problem, which we formulate and analyze. We demonstrate the practical utility of our solutions using large collections of real and synthetic XML data sets.
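
To make the idea of projecting the data space onto a reference set more concrete, the following is a minimal sketch of one generic way such a set can be used in a metric space: triangle-inequality filtering. The abstract does not spell out the paper's algorithms, so this should not be read as the authors' method. Everything here is an assumption made for illustration: the function names (`edit_distance`, `project`, `approximate_join`), the threshold `tau`, and the use of strings with string edit distance as a stand-in for XML trees with tree edit distance.

```python
from itertools import product


def edit_distance(a, b):
    """Levenshtein distance between two strings.

    Stand-in metric (assumption): the abstract is about tree edit distance,
    but any distance obeying the triangle inequality fits this sketch.
    """
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute or match
        prev = curr
    return prev[-1]


def project(items, reference_set, dist):
    """Map each item to its vector of distances to the reference elements."""
    return [[dist(item, ref) for ref in reference_set] for item in items]


def approximate_join(left, right, reference_set, tau, dist=edit_distance):
    """Return pairs (x, y) with dist(x, y) <= tau, plus the number of exact
    distance computations that the reference-set bounds could not avoid."""
    left_proj = project(left, reference_set, dist)
    right_proj = project(right, reference_set, dist)
    result, exact_calls = [], 0
    for (i, x), (j, y) in product(enumerate(left), enumerate(right)):
        pairs = list(zip(left_proj[i], right_proj[j]))
        # Triangle inequality: |d(x, r) - d(y, r)| <= d(x, y) <= d(x, r) + d(y, r)
        lower = max((abs(a - b) for a, b in pairs), default=0)
        upper = min((a + b for a, b in pairs), default=float("inf"))
        if lower > tau:
            continue                  # pruned: cannot be within tau
        if upper <= tau:
            result.append((x, y))     # accepted without an exact computation
            continue
        exact_calls += 1              # bounds are inconclusive: compute exactly
        if dist(x, y) <= tau:
            result.append((x, y))
    return result, exact_calls


if __name__ == "__main__":
    left = ["abcd", "wxyz"]
    right = ["abce", "wxyy", "mnop"]
    matches, exact_calls = approximate_join(left, right, ["abcd"], tau=1)
    print(matches, exact_calls)
```

In this sketch, a reference element is useful when it yields tight bounds for many pairs, so that few exact distance computations remain. That is one informal reading of why the choice of reference set matters.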