XML stream processing using tree-edit distance embeddings

Authors:
Minos Garofalakis;Amit Kumar
Affiliations:
Bell Labs, Lucent Technologies, Murray Hill, NJ;Indian Institute of Technology, New Delhi, India
Venue:
ACM Transactions on Database Systems (TODS) - Special Issue: SIGMOD/PODS 2003
Year:
2005

Citing 46
Cited 16

Simple fast algorithms for the editing distance between trees and related problems

SIAM Journal on Computing
Approximate string-matching with q-grams and maximal matches

Theoretical Computer Science - Selected papers of the Combinatorial Pattern Matching School
Randomized algorithms

Randomized algorithms
The space complexity of approximating the frequency moments

STOC '96 Proceedings of the twenty-eighth annual ACM symposium on Theory of computing
Pattern matching algorithms

Pattern matching algorithms
Tracking join and self-join sizes in limited storage

PODS '99 Proceedings of the eighteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Approximate computation of multidimensional aggregates of sparse data using wavelets

SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data
Join synopses for approximate query answering

SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data
Space-efficient online computation of quantile summaries

SIGMOD '01 Proceedings of the 2001 ACM SIGMOD international conference on Management of data
Fast, small-space algorithms for approximate histogram maintenance

STOC '02 Proceedings of the thiry-fourth annual ACM symposium on Theory of computing
Models and issues in data stream systems

Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Characterizing memory requirements for queries over continuous data streams

Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
The string edit distance matching problem with moves

SODA '02 Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms
Processing complex aggregate queries over data streams

Proceedings of the 2002 ACM SIGMOD international conference on Management of data
Approximate XML joins

Proceedings of the 2002 ACM SIGMOD international conference on Management of data
Statistical synopses for graph-structured XML databases

Proceedings of the 2002 ACM SIGMOD international conference on Management of data
Dynamic multidimensional histograms

Proceedings of the 2002 ACM SIGMOD international conference on Management of data
On Efficient Matching of Streaming XML Documents and Queries

EDBT '02 Proceedings of the 8th International Conference on Extending Database Technology: Advances in Database Technology
Histogram-Based Approximation of Set-Valued Query-Answers

VLDB '99 Proceedings of the 25th International Conference on Very Large Data Bases
Identifying Representative Trends in Massive Time Series Data Sets Using Sketches

VLDB '00 Proceedings of the 26th International Conference on Very Large Data Bases
Efficient Filtering of XML Documents for Selective Dissemination of Information

VLDB '00 Proceedings of the 26th International Conference on Very Large Data Bases
Approximate Query Processing Using Wavelets

VLDB '00 Proceedings of the 26th International Conference on Very Large Data Bases
Surfing Wavelets on Streams: One-Pass Summaries for Approximate Aggregate Queries

Proceedings of the 27th International Conference on Very Large Data Bases
Approximate String Joins in a Database (Almost) for Free

Proceedings of the 27th International Conference on Very Large Data Bases
Approximate Query Processing: Taming the TeraBytes

Proceedings of the 27th International Conference on Very Large Data Bases
Using Probabilistic Information in Data Integration

VLDB '97 Proceedings of the 23rd International Conference on Very Large Data Bases
Finding Frequent Items in Data Streams

ICALP '02 Proceedings of the 29th International Colloquium on Automata, Languages and Programming
Counting Distinct Elements in a Data Stream

RANDOM '02 Proceedings of the 6th International Workshop on Randomization and Approximation Techniques
Edit Distance with Move Operations

CPM '02 Proceedings of the 13th Annual Symposium on Combinatorial Pattern Matching
Dimension Reduction in the \ell _1 Norm

FOCS '02 Proceedings of the 43rd Symposium on Foundations of Computer Science
Correlating XML data streams using tree-edit distance embeddings

Proceedings of the twenty-second ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
An Approximate L1-Difference Algorithm for Massive Data Streams

FOCS '99 Proceedings of the 40th Annual Symposium on Foundations of Computer Science
Stable distributions, pseudorandom generators, embeddings and data stream computation

FOCS '00 Proceedings of the 41st Annual Symposium on Foundations of Computer Science
Exploratory Data Mining and Data Cleaning

Exploratory Data Mining and Data Cleaning
Stream processing of XPath queries with predicates

Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Algorithmic Applications of Low-Distortion Geometric Embeddings

FOCS '01 Proceedings of the 42nd IEEE symposium on Foundations of Computer Science
Fast Mining of Massive Tabular Data via Approximate Distance Computations

ICDE '02 Proceedings of the 18th International Conference on Data Engineering
Efficient Filtering of XML Documents with XPath Expressions

ICDE '02 Proceedings of the 18th International Conference on Data Engineering
Path sharing and predicate evaluation for high-performance XML filtering

ACM Transactions on Database Systems (TODS)
Approximate XML query answers

SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Efficient randomized pattern-matching algorithms

IBM Journal of Research and Development - Mathematics and computing
Comparing data streams using Hamming norms (how to zero in)

VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases
Approximate frequency counts over data streams

VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases
How to summarize the universe: dynamic maintenance of quantiles

VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases
XMark: a benchmark for XML data management

VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases
Query processing for high-volume XML message brokering

VLDB '03 Proceedings of the 29th international conference on Very large data bases - Volume 29

Approximate matching of hierarchical data using pq-grams

VLDB '05 Proceedings of the 31st international conference on Very large data bases
An incrementally maintainable index for approximate lookups in hierarchical data

VLDB '06 Proceedings of the 32nd international conference on Very large data bases
A relation between edit distance for ordered trees and edit distance for Euler strings

Information Processing Letters
The power of two min-hashes for similarity search among hierarchical data objects

Proceedings of the twenty-seventh ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Approximating Tree Edit Distance through String Edit Distance for Binary Tree Codes

SOFSEM '09 Proceedings of the 35th Conference on Current Trends in Theory and Practice of Computer Science
A Tree Distance Function Based on Multi-sets

New Frontiers in Applied Data Mining
Constant Factor Approximation of Edit Distance of Bounded Height Unordered Trees

SPIRE '09 Proceedings of the 16th International Symposium on String Processing and Information Retrieval
The pq-gram distance between ordered labeled trees

ACM Transactions on Database Systems (TODS)
Analysis of tree edit distance on XML data

CIIT '07 The Sixth IASTED International Conference on Communications, Internet, and Information Technology
Approximating Tree Edit Distance through String Edit Distance for Binary Tree Codes

Fundamenta Informaticae
Approximate joins for XML using g-string

XSym'10 Proceedings of the 7th international XML database conference on Database and XML technologies
Approximate data exchange

ICDT'07 Proceedings of the 11th international conference on Database Theory
RTED: a robust algorithm for the tree edit distance

Proceedings of the VLDB Endowment
Approximating tree edit distance through string edit distance

ISAAC'06 Proceedings of the 17th international conference on Algorithms and Computation
Test Pair Selection for Test Case Prioritization in Regression Testing for WS-BPEL Programs

International Journal of Web Services Research
A survey on tree edit distance lower bound estimation techniques for similarity join on XML data

ACM SIGMOD Record

Quantified Score

Hi-index	0.00

Visualization

Abstract

We propose the first known solution to the problem of correlating, in small space, continuous streams of XML data through approximate (structure and content) matching, as defined by a general tree-edit distance metric. The key element of our solution is a novel algorithm for obliviously embedding tree-edit distance metrics into an L1 vector space while guaranteeing a (worst-case) upper bound of O(log2n log*n) on the distance distortion between any data trees with at most n nodes. We demonstrate how our embedding algorithm can be applied in conjunction with known random sketching techniques to (1) build a compact synopsis of a massive, streaming XML data tree that can be used as a concise surrogate for the full tree in approximate tree-edit distance computations; and (2) approximate the result of tree-edit-distance similarity joins over continuous XML document streams. Experimental results from an empirical study with both synthetic and real-life XML data trees validate our approach, demonstrating that the average-case behavior of our embedding techniques is much better than what would be predicted from our theoretical worst-case distortion bounds. To the best of our knowledge, these are the first algorithmic results on low-distortion embeddings for tree-edit distance metrics, and on correlating (e.g., through similarity joins) XML data in the streaming model.