A gossip-based approach for Internet-scale cardinality estimation of XPath queries over distributed semistructured data

Authors:
Vasil Slavov;Praveen Rao
Affiliations:
University of Missouri-Kansas City, Kansas City, USA;University of Missouri-Kansas City, Kansas City, USA
Venue:
The VLDB Journal — The International Journal on Very Large Data Bases
Year:
2014

Abstract

In this paper, we address the problem of cardinality estimation of XPath queries over XML data stored in a distributed, Internet-scale environment, such as a large-scale data-sharing system designed to foster innovations in biomedical and health informatics. Cardinality estimates for XPath expressions are useful in XQuery optimization, in designing IR-style relevance ranking schemes, and in statistical hypothesis testing. We present a novel gossip algorithm called XGossip, which, given an XPath query, estimates the number of XML documents in the network that contain a match for the query. XGossip is designed to be scalable, decentralized, and robust to failures, all of which are desirable properties in a large-scale distributed system. XGossip employs a novel divide-and-conquer strategy for load balancing and for reducing bandwidth consumption. We conduct a theoretical analysis of XGossip in terms of the accuracy of cardinality estimation, message complexity, and bandwidth consumption. We present a comprehensive performance evaluation of XGossip on Amazon EC2 using a heterogeneous collection of XML documents.
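
To give a feel for how a gossip protocol can estimate a network-wide count without a coordinator, the sketch below simulates the generic push-sum technique. It is not XGossip itself: the abstract does not specify XGossip's internals, and its divide-and-conquer strategy is not modeled here. The function name `push_sum_count`, the per-peer `local_counts` (standing in for the number of local documents that match a query), and the synchronous-round simulation are all assumptions made for illustration.

```python
import random


def push_sum_count(local_counts, rounds=60, seed=0):
    """Simulate push-sum gossip; return each peer's estimate of the total.

    local_counts[i] is a hypothetical number of matching documents held
    by peer i. This is a generic illustration, not the XGossip algorithm.
    """
    rng = random.Random(seed)
    n = len(local_counts)
    # Each peer keeps a (sum, weight) pair. Only one peer starts with
    # weight 1, so sum/weight converges to the total, not the average.
    sums = [float(c) for c in local_counts]
    weights = [1.0 if i == 0 else 0.0 for i in range(n)]

    for _ in range(rounds):
        new_sums = [0.0] * n
        new_weights = [0.0] * n
        for i in range(n):
            # Keep half of the pair, push the other half to a random peer.
            target = rng.randrange(n)
            half_sum, half_weight = sums[i] / 2, weights[i] / 2
            new_sums[i] += half_sum
            new_weights[i] += half_weight
            new_sums[target] += half_sum
            new_weights[target] += half_weight
        sums, weights = new_sums, new_weights

    # A peer that has not yet received any weight has no estimate.
    return [s / w if w > 0 else None for s, w in zip(sums, weights)]


if __name__ == "__main__":
    counts = [3, 0, 5, 2, 7, 1, 0, 4, 6, 2]
    estimates = push_sum_count(counts, rounds=80, seed=1)
    print("true total:", sum(counts))
    print("peer 0 estimate:", estimates[0])
```

Each round conserves the total sum and the total weight across peers, which is why every peer's ratio drifts toward the true total as the halves mix. Every peer sends one small message per round, so no single node carries the aggregation load.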