Probabilistic counting algorithms for data base applications
Journal of Computer and System Sciences
Practical selectivity estimation through adaptive sampling
SIGMOD '90 Proceedings of the 1990 ACM SIGMOD international conference on Management of data
Query size estimation by adaptive sampling
Selected papers of the 9th annual ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems
Bifocal sampling for skew-resistant join size estimation
SIGMOD '96 Proceedings of the 1996 ACM SIGMOD international conference on Management of data
The space complexity of approximating the frequency moments
STOC '96 Proceedings of the twenty-eighth annual ACM symposium on Theory of computing
Online prediction algorithms for databases and operating systems
Online prediction algorithms for databases and operating systems
STOC '97 Proceedings of the twenty-ninth annual ACM symposium on Theory of computing
Random sampling for histogram construction: how much is enough?
SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
Tracking join and self-join sizes in limited storage
PODS '99 Proceedings of the eighteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Chord: A scalable peer-to-peer lookup service for internet applications
Proceedings of the 2001 conference on Applications, technologies, architectures, and protocols for computer communications
A scalable content-addressable network
Proceedings of the 2001 conference on Applications, technologies, architectures, and protocols for computer communications
SODA '03 Proceedings of the fourteenth annual ACM-SIAM symposium on Discrete algorithms
Comparing Hybrid Peer-to-Peer Systems
Proceedings of the 27th International Conference on Very Large Data Bases
Kademlia: A Peer-to-Peer Information System Based on the XOR Metric
IPTPS '01 Revised Papers from the First International Workshop on Peer-to-Peer Systems
Complex Queries in DHT-based Peer-to-Peer Networks
IPTPS '01 Revised Papers from the First International Workshop on Peer-to-Peer Systems
Pastry: Scalable, Decentralized Object Location, and Routing for Large-Scale Peer-to-Peer Systems
Middleware '01 Proceedings of the IFIP/ACM International Conference on Distributed Systems Platforms Heidelberg
Counting Distinct Elements in a Data Stream
RANDOM '02 Proceedings of the 6th International Workshop on Randomization and Approximation Techniques
ACM Transactions on Computer Systems (TOCS)
Anthill: A Framework for the Development of Agent-Based Peer-to-Peer Systems
ICDCS '02 Proceedings of the 22 nd International Conference on Distributed Computing Systems (ICDCS'02)
The impact of DHT routing geometry on resilience and proximity
Proceedings of the 2003 conference on Applications, technologies, architectures, and protocols for computer communications
Routing networks for distributed hash tables
Proceedings of the twenty-second annual symposium on Principles of distributed computing
Processing set expressions over continuous update streams
Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Tapestry: An Infrastructure for Fault-tolerant Wide-area Location and
Tapestry: An Infrastructure for Fault-tolerant Wide-area Location and
Gossip-Based Computation of Aggregate Information
FOCS '03 Proceedings of the 44th Annual IEEE Symposium on Foundations of Computer Science
Epidemic-Style Proactive Aggregation in Large Overlay Networks
ICDCS '04 Proceedings of the 24th International Conference on Distributed Computing Systems (ICDCS'04)
Approximate Aggregation Techniques for Sensor Databases
ICDE '04 Proceedings of the 20th International Conference on Data Engineering
The price of validity in dynamic networks
SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Mercury: supporting scalable multi-attribute range queries
Proceedings of the 2004 conference on Applications, technologies, architectures, and protocols for computer communications
A scalable distributed information management system
Proceedings of the 2004 conference on Applications, technologies, architectures, and protocols for computer communications
Sketching streams through the net: distributed approximate query tracking
VLDB '05 Proceedings of the 31st international conference on Very large data bases
Indexing data-oriented overlay networks
VLDB '05 Proceedings of the 31st international conference on Very large data bases
Counting at Large: Efficient Cardinality Estimation in Internet-Scale Data Networks
ICDE '06 Proceedings of the 22nd International Conference on Data Engineering
Discovering and exploiting keyword and attribute-value co-occurrences to improve P2P routing indices
CIKM '06 Proceedings of the 15th ACM international conference on Information and knowledge management
ATEC '04 Proceedings of the annual conference on USENIX Annual Technical Conference
On synopses for distinct-value estimation under multiset operations
Proceedings of the 2007 ACM SIGMOD international conference on Management of data
Designing a DHT for low latency and high throughput
NSDI'04 Proceedings of the 1st conference on Symposium on Networked Systems Design and Implementation - Volume 1
SkipNet: a scalable overlay network with practical locality properties
USITS'03 Proceedings of the 4th conference on USENIX Symposium on Internet Technologies and Systems - Volume 4
Querying the internet with PIER
VLDB '03 Proceedings of the 29th international conference on Very large data bases - Volume 29
Messor: load-balancing through a swarm of autonomous agents
AP2PC'02 Proceedings of the 1st international conference on Agents and peer-to-peer computing
Replication, load balancing and efficient range query processing in DHTs
EDBT'06 Proceedings of the 10th international conference on Advances in Database Technology
AESOP: altruism-endowed self-organizing peers
DBISP2P'04 Proceedings of the Second international conference on Databases, Information Systems, and Peer-to-Peer Computing
SPARQL query optimization on top of DHTs
ISWC'10 Proceedings of the 9th international semantic web conference on The semantic web - Volume Part I
Peer-to-peer web search: euphoria, achievements, disillusionment, and future opportunities
From active data management to event-based systems and more
A sensor data collection method for rounding sink
Proceedings of the 4th International Conference on Uniquitous Information Management and Communication
Gossip-based density estimation in dynamic heterogeneous wireless sensor networks
International Journal of Autonomous and Adaptive Communications Systems
Hi-index | 0.00 |
Counting items in a distributed system, and estimating the cardinality of multisets in particular, is important for a large variety of applications and a fundamental building block for emerging Internet-scale information systems. Examples of such applications range from optimizing query access plans in peer-to-peer data sharing, to computing the significance (rank/score) of data items in distributed information retrieval. The general formal problem addressed in this article is computing the network-wide distinct number of items with some property (e.g., distinct files with file name containing “spiderman”) where each node in the network holds an arbitrary subset, possibly overlapping the subsets of other nodes. The key requirements that a viable approach must satisfy are: (1) scalability towards very large network size, (2) efficiency regarding messaging overhead, (3) load balance of storage and access, (4) accuracy of the cardinality estimation, and (5) simplicity and easy integration in applications. This article contributes the DHS (Distributed Hash Sketches) method for this problem setting: a distributed, scalable, efficient, and accurate multiset cardinality estimator. DHS is based on hash sketches for probabilistic counting, but distributes the bits of each counter across network nodes in a judicious manner based on principles of Distributed Hash Tables, paying careful attention to fast access and aggregation as well as update costs. The article discusses various design choices, exhibiting tunable trade-offs between estimation accuracy, hop-count efficiency, and load distribution fairness. We further contribute a full-fledged, publicly available, open-source implementation of all our methods, and a comprehensive experimental evaluation for various settings.