Counting at Large: Efficient Cardinality Estimation in Internet-Scale Data Networks

Authors:
Nikos Ntarmos;Peter Triantafillou;Gerhard Weikum
Affiliations:
University of Patras, Greece;University of Patras, Greece;M.P.I.I., Germany
Venue:
ICDE '06 Proceedings of the 22nd International Conference on Data Engineering
Year:
2006

Abstract

Counting in general, and estimating the cardinality of (multi-) sets in particular, is highly desirable for a large variety of applications and is a foundational building block for the efficient deployment of, and access to, emerging internet-scale information systems. Examples of such applications range from optimizing query access plans in internet-scale databases to evaluating the significance (rank/score) of various data items in information retrieval applications. The key constraints that any acceptable solution must satisfy are: (i) efficiency: the number of nodes that need to be contacted for counting purposes must be small, to keep latency and bandwidth requirements low; (ii) scalability, seemingly contradicting the efficiency goal: arbitrarily large numbers of nodes may need to add elements to a (multi-) set, which dictates the need for a highly distributed solution that avoids server-based scalability, bottleneck, and availability problems; (iii) access and storage load balancing: counting and related overhead chores should be distributed fairly across the nodes of the network; (iv) accuracy: cardinality estimation should be tunable, robust (in the presence of dynamics and failures), and highly accurate; (v) simplicity and ease of integration: special, solution-specific indexing structures should be avoided. In this paper, we first contribute a highly distributed, scalable, efficient, and accurate (multi-) set cardinality estimator. Subsequently, we show how to use our solution to build and maintain histograms, which have been a basic building block for query optimization in centralized databases, thereby facilitating their porting into the realm of internet-scale data networks.
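The abstract does not describe the estimator itself, so the following is only a generic illustration of duplicate-insensitive cardinality estimation, not the paper's distributed protocol. It is a minimal, centralized sketch in the style of Flajolet-Martin probabilistic counting with stochastic averaging. The class name `HashSketch`, its parameters, and the choice of SHA-1 as the hash function are assumptions made for this example.

```python
import hashlib

PHI = 0.77351  # correction constant from Flajolet-Martin probabilistic counting


class HashSketch:
    """Illustrative hash sketch (hypothetical name, not taken from the paper)."""

    def __init__(self, num_bitmaps=64, bits=32):
        self.m = num_bitmaps            # more bitmaps -> lower estimation error
        self.bits = bits                # length of each bitmap
        self.bitmaps = [0] * num_bitmaps

    def _hash(self, item):
        # Assumption: SHA-1 is used as a convenient pseudo-uniform hash.
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, item):
        h = self._hash(item)
        index = h % self.m              # choose one bitmap
        rest = h // self.m              # remaining bits pick the bit position
        # Position of the least-significant 1 bit: position r has probability 2^-(r+1).
        rho = (rest & -rest).bit_length() - 1 if rest else self.bits - 1
        self.bitmaps[index] |= 1 << min(rho, self.bits - 1)

    def merge(self, other):
        # Bitwise OR of two sketches yields the sketch of the union of their inputs.
        if (self.m, self.bits) != (other.m, other.bits):
            raise ValueError("sketches must share the same parameters")
        merged = HashSketch(self.m, self.bits)
        merged.bitmaps = [a | b for a, b in zip(self.bitmaps, other.bitmaps)]
        return merged

    def estimate(self):
        # Average, over all bitmaps, the position of the lowest unset bit.
        total = 0
        for bitmap in self.bitmaps:
            r = 0
            while bitmap & (1 << r):
                r += 1
            total += r
        # Note: this estimator is biased upward for very small cardinalities.
        return self.m / PHI * 2 ** (total / self.m)
```

Two properties of this kind of structure relate to the constraints listed in the abstract. Adding the same element twice leaves the sketch unchanged, so duplicates in a multiset do not inflate the count of distinct elements. Sketches built independently can also be combined with `merge`, so no single node has to see every element. How such state would be placed on, and retrieved from, the nodes of a network is outside the scope of this example.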