Towards a data-centric internet

Authors:
Leonidas Galanis;David Johns Dewitt
Affiliations:
-;-
Venue:
Towards a data-centric internet
Year:
2004

Abstract

This thesis develops techniques for scalable data-centric distributed systems. The first part presents techniques for the ad hoc formation of networks of data sources that support scalable query processing. To quantify the performance of existing flooding-style query processing relative to the proposed techniques, a P2P system was built and evaluated. A node joins an existing P2P network by contacting any node already in the network. When a flooding-style query processing strategy is followed, nodes exchange no data-specific information while joining. Otherwise, a joining node provides summary information about its data, which is forwarded to all other peers. This information is stored in special structures, the peer indices, and is used to determine the relevant data sources for a query. Experiments show that peer indices on the nodes are necessary for good performance. The second part presents the Catalog Service, which maps queries to data sources and is based on Distributed Hash Tables (DHTs). Peers provide catalog information when they join the network. In the case of XML repositories, each peer provides, for each element (and attribute), a list of the paths that lead to it and an optional value summary. Thus, given an XPath query and the catalog information, one can determine the data sources that must be accessed to process the query. Additionally, request load-balancing methods are presented that make the Catalog Service scalable. The third and final part explores distributed aggregate computations in data-centric P2P networks. Aggregate queries are important in large distributed systems because they summarize large amounts of distributed data. When an aggregate query is popular among the peers, data sources receive a large number of identical requests, which limits scalability. To address this problem, the design of the Aggregation Layer is presented, which assigns the maintenance of aggregate computations to peers. Each maintainer acts as a computation hub: it receives updates from the data sources whenever data changes, and it answers query requests. Simulation and system experiments demonstrate the scalability and feasibility of the Aggregation Layer.
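
The catalog lookup described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the class name `Catalog`, the helper `xpath_to_regex`, the peer names, and the sample paths are all hypothetical. The DHT is simulated by hashing each element name to one of a few in-memory buckets, the optional value summaries are omitted, and only simple XPath queries made of child (`/`) and descendant (`//`) steps are handled.

```python
import hashlib
import re
from collections import defaultdict


def xpath_to_regex(query):
    """Translate a simple XPath (only '/' and '//' steps) into a regex over paths."""
    pieces = []
    for token in re.findall(r"//|/|[^/]+", query):
        if token == "//":
            pieces.append(r"/(?:[^/]+/)*")   # any number of intermediate steps
        elif token == "/":
            pieces.append("/")
        else:
            pieces.append(re.escape(token))  # element or attribute name
    return re.compile("".join(pieces))


class Catalog:
    """Toy stand-in for a DHT-backed catalog (assumption: one dict per node)."""

    def __init__(self, num_nodes=4):
        # Each bucket maps a full path to the set of peers that hold it.
        self.buckets = [defaultdict(set) for _ in range(num_nodes)]

    def _bucket(self, element):
        # Hash the element name to pick the responsible node, as a DHT would.
        digest = hashlib.sha1(element.encode("utf-8")).hexdigest()
        return self.buckets[int(digest, 16) % len(self.buckets)]

    def register(self, peer, paths):
        # On join, a peer publishes every path under the name of its last step.
        for path in paths:
            element = path.rsplit("/", 1)[-1]
            self._bucket(element)[path].add(peer)

    def lookup(self, query):
        # Only the node responsible for the query's last step is contacted.
        element = query.rsplit("/", 1)[-1]
        pattern = xpath_to_regex(query)
        peers = set()
        for path, holders in self._bucket(element).items():
            if pattern.fullmatch(path):
                peers |= holders
        return peers


catalog = Catalog()
catalog.register("peerA", ["/bib/book/author", "/bib/book/title"])
catalog.register("peerB", ["/bib/article/author"])
catalog.register("peerC", ["/store/item/title"])

print(sorted(catalog.lookup("//author")))       # ['peerA', 'peerB']
print(sorted(catalog.lookup("//book/author")))  # ['peerA']
```

Keying the catalog by the last step of a path means a query touches a single catalog node regardless of how many peers exist, which is the property that lets a query skip irrelevant data sources. A popular element name would concentrate requests on one node, which is the kind of hot spot the request load-balancing methods mentioned in the abstract are meant to address.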