Transitive closure and recursive Datalog implemented on clusters

Authors:
Foto N. Afrati; Jeffrey D. Ullman
Affiliations:
National Technical University of Athens, Greece; Stanford University
Venue:
Proceedings of the 15th International Conference on Extending Database Technology
Year:
2012

Abstract

Implementing recursive algorithms on computing clusters presents a number of new challenges. In particular, we consider the endgame problem: later rounds of a recursion often transfer only small amounts of data, causing high overhead for interprocessor communication. One way to deal with the endgame problem is to use an algorithm that reduces the number of rounds of the recursion. For example, in an application like transitive closure ("TC") there are several recursive-doubling algorithms that use a logarithmic, rather than linear, number of rounds. Unfortunately, recursive-doubling algorithms can deduce many more facts than the linear TC algorithms, which could negate the cost savings from the elimination of the overhead due to the proliferation of small files. We are thus led to consider TC algorithms that, like the linear algorithms, have the unique decomposition property, which ensures that each path is discovered only once. We find that many such algorithms exist, and we show that they are incomparable, in that any of them could prove best on some data, in some cases even lower in cost than the linear algorithms. The recursive-doubling approach to TC extends to other recursions as well. However, it is not acceptable to reduce the number of rounds at the expense of a major increase in the number of facts that are deduced. In this paper, we prove it is possible to implement any Datalog program of right-linear chain rules with a logarithmic number of rounds and no order-of-magnitude increase in the number of facts deduced. On the other hand, there are linear recursions for which the two goals of reducing the number of rounds and maintaining the total number of deduced facts cannot be met simultaneously. We show that the reachability problem cannot be solved in logarithmic rounds without using a binary predicate, thus squaring the number of potential facts to be deduced. We also show that the same-generation recursion cannot be solved in logarithmic rounds without using a predicate of arity three.
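
The trade-off between rounds and deduced facts described in the abstract can be illustrated with a small single-machine sketch in Python. This is not the paper's cluster implementation or any of its specific algorithms; it is a plain semi-naive evaluation of the textbook linear rule `path(x,y) :- path(x,z), edge(z,y)` and the textbook nonlinear (doubling) rule `path(x,y) :- path(x,z), path(z,y)`. The function names `chain`, `linear_tc`, and `doubling_tc` are hypothetical, and counting every join result (duplicates included) as a "derivation" is an assumption made here as a stand-in for the number of facts deduced.

```python
from collections import defaultdict


def chain(n):
    """Edges of a simple directed chain 0 -> 1 -> ... -> n-1 (illustrative input)."""
    return {(i, i + 1) for i in range(n - 1)}


def linear_tc(edges):
    """Semi-naive linear TC: path(x,y) :- path(x,z), edge(z,y).

    Returns (closure, rounds, derivations). Rounds include the final
    round that discovers nothing new; derivations count every join
    result, including rediscoveries of known paths.
    """
    succ = defaultdict(set)
    for u, v in edges:
        succ[u].add(v)
    closure, delta = set(edges), set(edges)
    rounds = derivations = 0
    while delta:
        rounds += 1
        new = set()
        for x, z in delta:
            for y in succ.get(z, ()):  # extend each new path by one edge
                derivations += 1
                if (x, y) not in closure:
                    new.add((x, y))
        closure |= new
        delta = new
    return closure, rounds, derivations


def doubling_tc(edges):
    """Semi-naive nonlinear TC: path(x,y) :- path(x,z), path(z,y).

    Each new path is joined with all known paths on both sides, so path
    lengths can double per round, at the cost of repeated derivations.
    """
    closure, delta = set(edges), set(edges)
    rounds = derivations = 0
    while delta:
        rounds += 1
        succ, pred = defaultdict(set), defaultdict(set)
        for u, v in closure:  # index all known paths by both endpoints
            succ[u].add(v)
            pred[v].add(u)
        new = set()
        for x, z in delta:
            for y in succ.get(z, ()):  # delta(x,z) joined with path(z,y)
                derivations += 1
                if (x, y) not in closure:
                    new.add((x, y))
            for w in pred.get(x, ()):  # path(w,x) joined with delta(x,z)
                derivations += 1
                if (w, z) not in closure:
                    new.add((w, z))
        closure |= new
        delta = new
    return closure, rounds, derivations


if __name__ == "__main__":
    edges = chain(17)
    for name, fn in (("linear", linear_tc), ("doubling", doubling_tc)):
        closure, rounds, derivations = fn(edges)
        print(f"{name}: {len(closure)} paths, {rounds} rounds, {derivations} derivations")
```

On a chain, the linear version derives each path exactly once but needs a number of rounds proportional to the chain length, while the doubling version finishes in a logarithmic number of rounds and rediscovers many paths along the way. That gap is what motivates the abstract's interest in algorithms that keep the logarithmic round count without a large increase in deduced facts.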