Optimizing Large-Scale Semi-Naïve Datalog Evaluation in Hadoop

Authors:
Marianne Shaw;Paraschos Koutris;Bill Howe;Dan Suciu
Affiliations:
University of Washington;University of Washington;University of Washington;University of Washington
Venue:
Datalog 2.0'12: Proceedings of the Second International Conference on Datalog in Academia and Industry
Year:
2012

Abstract

We explore the design and implementation of a scalable Datalog system using Hadoop as the underlying runtime system. Observing that several successful projects provide a relational algebra-based programming interface to Hadoop, we argue that a natural extension is to add recursion to support scalable social network analysis, internet traffic analysis, and general graph queries. We implement semi-naïve evaluation in Hadoop, then apply a series of optimizations, ranging from fundamental changes to the Hadoop infrastructure to basic configuration guidelines, that collectively offer a 10x improvement in our experiments. This work lays the foundation for a more comprehensive cost-based algebraic optimization framework for parallel recursive Datalog queries.
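
For readers unfamiliar with the technique named in the abstract, the following is a minimal single-machine sketch of semi-naïve evaluation. It is not the authors' Hadoop implementation. It assumes the textbook transitive-closure program `Reach(x, y) :- Edge(x, y).` and `Reach(x, z) :- Reach(x, y), Edge(y, z).`, and the function and variable names are hypothetical. Each round joins only the facts that were new in the previous round (the delta) against `Edge`, rather than re-joining every fact derived so far.

```python
def transitive_closure_semi_naive(edges):
    """Evaluate the assumed example program with semi-naive iteration:

        Reach(x, y) :- Edge(x, y).
        Reach(x, z) :- Reach(x, y), Edge(y, z).
    """
    edges = set(edges)

    # Index Edge by source node so the join on y is a dictionary lookup.
    successors = {}
    for src, dst in edges:
        successors.setdefault(src, set()).add(dst)

    reach = set(edges)  # every Reach fact derived so far
    delta = set(edges)  # Reach facts that were new in the previous round

    while delta:
        # Join only the delta with Edge; older facts were already joined
        # in earlier rounds and cannot produce anything new.
        derived = {(x, z) for (x, y) in delta for z in successors.get(y, ())}
        # Keep only genuinely new facts as the next delta.
        delta = derived - reach
        reach |= delta

    return reach


if __name__ == "__main__":
    # A three-edge chain: 1 -> 2 -> 3 -> 4.
    print(sorted(transitive_closure_semi_naive([(1, 2), (2, 3), (3, 4)])))
```

The loop stops at the first round that derives no new facts, so it also terminates on cyclic graphs. In a MapReduce setting each round would typically be expressed as one or more jobs (a join followed by a set difference), but that mapping is an assumption here and is not described in the abstract.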