Hive: a warehousing solution over a map-reduce framework

Authors:
Ashish Thusoo; Joydeep Sen Sarma; Namit Jain; Zheng Shao; Prasad Chakka; Suresh Anthony; Hao Liu; Pete Wyckoff; Raghotham Murthy
Affiliations:
Facebook Data Infrastructure Team (all nine authors)
Venue:
Proceedings of the VLDB Endowment
Year:
2009

Citing 2

SCOPE: easy and efficient parallel processing of massive data sets

Proceedings of the VLDB Endowment
A comparison of approaches to large-scale data analysis

Proceedings of the 2009 ACM SIGMOD International Conference on Management of data

Cited 148

MapReduce and parallel DBMSs: friends or foes?

Communications of the ACM - Amir Pnueli: Ahead of His Time
Xbase: cloud-enabled information appliance for healthcare

Proceedings of the 13th International Conference on Extending Database Technology
An unobtrusive behavioral model of "gross national happiness"

Proceedings of the SIGCHI Conference on Human Factors in Computing Systems
Boom analytics: exploring data-centric, declarative programming for the cloud

Proceedings of the 5th European conference on Computer systems
HadoopToSQL: a mapReduce query optimizer

Proceedings of the 5th European conference on Computer systems
Distributed indexing of web scale datasets for the cloud

Proceedings of the 2010 Workshop on Massive Data Analytics on the Cloud
SPARQL basic graph pattern processing with iterative MapReduce

Proceedings of the 2010 Workshop on Massive Data Analytics on the Cloud
Nephele/PACTs: a programming model and execution framework for web-scale analytical processing

Proceedings of the 1st ACM symposium on Cloud computing
Towards automatic optimization of MapReduce programs

Proceedings of the 1st ACM symposium on Cloud computing
Integrating hadoop and parallel DBMs

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Ricardo: integrating R and Hadoop

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Online aggregation and continuous query support in MapReduce

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
A Map-Reduce System with an Alternate API for Multi-core Environments

CCGRID '10 Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing
An overview of the Open Science Data Cloud

Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing
Massive Semantic Web data compression with MapReduce

Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing
MapReduce online

NSDI'10 Proceedings of the 7th USENIX conference on Networked systems design and implementation
See spot run: using spot instances for mapreduce workflows

HotCloud'10 Proceedings of the 2nd USENIX conference on Hot topics in cloud computing
ESQP: an efficient SQL query processing for cloud data management

CloudDB '10 Proceedings of the second international workshop on Cloud data management
Benchmarking cloud-based data management systems

CloudDB '10 Proceedings of the second international workshop on Cloud data management
Comparing Hadoop and Fat-Btree based access method for small file I/O applications

WAIM'10 Proceedings of the 11th international conference on Web-age information management
Merging file systems and data bases to fit the grid

Globe'10 Proceedings of the Third international conference on Data management in grid and peer-to-peer systems
Multidimensional arrays for warehousing data on clouds

Globe'10 Proceedings of the Third international conference on Data management in grid and peer-to-peer systems
The performance of MapReduce: an in-depth study

Proceedings of the VLDB Endowment
MRShare: sharing across multiple queries in MapReduce

Proceedings of the VLDB Endowment
Hadoop++: making a yellow elephant run like a cheetah (without it even noticing)

Proceedings of the VLDB Endowment
Cheetah: a high performance, custom data warehouse on top of MapReduce

Proceedings of the VLDB Endowment
Integrating MapReduce and RDBMSs

Proceedings of the 2010 Conference of the Center for Advanced Studies on Collaborative Research
Piccolo: building fast, distributed programs with partitioned tables

OSDI'10 Proceedings of the 9th USENIX conference on Operating systems design and implementation
Online querying of d-dimensional hierarchies

Journal of Parallel and Distributed Computing
Big data and cloud computing: current state and future opportunities

Proceedings of the 14th International Conference on Extending Database Technology
An overview of business intelligence technology

Communications of the ACM
A load-aware scheduler for MapReduce framework in heterogeneous cloud environments

Proceedings of the 2011 ACM Symposium on Applied Computing
A cloud-enabled regional climate model evaluation system

Proceedings of the 2nd International Workshop on Software Engineering for Cloud Computing
An application architecture to facilitate multi-site clinical trial collaboration in the cloud

Proceedings of the 2nd International Workshop on Software Engineering for Cloud Computing
Parallel evaluation of conjunctive queries

Proceedings of the thirtieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
A platform for scalable one-pass analytics using MapReduce

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
CoHadoop: flexible data placement and its exploitation in Hadoop

Proceedings of the VLDB Endowment
An intermediate algebra for optimizing RDF graph pattern matching on MapReduce

ESWC'11 Proceedings of the 8th extended semantic web conference on The semantic web: research and applications - Volume Part II
New ideas track: testing mapreduce-style programs

Proceedings of the 19th ACM SIGSOFT symposium and the 13th European conference on Foundations of software engineering
Brown Dwarf: A fully-distributed, fault-tolerant data warehousing system

Journal of Parallel and Distributed Computing
ETLMR: a highly scalable dimensional ETL framework based on mapreduce

DaWaK'11 Proceedings of the 13th international conference on Data warehousing and knowledge discovery
Data integration over NoSQL stores using access path based mappings

DEXA'11 Proceedings of the 22nd international conference on Database and expert systems applications - Volume Part I
DOT: a matrix model for analyzing, optimizing and deploying software for big data analytics in distributed systems

Proceedings of the 2nd ACM Symposium on Cloud Computing
CoScan: cooperative scan sharing in the cloud

Proceedings of the 2nd ACM Symposium on Cloud Computing
Query optimization for massively parallel data processing

Proceedings of the 2nd ACM Symposium on Cloud Computing
PrIter: a distributed framework for prioritized iterative computations

Proceedings of the 2nd ACM Symposium on Cloud Computing
Comparing high level mapreduce query languages

APPT'11 Proceedings of the 9th international conference on Advanced parallel processing technologies
Building cubes with MapReduce

Proceedings of the ACM 14th international workshop on Data Warehousing and OLAP
Scalable queries for large datasets using cloud computing: a case study

Proceedings of the 15th Symposium on International Database Engineering & Applications
Query optimization using column statistics in hive

Proceedings of the 15th Symposium on International Database Engineering & Applications
Building wavelet histograms on large data in MapReduce

Proceedings of the VLDB Endowment
Efficient processing of RDF graph pattern matching on MapReduce platforms

Proceedings of the second international workshop on Data intensive computing in the clouds
Parallel data processing with MapReduce: a survey

ACM SIGMOD Record
Of hammers and nails: an empirical comparison of three paradigms for processing large graphs

Proceedings of the fifth ACM international conference on Web search and data mining
Executing multiple group by query using mapreduce approach: implementation and optimization

GPC'10 Proceedings of the 5th international conference on Advances in Grid and Pervasive Computing
GLADE: a scalable framework for efficient analytics

ACM SIGOPS Operating Systems Review
Social networking in developing regions

Proceedings of the Fifth International Conference on Information and Communication Technologies and Development
ReStore: reusing results of MapReduce jobs

Proceedings of the VLDB Endowment
Meeting service level objectives of Pig programs

Proceedings of the 2nd International Workshop on Cloud Computing Platforms
Jockey: guaranteed job latency in data parallel clusters

Proceedings of the 7th ACM european conference on Computer Systems
Abstract state machines for data-parallel computing

Conceptual Modelling and Its Theoretical Foundations
The spread of emotion via facebook

Proceedings of the SIGCHI Conference on Human Factors in Computing Systems
V-SMART-join: a scalable mapreduce framework for all-pair similarity joins of multisets and vectors

Proceedings of the VLDB Endowment
What next?: a half-dozen data management research goals for big data and the cloud

PODS '12 Proceedings of the 31st symposium on Principles of Database Systems
Declarative error management for robust data-intensive applications

SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
Oracle in-database hadoop: when mapreduce meets RDBMS

SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
Re-optimizing data-parallel computing

NSDI'12 Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation
Optimizing data shuffling in data-parallel computation by understanding user-defined functions

NSDI'12 Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation
To nest or not to nest, when and how much: representing intermediate results of graph pattern queries in MapReduce based processing

SWIM '12 Proceedings of the 4th International Workshop on Semantic Web Information Management
An optimization framework for map-reduce queries

Proceedings of the 15th International Conference on Extending Database Technology
ComMapReduce: an improvement of mapreduce with lightweight communication mechanisms

DASFAA'12 Proceedings of the 17th international conference on Database Systems for Advanced Applications - Volume Part II
Cost-benefit analysis of an SLA mapping approach for defining standardized Cloud computing goods

Future Generation Computer Systems
Optimizing Completion Time and Resource Provisioning of Pig Programs

CCGRID '12 Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012)
ParaLite: Supporting Collective Queries in Database System to Parallelize User-Defined Executable

CCGRID '12 Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012)
Early accurate results for advanced analytics on MapReduce

Proceedings of the VLDB Endowment
Hybrid cloud support for large scale analytics and web processing

WebApps'12 Proceedings of the 3rd USENIX conference on Web Application Development
Cloud-Centric assured information sharing

PAISI'12 Proceedings of the 2012 Pacific Asia conference on Intelligence and Security Informatics
Towards a hybrid row-column database for a cloud-based medical data management system

Proceedings of the 1st International Workshop on Cloud Intelligence
Opening the black boxes in data flow optimization

Proceedings of the VLDB Endowment
HadoopRDF: a scalable semantic data analytical engine

ICIC'12 Proceedings of the 8th international conference on Intelligent Computing Theories and Applications
M3R: increased performance for in-memory Hadoop jobs

Proceedings of the VLDB Endowment
The vertica analytic database: C-store 7 years later

Proceedings of the VLDB Endowment
Interactive analytical processing in big data systems: a cross-industry study of MapReduce workloads

Proceedings of the VLDB Endowment
Auto-parallelizing stateful distributed streaming applications

Proceedings of the 21st international conference on Parallel architectures and compilation techniques
Automated profiling and resource management of pig programs for meeting service level objectives

Proceedings of the 9th international conference on Autonomic computing
SCOPE: parallel databases meet MapReduce

The VLDB Journal — The International Journal on Very Large Data Bases
Spotting code optimizations in data-parallel pipelines through PeriSCOPE

OSDI'12 Proceedings of the 10th USENIX conference on Operating Systems Design and Implementation
SCALLA: A Platform for Scalable One-Pass Analytics Using MapReduce

ACM Transactions on Database Systems (TODS)
Multimedia Applications and Security in MapReduce: Opportunities and Challenges

Concurrency and Computation: Practice & Experience
HEDC: a histogram estimator for data in the cloud

Proceedings of the fourth international workshop on Cloud data management
Sailfish: a framework for large scale data processing

Proceedings of the Third ACM Symposium on Cloud Computing
Balancing reducer skew in MapReduce workloads using progressive sampling

Proceedings of the Third ACM Symposium on Cloud Computing
On-the-fly task execution for speeding up pipelined mapreduce

Euro-Par'12 Proceedings of the 18th international conference on Parallel Processing
Using clouds for MapReduce measurement assignments

ACM Transactions on Computing Education (TOCE)
Just-in-time data distribution for analytical query processing

ADBIS'12 Proceedings of the 16th East European conference on Advances in Databases and Information Systems
Cogset: a high performance MapReduce engine

Concurrency and Computation: Practice & Experience
Scalable RDF data compression with MapReduce

Concurrency and Computation: Practice & Experience
Towards building a high performance spatial query system for large scale medical imaging data

Proceedings of the 20th International Conference on Advances in Geographic Information Systems
Toward scalable internet traffic measurement and analysis with Hadoop

ACM SIGCOMM Computer Communication Review
Constructing a data accessing layer for in-memory data grid

Proceedings of the Fourth Asia-Pacific Symposium on Internetware
SemanMR: big data processing framework based on semantics

Proceedings of the Fourth Asia-Pacific Symposium on Internetware
MobiS: a distributed paradigm of mobile sensor data analytics for evaluating environmental exposures

Proceedings of the First ACM SIGSPATIAL International Workshop on Mobile Geographic Information Systems
Oozie: towards a scalable workflow management system for Hadoop

Proceedings of the 1st ACM SIGMOD Workshop on Scalable Workflow Execution Engines and Technologies
Turbine: a distributed-memory dataflow engine for extreme-scale many-task applications

Proceedings of the 1st ACM SIGMOD Workshop on Scalable Workflow Execution Engines and Technologies
Tiled-MapReduce: Efficient and Flexible MapReduce Processing on Multicore with Tiling

ACM Transactions on Architecture and Code Optimization (TACO)
Eagle-eyed elephant: split-oriented indexing in Hadoop

Proceedings of the 16th International Conference on Extending Database Technology
Communication steps for parallel query processing

Proceedings of the 32nd symposium on Principles of database systems
Cumulon: optimizing statistical data analysis in the cloud

Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
Simulation of database-valued markov chains using SimSQL

Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
TimeStream: reliable stream computation in the cloud

Proceedings of the 8th ACM European Conference on Computer Systems
Optimus: a dynamic rewriting framework for data-parallel execution plans

Proceedings of the 8th ACM European Conference on Computer Systems
BlinkDB: queries with bounded errors and bounded response times on very large data

Proceedings of the 8th ACM European Conference on Computer Systems
Issues in big data testing and benchmarking

Proceedings of the Sixth International Workshop on Testing Database Systems
Exploiting in-network processing for big data management

Proceedings of the 2013 SIGMOD/PODS Ph.D. Symposium
Early experiences in using a domain-specific language for large-scale graph analysis

First International Workshop on Graph Data Management Experiences and Systems
On benchmarking online social media analytical queries

First International Workshop on Graph Data Management Experiences and Systems
Reference representation techniques for large models

Proceedings of the Workshop on Scalability in Model Driven Engineering
Large-scale computation not at the cost of expressiveness

HotOS'13 Proceedings of the 14th USENIX conference on Hot Topics in Operating Systems
Efficient social network data query processing on MapReduce

Proceedings of the 5th ACM workshop on HotPlanet
EMF modeling in traffic surveillance experiments

Proceedings of the Modelling of the Physical World Workshop
Cache conscious star-join in MapReduce environments

Proceedings of the 2nd International Workshop on Cloud Intelligence
Distributed data management using MapReduce

ACM Computing Surveys (CSUR)
MRPacker: an SQL to mapreduce optimizer

Proceedings of the 22nd ACM international conference on Information & knowledge management
Performance Modeling and Optimization of Deadline-Driven Pig Programs

ACM Transactions on Autonomous and Adaptive Systems (TAAS)
The family of mapreduce and large-scale data processing systems

ACM Computing Surveys (CSUR)
Natjam: design and evaluation of eviction policies for supporting priorities and deadlines in mapreduce clusters

Proceedings of the 4th annual Symposium on Cloud Computing
Demonstration of Hadoop-GIS: a spatial data warehousing system over MapReduce

Proceedings of the 21st ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems
Representing mapreduce optimisations in the nested relational calculus

BNCOD'13 Proceedings of the 29th British National conference on Big Data
PonIC: using stratosphere to speed up pig analytics

Euro-Par'13 Proceedings of the 19th international conference on Parallel Processing
MR-runner: a modularized map-reduce job management tool

Proceedings of the 5th Asia-Pacific Symposium on Internetware
CRUCIBLE: towards unified secure on- and off-line analytics at scale

DISCS-2013 Proceedings of the 2013 International Workshop on Data-Intensive Scalable Computing Systems
Piranha: optimizing short jobs in Hadoop

Proceedings of the VLDB Endowment
Hadoop GIS: a high performance spatial data warehousing system over mapreduce

Proceedings of the VLDB Endowment
Scuba: diving into data at facebook

Proceedings of the VLDB Endowment
Unicorn: a system for searching the social graph

Proceedings of the VLDB Endowment
Medical data management in the SYSEO project

ACM SIGMOD Record
Efficient query evaluation on distributed graphs with Hadoop environment

Proceedings of the Fourth Symposium on Information and Communication Technology
Simplifying Scalable Graph Processing with a Domain-Specific Language

Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization
Instant loading for main memory databases

Proceedings of the VLDB Endowment
Implementation of data affinity-based distributed parallel processing on a distributed key value store

Proceedings of the 8th International Conference on Ubiquitous Information Management and Communication
ComMapReduce: An improvement of MapReduce with lightweight communication mechanisms

Data & Knowledge Engineering
Run-time performance optimization of a BigData query language

Proceedings of the 5th ACM/SPEC international conference on Performance engineering
SHadoop: Improving MapReduce performance by optimizing job execution mechanism in Hadoop clusters

Journal of Parallel and Distributed Computing
Exploiting inter-operation parallelism for matrix chain multiplication using MapReduce

The Journal of Supercomputing
SeaCloudDM: a database cluster framework for managing and querying massive heterogeneous sensor sampling data

The Journal of Supercomputing
Order matters! Harnessing a world of orderings for reasoning over massive data

Semantic Web
Turbine: A Distributed-memory Dataflow Engine for High Performance Many-task Applications

Fundamenta Informaticae - Scalable Workflow Enactment Engines and Technology
A platform for eXtreme analytics

IBM Journal of Research and Development

Abstract

The size of data sets collected and analyzed in industry for business intelligence is growing rapidly, making traditional warehousing solutions prohibitively expensive. Hadoop [3] is a popular open-source map-reduce implementation that is being used as an alternative to store and process extremely large data sets on commodity hardware. However, the map-reduce programming model is very low-level and requires developers to write custom programs that are hard to maintain and reuse.