Query Processing in Parallel Relational Database Systems
Query Processing in Parallel Relational Database Systems
Optimizing data aggregation for cluster-based internet services
Proceedings of the ninth ACM SIGPLAN symposium on Principles and practice of parallel programming
SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles
Interpreting the data: Parallel analysis with Sawzall
Scientific Programming - Dynamic Grids and Worldwide Computing
Map-reduce-merge: simplified relational data processing on large clusters
Proceedings of the 2007 ACM SIGMOD international conference on Management of data
MapReduce: simplified data processing on large clusters
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Dryad: distributed data-parallel programs from sequential building blocks
Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
Bigtable: a distributed storage system for structured data
OSDI '06 Proceedings of the 7th symposium on Operating systems design and implementation
Pig latin: a not-so-foreign language for data processing
Proceedings of the 2008 ACM SIGMOD international conference on Management of data
A comparison of approaches to large-scale data analysis
Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
Distributed data-parallel computing using a high-level programming language
Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
Query interactions in database workloads
Proceedings of the Second International Workshop on Testing Database Systems
Inferring Dataflow Properties of User Defined Table Processors
SAS '09 Proceedings of the 16th International Symposium on Static Analysis
Scaling-Up and Speeding-Up Video Analytics Inside Database Engine
DEXA '09 Proceedings of the 20th International Conference on Database and Expert Systems Applications
Extend UDF Technology for Integrated Analytics
DaWaK '09 Proceedings of the 11th International Conference on Data Warehousing and Knowledge Discovery
Efficiently support MapReduce-like computation models inside parallel DBMS
IDEAS '09 Proceedings of the 2009 International Database Engineering & Applications Symposium
MapReduce and parallel DBMSs: friends or foes?
Communications of the ACM - Amir Pnueli: Ahead of His Time
Distributed aggregation for data-parallel computing: interfaces and implementations
Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles
The nature of data center traffic: measurements & analysis
Proceedings of the 9th ACM SIGCOMM conference on Internet measurement conference
Composing and executing parallel data-flow graphs with shell pipes
Proceedings of the 4th Workshop on Workflows in Support of Large-Scale Science
Nephele: efficient parallel data processing in the cloud
Proceedings of the 2nd Workshop on Many-Task Computing on Grids and Supercomputers
Proceedings of the VLDB Endowment
Building a high-level dataflow system on top of Map-Reduce: the Pig experience
Proceedings of the VLDB Endowment
Hive: a warehousing solution over a map-reduce framework
Proceedings of the VLDB Endowment
Mining document collections to facilitate accurate approximate entity matching
Proceedings of the VLDB Endowment
HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads
Proceedings of the VLDB Endowment
Using word-sense disambiguation methods to classify web queries by intent
EMNLP '09 Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 3 - Volume 3
Cloud-TM: harnessing the cloud with distributed transactional memories
ACM SIGOPS Operating Systems Review
Harnessing input redundancy in a MapReduce framework
Proceedings of the 2010 ACM Symposium on Applied Computing
Towards scalable RDF graph analytics on MapReduce
Proceedings of the 2010 Workshop on Massive Data Analytics on the Cloud
FlumeJava: easy, efficient data-parallel pipelines
PLDI '10 Proceedings of the 2010 ACM SIGPLAN conference on Programming language design and implementation
Stateful bulk processing for incremental analytics
Proceedings of the 1st ACM symposium on Cloud computing
Comet: batched stream processing for data intensive distributed computing
Proceedings of the 1st ACM symposium on Cloud computing
Nephele/PACTs: a programming model and execution framework for web-scale analytical processing
Proceedings of the 1st ACM symposium on Cloud computing
Indexing multi-dimensional data in a cloud system
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Integrating hadoop and parallel DBMs
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
A comparison of join algorithms for log processing in MaPreduce
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Large graph processing in the cloud
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Toward a cost-effective cloud storage service
ICACT'10 Proceedings of the 12th international conference on Advanced communication technology
DryadInc: reusing work in large-scale computations
HotCloud'09 Proceedings of the 2009 conference on Hot topics in cloud computing
HotOS'09 Proceedings of the 12th conference on Hot topics in operating systems
Volley: automated data placement for geo-distributed cloud services
NSDI'10 Proceedings of the 7th USENIX conference on Networked systems design and implementation
OSDI'08 Proceedings of the 8th USENIX conference on Operating systems design and implementation
ESQP: an efficient SQL query processing for cloud data management
CloudDB '10 Proceedings of the second international workshop on Cloud data management
A large scale ranker-based system for search query spelling correction
COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics
Scalable clustering algorithm for N-body simulations in a shared-nothing cluster
SSDBM'10 Proceedings of the 22nd international conference on Scientific and statistical database management
Experience in extending query engine for continuous analytics
DaWaK'10 Proceedings of the 12th international conference on Data warehousing and knowledge discovery
Comparing Hadoop and Fat-Btree based access method for small file I/O applications
WAIM'10 Proceedings of the 11th international conference on Web-age information management
Merging file systems and data bases to fit the grid
Globe'10 Proceedings of the Third international conference on Data management in grid and peer-to-peer systems
Multidimensional arrays for warehousing data on clouds
Globe'10 Proceedings of the Third international conference on Data management in grid and peer-to-peer systems
Dremel: interactive analysis of web-scale datasets
Proceedings of the VLDB Endowment
The performance of MapReduce: an in-depth study
Proceedings of the VLDB Endowment
MRShare: sharing across multiple queries in MapReduce
Proceedings of the VLDB Endowment
Hadoop++: making a yellow elephant run like a cheetah (without it even noticing)
Proceedings of the VLDB Endowment
DataGarage: warehousing massive performance data on commodity servers
Proceedings of the VLDB Endowment
Nectar: automatic management of data and computation in datacenters
OSDI'10 Proceedings of the 9th USENIX conference on Operating systems design and implementation
Reining in the outliers in map-reduce clusters using Mantri
OSDI'10 Proceedings of the 9th USENIX conference on Operating systems design and implementation
The case for object databases in cloud data management
ICOODB'10 Proceedings of the Third international conference on Objects and databases
Online querying of d-dimensional hierarchies
Journal of Parallel and Distributed Computing
Map-reduce extensions and recursive queries
Proceedings of the 14th International Conference on Extending Database Technology
RanKloud: a scalable ranked query processing framework on hadoop
Proceedings of the 14th International Conference on Extending Database Technology
Dremel: interactive analysis of web-scale datasets
Communications of the ACM
ASTERIX: towards a scalable, semistructured data platform for evolving-world models
Distributed and Parallel Databases
CIEL: a universal execution engine for distributed data-flow computing
Proceedings of the 8th USENIX conference on Networked systems design and implementation
Sharing the data center network
Proceedings of the 8th USENIX conference on Networked systems design and implementation
Parallel evaluation of conjunctive queries
Proceedings of the thirtieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Fast personalized PageRank on MapReduce
Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
A platform for scalable one-pass analytics using MapReduce
Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Automated partitioning design in parallel database systems
Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Providing scalable database services on the cloud
WISE'10 Proceedings of the 11th international conference on Web information systems engineering
G2: a graph processing system for diagnosing distributed systems
USENIXATC'11 Proceedings of the 2011 USENIX conference on USENIX annual technical conference
Regularized latent semantic indexing
Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
GBASE: a scalable and general graph management system
Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining
Brown Dwarf: A fully-distributed, fault-tolerant data warehousing system
Journal of Parallel and Distributed Computing
CloudFuice: a flexible cloud-based data integration system
ICWE'11 Proceedings of the 11th international conference on Web engineering
ETLMR: a highly scalable dimensional ETL framework based on mapreduce
DaWaK'11 Proceedings of the 13th international conference on Data warehousing and knowledge discovery
Mining large distributed log data in near real time
SLAML '11 Managing Large-scale Systems via the Analysis of System Logs and the Application of Machine Learning Techniques
Continuous data stream query in the cloud
Proceedings of the 20th ACM international conference on Information and knowledge management
Continuous access to cloud event services with event pipe queries
OTM'11 Proceedings of the 2011th Confederated international conference on On the move to meaningful internet systems - Volume Part II
Extend core UDF framework for GPU-enabled analytical query evaluation
Proceedings of the 15th Symposium on International Database Engineering & Applications
Building wavelet histograms on large data in MapReduce
Proceedings of the VLDB Endowment
Parallel data processing with MapReduce: a survey
ACM SIGMOD Record
Case study of scientific data processing on a cloud using hadoop
HPCS'09 Proceedings of the 23rd international conference on High Performance Computing Systems and Applications
Scalable splitting of massive data streams
DASFAA'10 Proceedings of the 15th international conference on Database Systems for Advanced Applications - Volume Part II
GLADE: a scalable framework for efficient analytics
ACM SIGOPS Operating Systems Review
Meeting service level objectives of Pig programs
Proceedings of the 2nd International Workshop on Cloud Computing Platforms
Jockey: guaranteed job latency in data parallel clusters
Proceedings of the 7th ACM european conference on Computer Systems
MadLINQ: large-scale distributed matrix computation for the cloud
Proceedings of the 7th ACM european conference on Computer Systems
PerfXplain: debugging MapReduce job performance
Proceedings of the VLDB Endowment
Abstract state machines for data-parallel computing
Conceptual Modelling and Its Theoretical Foundations
Mining for insights in the search engine query stream
Proceedings of the 21st international conference companion on World Wide Web
The HaLoop approach to large-scale iterative data analysis
The VLDB Journal — The International Journal on Very Large Data Bases
What next?: a half-dozen data management research goals for big data and the cloud
PODS '12 Proceedings of the 31st symposium on Principles of Database Systems
Advanced partitioning techniques for massively distributed computation
SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
Declarative error management for robust data-intensive applications
SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
Recurring job optimization in scope
SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
PACMan: coordinated memory caching for parallel jobs
NSDI'12 Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation
Re-optimizing data-parallel computing
NSDI'12 Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation
Optimizing data shuffling in data-parallel computation by understanding user-defined functions
NSDI'12 Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation
Inside "Big Data management": ogres, onions, or parfaits?
Proceedings of the 15th International Conference on Extending Database Technology
Clydesdale: structured data processing on MapReduce
Proceedings of the 15th International Conference on Extending Database Technology
An optimization framework for map-reduce queries
Proceedings of the 15th International Conference on Extending Database Technology
Efficient SPARQL query processing in mapreduce through data partitioning and indexing
APWeb'12 Proceedings of the 14th Asia-Pacific international conference on Web Technologies and Applications
Putting a "big-data" platform to good use: training kinect
Proceedings of the 21st international symposium on High-Performance Parallel and Distributed Computing
Scope playback: self-validation in the cloud
DBTest '12 Proceedings of the Fifth International Workshop on Testing Database Systems
Optimizing Completion Time and Resource Provisioning of Pig Programs
CCGRID '12 Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012)
FunSQL: it is time to make SQL functional
Proceedings of the 2012 Joint EDBT/ICDT Workshops
Opening the black boxes in data flow optimization
Proceedings of the VLDB Endowment
Spinning fast iterative data flows
Proceedings of the VLDB Endowment
Stream-join revisited in the context of epoch-based SQL continuous query
Proceedings of the 16th International Database Engineering & Applications Sysmposium
Optimization of analytic data flows for next generation business intelligence applications
TPCTC'11 Proceedings of the Third TPC Technology conference on Topics in Performance Evaluation, Measurement and Characterization
Automated profiling and resource management of pig programs for meeting service level objectives
Proceedings of the 9th international conference on Autonomic computing
gbase: an efficient analysis platform for large graphs
The VLDB Journal — The International Journal on Very Large Data Bases
SCOPE: parallel databases meet MapReduce
The VLDB Journal — The International Journal on Very Large Data Bases
Spotting code optimizations in data-parallel pipelines through PeriSCOPE
OSDI'12 Proceedings of the 10th USENIX conference on Operating Systems Design and Implementation
SCALLA: A Platform for Scalable One-Pass Analytics Using MapReduce
ACM Transactions on Database Systems (TODS)
Scripting distributed scientific workflows using Weaver
Concurrency and Computation: Practice & Experience
Coflow: a networking abstraction for cluster applications
Proceedings of the 11th ACM Workshop on Hot Topics in Networks
Bridging the tenant-provider gap in cloud services
Proceedings of the Third ACM Symposium on Cloud Computing
Regularized Latent Semantic Indexing: A New Approach to Large-Scale Topic Modeling
ACM Transactions on Information Systems (TOIS)
Cogset: a high performance MapReduce engine
Concurrency and Computation: Practice & Experience
Towards building a high performance spatial query system for large scale medical imaging data
Proceedings of the 20th International Conference on Advances in Geographic Information Systems
Maguro, a system for indexing and searching over very large text collections
Proceedings of the sixth ACM international conference on Web search and data mining
Invisible loading: access-driven data transfer from raw files into database systems
Proceedings of the 16th International Conference on Extending Database Technology
Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
Simulation of database-valued markov chains using SimSQL
Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
Shark: SQL and rich analytics at scale
Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
TimeStream: reliable stream computation in the cloud
Proceedings of the 8th ACM European Conference on Computer Systems
Proceedings of the 16th International ACM Sigsoft symposium on Component-based software engineering
Effective straggler mitigation: attack of the clones
nsdi'13 Proceedings of the 10th USENIX conference on Networked Systems Design and Implementation
Rhea: automatic filtering for unstructured cloud storage
nsdi'13 Proceedings of the 10th USENIX conference on Networked Systems Design and Implementation
Leveraging endpoint flexibility in data-intensive clusters
Proceedings of the ACM SIGCOMM 2013 conference on SIGCOMM
A characteristic study on failures of production distributed data-parallel programs
Proceedings of the 2013 International Conference on Software Engineering
Proceedings of the 22nd international conference on World Wide Web
Large-scale computation not at the cost of expressiveness
HotOS'13 Proceedings of the 14th USENIX conference on Hot Topics in Operating Systems
Performance Modeling and Optimization of Deadline-Driven Pig Programs
ACM Transactions on Autonomous and Adaptive Systems (TAAS)
The family of mapreduce and large-scale data processing systems
ACM Computing Surveys (CSUR)
Scale-up vs scale-out for Hadoop: time to rethink?
Proceedings of the 4th annual Symposium on Cloud Computing
Apache Hadoop YARN: yet another resource negotiator
Proceedings of the 4th annual Symposium on Cloud Computing
MR-runner: a modularized map-reduce job management tool
Proceedings of the 5th Asia-Pacific Symposium on Internetware
UpSizeR: Synthetically scaling an empirical relational database
Information Systems
Hadoop GIS: a high performance spatial data warehousing system over mapreduce
Proceedings of the VLDB Endowment
Optimized data management for e-learning in the clouds towards Cloodle
Proceedings of the Fourth Symposium on Information and Communication Technology
The Journal of Supercomputing
Shroud: ensuring private access to large-scale data in the data center
FAST'13 Proceedings of the 11th USENIX conference on File and Storage Technologies
JovianDATA: a multidimensional database for the cloud
Proceedings of the 17th International Conference on Management of Data
GRASS: trimming stragglers in approximation analytics
NSDI'14 Proceedings of the 11th USENIX Conference on Networked Systems Design and Implementation
Hi-index | 0.02 |
Companies providing cloud-scale services have an increasing need to store and analyze massive data sets such as search logs and click streams. For cost and performance reasons, processing is typically done on large clusters of shared-nothing commodity machines. It is imperative to develop a programming model that hides the complexity of the underlying system but provides flexibility by allowing users to extend functionality to meet a variety of requirements. In this paper, we present a new declarative and extensible scripting language, SCOPE (Structured Computations Optimized for Parallel Execution), targeted for this type of massive data analysis. The language is designed for ease of use with no explicit parallelism, while being amenable to efficient parallel execution on large clusters. SCOPE borrows several features from SQL. Data is modeled as sets of rows composed of typed columns. The select statement is retained with inner joins, outer joins, and aggregation allowed. Users can easily define their own functions and implement their own versions of operators: extractors (parsing and constructing rows from a file), processors (row-wise processing), reducers (group-wise processing), and combiners (combining rows from two inputs). SCOPE supports nesting of expressions but also allows a computation to be specified as a series of steps, in a manner often preferred by programmers. We also describe how scripts are compiled into efficient, parallel execution plans and executed on large clusters.