User Defined Partitioning - Group Data Based on Computation Model

Authors:
Qiming Chen; Meichun Hsu
Affiliations:
HP Labs, Hewlett Packard Co., Palo Alto, USA; HP Labs, Hewlett Packard Co., Palo Alto, USA
Venue:
DaWaK '08 Proceedings of the 10th international conference on Data Warehousing and Knowledge Discovery
Year:
2008

Abstract

A technical trend in supporting large-scale scientific applications is the convergence of data-intensive computation and data management, for fast data access and reduced data flow. In a combined cluster platform, co-locating computation and data is the key to efficiency and scalability; to make this happen, data must be partitioned in a way consistent with the computation model. However, with current parallel database technology, data partitioning is primarily used to support flat parallel computing and is based on existing partition key values. For a given application, when the data scopes of function executions are determined by a high-level concept that is related to the application semantics but not present in the original data, there are no appropriate partition keys for grouping the data.

Aiming at application-aware data partitioning, we introduce the notion of User Defined Data Partitioning (UDP). UDP differs from the usual data partitioning methods in that it does not rely on existing partition key values, but extracts or generates them from the original data in a labeling process. The novelty of UDP is that it allows data partitioning to be based on application-level concepts, both to match the data access scoping of the targeted computation model and to support parallel computing based on a data dependency graph.

We applied this approach to architect a hydro-informatics system supporting periodic, near-real-time, data-intensive hydrologic computation on a database cluster. Our experimental results reveal its power in tightly coupling data partitioning with "pipelined" parallel computing in the presence of data processing dependencies.
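
The labeling idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the record layout, the `label_by_basin` function, the basin names, the longitude threshold, and the round-robin placement are all assumptions made up for illustration. The sketch only shows the general pattern of generating a partition key that is absent from the stored data and then co-locating all records that share that key.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical record layout: (station_id, longitude, latitude, reading).
Record = Tuple[str, float, float, float]


def label_by_basin(record: Record) -> str:
    """Hypothetical labeling function.

    Derives an application-level key (a made-up "basin" name) that is not
    stored in the record itself, here from the longitude alone.
    """
    _, lon, _, _ = record
    return "west_basin" if lon < -100.0 else "east_basin"


def user_defined_partition(
    records: Iterable[Record],
    label: Callable[[Record], str],
    num_nodes: int,
) -> Dict[int, Dict[str, List[Record]]]:
    """Group records by generated label and place each group on one node.

    Labels are assigned to nodes round-robin in sorted order (an assumed,
    deliberately simple placement policy), so every record sharing a label
    lands on the same node.
    """
    groups: Dict[str, List[Record]] = defaultdict(list)
    for rec in records:
        groups[label(rec)].append(rec)

    placement: Dict[int, Dict[str, List[Record]]] = {}
    for i, key in enumerate(sorted(groups)):
        placement.setdefault(i % num_nodes, {})[key] = groups[key]
    return placement


# Made-up sample data for illustration only.
sample_records: List[Record] = [
    ("s1", -120.5, 38.0, 1.2),
    ("s2", -80.1, 35.5, 0.7),
    ("s3", -110.0, 40.2, 2.4),
]

if __name__ == "__main__":
    for node, groups in user_defined_partition(
        sample_records, label_by_basin, num_nodes=2
    ).items():
        print(node, {k: [r[0] for r in v] for k, v in groups.items()})
```

In a real deployment the labeling function would encode whatever application concept scopes the computation, and the placement policy would be chosen to respect the data processing dependencies; both are left open here.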