Parallel database processing on a 100 Node PC cluster: cases for decision support query processing and data mining

Authors:
Takayuki Tamura;Masato Oguchi;Masaru Kitsuregawa
Affiliations:
The University of Tokyo, 7-22-1 Roppongi, Minato-ku, Tokyo 106, Japan;The University of Tokyo, 7-22-1 Roppongi, Minato-ku, Tokyo 106, Japan;The University of Tokyo, 7-22-1 Roppongi, Minato-ku, Tokyo 106, Japan
Venue:
SC '97 Proceedings of the 1997 ACM/IEEE conference on Supercomputing
Year:
1997

Abstract

We developed a PC cluster system consisting of 100 PCs. Each PC employs a 200 MHz Pentium Pro CPU and is connected to the others through an ATM switch. We selected two kinds of data-intensive applications: decision support query processing and data mining, specifically association rule mining.

As a high-speed network, ATM technology has recently become a de facto standard. While other high-performance network standards are also available, ATM networks are widely used in settings ranging from local area to widely distributed environments. One problem with ATM networks is their high latency, in contrast to their high bandwidth. This is usually considered a serious flaw of ATM for building high-performance massively parallel processors. However, applications such as large-scale database analyses are insensitive to communication latency and require only bandwidth.

On the other hand, the performance of personal computers is increasing rapidly, while the prices of PCs continue to fall much faster than those of workstations. The 200 MHz Pentium Pro CPU is competitive in integer performance with the processor chips found in workstations. Although it is still weak at floating-point operations, these are not frequently used in database applications.

Thus, by combining PCs and ATM switches, we can construct a large-scale parallel platform very easily and very inexpensively. In this paper, we examine how such a system can help data warehouse processing, which currently runs on expensive high-end mainframes and/or workstation servers.

In our first experiment, we used the most complex query of the standard benchmark, TPC-D, on a 100 GB database to evaluate the system against commercial parallel systems. Our PC cluster exhibited much higher performance than those in current TPC benchmark reports. Second, we parallelized association rule mining and ran large-scale data mining on the PC cluster. Sufficiently high linearity was obtained.

Thus we believe that such commodity-based PC clusters will play a very important role in large-scale database processing.