A generic parallel processing model for facilitating data mining and integration

Authors:
Liangxiu Han;Chee Sun Liew;Jano van Hemert;Malcolm Atkinson
Affiliations:
School of Informatics, University of Edinburgh, 10 Crichton Street, Edinburgh EH8 9AB, UK;School of Informatics, University of Edinburgh, 10 Crichton Street, Edinburgh EH8 9AB, UK and Faculty of Computer Science and Information Technology, University of Malaya, 50603 Kuala Lumpur, Mala ...;School of Informatics, University of Edinburgh, 10 Crichton Street, Edinburgh EH8 9AB, UK;School of Informatics, University of Edinburgh, 10 Crichton Street, Edinburgh EH8 9AB, UK
Venue:
Parallel Computing
Year:
2011

Abstract

To facilitate data mining and integration (DMI) processes in a generic way, we investigate a parallel pipeline streaming model. We model a DMI task as a streaming data-flow graph: a directed acyclic graph (DAG) of Processing Elements (PEs). The composition mechanism links PEs via data streams, which may be held in memory, buffered on disk, or carried as inter-computer data flows. This makes it possible to build arbitrary DAGs that combine pipelining with both data and task parallelism, leaving room for performance enhancement. We have applied this approach to a real DMI case in the life sciences and implemented a prototype. To demonstrate the feasibility of the modelled DMI task and assess the efficiency of the prototype, we have also built a performance evaluation model. The experimental results show that, in this case study, speedup is linear in the number of distributed computing nodes.
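
As an illustration only, the streaming data-flow idea in the abstract can be sketched in a few lines of Python. This is not the authors' prototype: the names `source`, `transform`, `sink`, `run_pipeline` and the `END` marker are hypothetical, the streams are in-memory bounded queues only (no disk buffering or inter-computer flows), and the PEs run as threads in one process. The sketch shows a small DAG in which one source PE feeds several identical transform PEs (data parallelism) that all stream into one sink PE, with every stage running concurrently (pipelining).

```python
import queue
import threading

END = object()  # hypothetical end-of-stream marker, not from the paper


def source(items, out, n_consumers):
    """Source PE: write every item to the output stream, then close it."""
    for item in items:
        out.put(item)
    for _ in range(n_consumers):
        out.put(END)  # one end marker per downstream consumer


def transform(func, inp, out):
    """Transform PE: apply func to each item read from the input stream."""
    while True:
        item = inp.get()
        if item is END:
            out.put(END)  # propagate end-of-stream downstream
            return
        out.put(func(item))


def sink(inp, n_producers, results):
    """Sink PE: collect items until every upstream producer has finished."""
    finished = 0
    while finished < n_producers:
        item = inp.get()
        if item is END:
            finished += 1
        else:
            results.append(item)


def run_pipeline(items, func, n_workers=2):
    """Wire source -> n_workers parallel transforms -> sink and run it."""
    s1 = queue.Queue(maxsize=8)  # bounded in-memory streams between PEs
    s2 = queue.Queue(maxsize=8)
    results = []
    threads = [threading.Thread(target=source, args=(items, s1, n_workers))]
    threads += [
        threading.Thread(target=transform, args=(func, s1, s2))
        for _ in range(n_workers)
    ]
    threads.append(threading.Thread(target=sink, args=(s2, n_workers, results)))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results  # arrival order depends on thread scheduling


if __name__ == "__main__":
    print(sorted(run_pipeline(range(10), lambda x: x * x, n_workers=3)))
```

Because the transform PEs share one input stream, the order in which results reach the sink is not deterministic, so the usage line sorts them before printing. Threads in CPython illustrate the structure of the graph rather than real speedup for CPU-bound work; a distributed deployment of the kind the abstract evaluates would place PEs on separate computing nodes.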