To facilitate data mining and integration (DMI) processes in a generic way, we investigate a parallel pipeline streaming model. We model a DMI task as a streaming data-flow graph: a directed acyclic graph (DAG) of Processing Elements (PEs). The composition mechanism links PEs via data streams, which may be held in memory, buffered on disk, or carried as data flows between computers. This makes it possible to build arbitrary DAGs that combine pipelining with both data and task parallelism, leaving room for performance improvement. We have applied this approach to a real DMI case in the life sciences and implemented a prototype. To demonstrate the feasibility of the modelled DMI task and to assess the efficiency of the prototype, we have also built a performance evaluation model. In this case study, the experimental results show linear speedup as the number of distributed computing nodes increases.
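As an illustration only, the composition of PEs via in-memory data streams could be sketched as follows. This is not the prototype's code: the names `PE`, `run_pipeline`, and `buffer_size` are hypothetical, the graph is restricted to a linear chain rather than an arbitrary DAG, and bounded queues stand in for the data streams. Each PE runs in its own thread, so downstream PEs can start work before upstream PEs have finished, which is the pipelining the model relies on.

```python
import queue
import threading

_END = object()  # sentinel that marks the end of a stream


class PE(threading.Thread):
    """Hypothetical Processing Element: applies fn to every item on its input stream."""

    def __init__(self, fn, inbox, outbox):
        super().__init__()
        self.fn, self.inbox, self.outbox = fn, inbox, outbox

    def run(self):
        while True:
            item = self.inbox.get()
            if item is _END:
                self.outbox.put(_END)  # propagate end-of-stream downstream
                return
            self.outbox.put(self.fn(item))


def run_pipeline(items, fns, buffer_size=4):
    """Chain one PE per function in fns and return the collected outputs.

    Assumption: a linear chain of PEs linked by bounded in-memory queues.
    No error handling: a PE function that raises would stall the pipeline.
    """
    streams = [queue.Queue(maxsize=buffer_size) for _ in range(len(fns) + 1)]
    pes = [PE(fn, streams[i], streams[i + 1]) for i, fn in enumerate(fns)]
    for pe in pes:
        pe.start()

    def feed():
        # Feed from a separate thread so bounded queues cannot block the collector.
        for item in items:
            streams[0].put(item)
        streams[0].put(_END)

    feeder = threading.Thread(target=feed)
    feeder.start()

    results = []
    while True:
        out = streams[-1].get()
        if out is _END:
            break
        results.append(out)

    feeder.join()
    for pe in pes:
        pe.join()
    return results


if __name__ == "__main__":
    print(run_pipeline(range(5), [lambda x: x + 1, lambda x: x * x]))
```

Because each stage has a single worker and the queues are first-in, first-out, output order matches input order. Data parallelism would need several PE instances sharing one input stream, and a disk-buffered or inter-computer stream would replace `queue.Queue` with a different transport; neither is shown here.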