An Architecture for Fast Processing of Large Unstructured Data Sets

Authors:
Mark Franklin;Roger Chamberlain;Michael Henrichs;Berkley Shands;Jason White
Affiliations:
Washington University in St. Louis;Washington University in St. Louis/ Data Search Systems, Inc., St. Louis, MO;Data Search Systems, Inc., St. Louis, MO;Washington University in St. Louis;Data Search Systems, Inc., St. Louis, MO
Venue:
ICCD '04 Proceedings of the IEEE International Conference on Computer Design
Year:
2004

Citing 0
Cited 7

Biosequence Similarity Search on the Mercury System
Journal of VLSI Signal Processing Systems

Application development on hybrid systems
Proceedings of the 2007 ACM/IEEE conference on Supercomputing

Visions for application development on hybrid computing systems
Parallel Computing

FPGA: what's in it for a database?
Proceedings of the 2009 ACM SIGMOD International Conference on Management of data

Application-guided tool development for architecturally diverse computation
Proceedings of the 2010 ACM Symposium on Applied Computing

Auto-pipe and the X language: a pipeline design tool and description language
IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing

Data deduplication in a hybrid architecture for improving write performance
Proceedings of the 3rd International Workshop on Runtime and Operating Systems for Supercomputers

Abstract

This paper presents a general system architecture tailored to searching, filtering, compression, encryption, and other operations on unstructured data streaming from a disk system. The system achieves high performance on such applications through parallelism, application-specific hardware specialization and reconfiguration, and placement of hardware near the disk systems. A limited prototype of a single compute node has been implemented and is described. The prototype is tailored to applications involving complex searching, and its performance is compared with that of a pure software implementation having the same search capabilities. Performance is evaluated as a function of data set size, query string hit rate, and query complexity. The results indicate that, for data set sizes above 1.4 MB, the prototype compute node is between one and two orders of magnitude faster than the pure software implementation. For large data sets, a single node achieves speedups of about 200 and a sustained throughput of 300 MB/sec.
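
To make the evaluation metrics named in the abstract concrete, the following is a minimal sketch of a software-only streaming search with helpers for throughput and speedup. It is not the paper's baseline or prototype. The names `count_hits`, `timed_scan`, `throughput_mb_per_s`, and `speedup` are hypothetical, exact substring matching stands in for the "complex searching" the paper refers to, and 1 MB is assumed to mean 10^6 bytes.

```python
import io
import time


def count_hits(stream, query, chunk_size=1 << 16):
    """Count occurrences (overlapping ones included) of `query` in a byte stream.

    The stream is read in fixed-size chunks. The last len(query) - 1 bytes of
    each buffer are carried forward so that a match straddling a chunk boundary
    is still found, and found only once.
    Returns (hits, total_bytes_read).
    """
    if not query:
        raise ValueError("query must be non-empty")
    hits = 0
    total = 0
    tail = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        total += len(chunk)
        buf = tail + chunk
        pos = buf.find(query)
        while pos != -1:
            hits += 1
            pos = buf.find(query, pos + 1)
        # Keep just enough bytes to complete a match that began in this buffer.
        tail = buf[-(len(query) - 1):] if len(query) > 1 else b""
    return hits, total


def throughput_mb_per_s(nbytes, seconds):
    """Sustained throughput in MB/sec (assumption: 1 MB = 10**6 bytes)."""
    return nbytes / 1e6 / seconds


def speedup(baseline_seconds, accelerated_seconds):
    """Ratio of baseline run time to accelerated run time."""
    return baseline_seconds / accelerated_seconds


def timed_scan(stream, query, chunk_size=1 << 16):
    """Run count_hits and return (hits, total_bytes, elapsed_seconds)."""
    start = time.perf_counter()
    hits, total = count_hits(stream, query, chunk_size)
    return hits, total, time.perf_counter() - start
```

A measurement along the lines of the abstract would call `timed_scan` on data sets of varying size, with queries of varying hit rate, and compare the elapsed time against that of an accelerated implementation using `speedup`.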