Scalable mining on emerging architectures

  • Authors:
  • Srinivasan Parthasarathy; Gregory Buehrer

  • Affiliations:
  • The Ohio State University; The Ohio State University

  • Venue:
  • Ph.D. Dissertation, The Ohio State University
  • Year:
  • 2008

Abstract

Recent advances in data collection technology have generated large-scale data stores. Examples include the Internet, scientific simulation results, and government identification databases. Current estimates of the size of the Internet, which is a loosely structured public database, are 20 billion pages and 250 billion links. Data mining is the process of discovering interesting and previously unknown information from such stores. Techniques effective on small data sets simply do not scale to larger data sets. Thus, as the ability to collect and store data increases, our ability to make use of the data degrades. This degradation is due to two main issues. First, the utility of efficient serial algorithms for mining such data is often lowered because the data is distributed across multiple machines. Second, since the complexity of most mining algorithms is polynomial or even exponential in the input size, runtimes exceed practical limits even on parallel machines.

This dissertation addresses the challenge of mining large data stores by improving the state of the art in two directions. First, mining algorithms are restructured to glean the benefits of emerging commodity hardware. Second, we identify a set of useful patterns and formulate algorithms for extracting them in log-linear time, enabling scalable performance on large datasets.

We make several contributions toward data mining on emerging commodity hardware. First, we leverage parallel disks, memory bandwidth, and a large number of processors to mine for exact global patterns in terascale distributed data sets, providing a 10-fold improvement over the existing state of the art in distributed mining. Second, we leverage the improved computational throughput of emerging CMPs to provide nearly linear scale-ups for shared-memory parallel structure mining. Third, we explore data mining on the Cell processor, re-architecting clustering, classification, and outlier detection to glean significant runtime improvements. These improvements are afforded by the high floating point throughput of this emerging processor. We also show examples where the Cell processor requires more compute time than competing technologies, primarily due to its high latency when exchanging small chunks of data with main memory.

We also identify an important class of itemset patterns that can be extracted in log-linear time, improving significantly on the scalability typically afforded by frequent itemset mining algorithms. The algorithm proceeds by first hashing the global data set into partitions of highly similar items, and then mining the localized sets with a heuristic, single projection. We then show how this technique can be used to compress large graphs, improving on the state of the art in web compression. Finally, we leverage the techniques developed to build a general-purpose placement service for distributed data mining centers. The placement has low complexity and affords highly scalable solutions to many common data mining queries.
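
The abstract only sketches the log-linear mining step (hash the data into partitions of highly similar items, then mine each partition with a single heuristic projection). The Python sketch below is one way to picture that flow; it assumes min-hashing over each item's transaction list as the similarity hash, and the function names, the toy hash family, and the single-projection heuristic are illustrative assumptions rather than the dissertation's actual implementation.

```python
# Hypothetical sketch of "hash into partitions of similar items, then mine
# each partition locally". Min-hashing is an assumed choice of similarity
# hash; the dissertation does not specify its hash function in the abstract.
import random
from collections import defaultdict

def minhash_signature(tid_set, hash_funcs):
    """Min-hash signature of the set of transaction ids containing an item."""
    return tuple(min(h(t) for t in tid_set) for h in hash_funcs)

def partition_items(transactions, num_hashes=2, seed=0):
    """Group items whose transaction lists collide under the same signature."""
    rng = random.Random(seed)
    # Toy universal-style hash family (illustrative assumption only).
    params = [(rng.randrange(1, 2**31 - 1), rng.randrange(0, 2**31 - 1))
              for _ in range(num_hashes)]
    hash_funcs = [lambda x, a=a, b=b: (a * x + b) % (2**31 - 1) for a, b in params]

    # Invert the data: item -> set of transaction ids that contain it.
    tids = defaultdict(set)
    for tid, txn in enumerate(transactions):
        for item in txn:
            tids[item].add(tid)

    partitions = defaultdict(list)
    for item, tid_set in tids.items():
        partitions[minhash_signature(tid_set, hash_funcs)].append(item)
    return partitions, tids

def mine_partition(items, tids, min_support):
    """Heuristic single projection: test the support of the whole local
    itemset and of each single item, instead of enumerating the lattice."""
    results = []
    common = set.intersection(*(tids[i] for i in items))
    if len(items) > 1 and len(common) >= min_support:
        results.append((tuple(sorted(items)), len(common)))
    for i in items:
        if len(tids[i]) >= min_support:
            results.append(((i,), len(tids[i])))
    return results

if __name__ == "__main__":
    txns = [{"a", "b", "c"}, {"a", "b", "c"}, {"a", "b"}, {"d", "e"}, {"d", "e"}]
    parts, tids = partition_items(txns)
    for sig, items in parts.items():
        for pattern, support in mine_partition(items, tids, min_support=2):
            print(pattern, support)
```

Grouping items whose transaction lists collide under the min-hash signature is what keeps the per-partition work small; the dissertation's algorithm achieves log-linear time overall, which this toy sketch illustrates but does not claim to reproduce.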