Dimension reduction and visualization of large high-dimensional data via interpolation

Authors:
Seung-Hee Bae;Jong Youl Choi;Judy Qiu;Geoffrey C. Fox
Affiliations:
Indiana University, Bloomington, IN;Indiana University, Bloomington, IN;Indiana University, Bloomington, IN;Indiana University, Bloomington, IN
Venue:
Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing
Year:
2010


Visualization

Abstract

The recent explosion of publicly available biological gene sequences and chemical compounds offers an unprecedented opportunity for data mining. To make analysis feasible for such large, high-dimensional scientific data, we apply high-performance dimension reduction algorithms, which facilitate the investigation of unknown structures in a three-dimensional visualization. Among the known dimension reduction algorithms, we use multidimensional scaling (MDS) and generative topographic mapping (GTM) to configure the given high-dimensional data in the target dimension. However, both algorithms require large amounts of physical memory as well as computational resources. We therefore propose an interpolation approach that uses the mapping of only a subset of the given data, which effectively reduces computational complexity. At the cost of a minor loss in approximation quality, the interpolation method makes it possible to process millions of data points with modest computation and memory requirements. Because the data volume is large, we also show how to parallelize the proposed interpolation algorithms. Evaluating the interpolated MDS by the STRESS criterion requires a symmetric all-pairwise computation in which each process holds only a subset of the required data, so we also propose a simple but efficient parallel mechanism for that computation. Our experimental results show that the quality of the interpolated mappings is comparable to that of mappings produced by the original algorithms alone. In terms of parallel performance, the interpolation methods parallelize well, with high efficiency. With the proposed interpolation method, we construct a configuration of two million out-of-sample data points in the target dimension, and the number of out-of-sample points can be increased further.
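
The idea of placing out-of-sample points against an already-mapped subset can be illustrated with a small sketch. The code below is not the authors' algorithm, which the abstract does not spell out. It is a generic stand-in that assumes a point is positioned by plain gradient descent on the squared distance mismatch to its k nearest in-sample neighbors. The function names (`euclid`, `raw_stress`, `interpolate_point`) and the parameters `k`, `steps`, and `lr` are hypothetical choices made for this illustration. `raw_stress` computes the unweighted raw STRESS, the sum over pairs i < j of (d_ij - delta_ij)^2, where delta is the original dissimilarity and d is the distance in the mapped space.

```python
import math


def euclid(a, b):
    """Euclidean distance between two equal-length coordinate sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def raw_stress(points_hd, points_ld):
    """Unweighted raw STRESS: sum over i < j of (d_ij - delta_ij)^2."""
    total = 0.0
    n = len(points_hd)
    for i in range(n):
        for j in range(i + 1, n):
            delta = euclid(points_hd[i], points_hd[j])  # original dissimilarity
            d = euclid(points_ld[i], points_ld[j])      # mapped distance
            total += (d - delta) ** 2
    return total


def interpolate_point(new_hd, sample_hd, sample_ld, k=3, steps=500, lr=0.05):
    """Place one out-of-sample point using only its k nearest in-sample
    neighbors (assumed scheme: gradient descent on the local mismatch)."""
    nearest = sorted((euclid(new_hd, s), i) for i, s in enumerate(sample_hd))[:k]
    deltas = [d for d, _ in nearest]
    anchors = [sample_ld[i] for _, i in nearest]
    dim = len(anchors[0])
    # Start from the centroid of the neighbors' mapped positions.
    x = [sum(p[j] for p in anchors) / len(anchors) for j in range(dim)]
    for _ in range(steps):
        grad = [0.0] * dim
        for p, delta in zip(anchors, deltas):
            r = euclid(x, p)
            if r < 1e-12:
                continue  # gradient is undefined exactly on top of an anchor
            coef = 2.0 * (r - delta) / r
            for j in range(dim):
                grad[j] += coef * (x[j] - p[j])
        x = [x[j] - lr * grad[j] for j in range(dim)]
    return x


# Four in-sample points lying in a plane of 3-D space, with an exact 2-D mapping.
sample_hd = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]
sample_ld = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
placed = interpolate_point((0.2, 0.3, 0.0), sample_hd, sample_ld)
```

Because each new point depends only on the fixed in-sample mapping, the per-point work in this sketch is independent of every other out-of-sample point, which is the property that makes this style of interpolation straightforward to distribute across processes.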