Principal component analysis for distributed data sets with updating

Authors:
Zheng-Jian Bai; Raymond H. Chan; Franklin T. Luk
Affiliations:
Department of Mathematics, Chinese University of Hong Kong, Shatin, NT, Hong Kong, China; Department of Mathematics, Chinese University of Hong Kong, Shatin, NT, Hong Kong, China; Department of Computer Science, Rensselaer Polytechnic Institute, Troy, New York
Venue:
APPT'05 Proceedings of the 6th international conference on Advanced Parallel Processing Technologies
Year:
2005

Citing 3

The Spectral Decomposition of Nonsymmetric Matrices on Distributed Memory Parallel Computers (SIAM Journal on Scientific Computing)
Distributed clustering using collective principal component analysis (Knowledge and Information Systems)
Principal Direction Divisive Partitioning (Data Mining and Knowledge Discovery)

Cited 2

Decomposable principal component analysis (IEEE Transactions on Signal Processing)
Distributed static linear Gaussian models using consensus (Neural Networks)

Abstract

Identifying patterns in large data sets is a key requirement in data mining. Principal component analysis (PCA) is a powerful technique for this purpose. PCA-based clustering algorithms are effective when all the data reside in one location. In applications where large data sets are physically far apart, moving huge amounts of data to a single location can be impractical or even impossible. A way around this problem was proposed in [10], where truncated singular value decompositions (SVDs) are computed locally and used to reduce communication costs. Unfortunately, truncated SVDs introduce local approximation errors that can accumulate and degrade the accuracy of the final PCA. In this paper, we introduce a new method that computes the PCA without incurring local approximation errors. In addition, we consider updating the PCA when new data arrive at the various locations.
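
The abstract does not describe the construction itself, so the following is only an illustrative sketch of one well-known way to combine per-site results without local approximation error. It is not the paper's algorithm. It assumes each site sends its row count, mean vector, and mean-centered scatter matrix, and that these summaries are merged exactly at a central location. The same merge step handles newly arrived data. The names `Summary`, `summarize`, `merge`, and `leading_component` are hypothetical and chosen for this example only.

```python
from dataclasses import dataclass
from typing import List

Vector = List[float]
Matrix = List[List[float]]


@dataclass
class Summary:
    """Exact sufficient statistics for PCA over one block of rows."""
    n: int           # number of observations in the block
    mean: Vector     # column means
    scatter: Matrix  # sum of outer products of the mean-centered rows


def summarize(rows: Matrix) -> Summary:
    """Run locally at a site; only the returned summary is transmitted."""
    n, p = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(p)]
    scatter = [[sum((r[i] - mean[i]) * (r[j] - mean[j]) for r in rows)
                for j in range(p)] for i in range(p)]
    return Summary(n, mean, scatter)


def merge(a: Summary, b: Summary) -> Summary:
    """Combine two summaries exactly (no truncation is involved)."""
    n = a.n + b.n
    d = [mb - ma for ma, mb in zip(a.mean, b.mean)]
    mean = [ma + di * b.n / n for ma, di in zip(a.mean, d)]
    w = a.n * b.n / n  # weight of the between-block correction term
    p = len(d)
    scatter = [[a.scatter[i][j] + b.scatter[i][j] + w * d[i] * d[j]
                for j in range(p)] for i in range(p)]
    return Summary(n, mean, scatter)


def leading_component(s: Summary, iters: int = 200) -> Vector:
    """First principal direction by power iteration on the scatter matrix.

    Sketch only: assumes a nonzero scatter matrix and a start vector that
    is not orthogonal to the leading eigenvector.
    """
    p = len(s.mean)
    v = [1.0] * p
    for _ in range(iters):
        w = [sum(s.scatter[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v


# Two sites summarize their own rows; new rows later arrive at site B.
site_a = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9]]
site_b = [[4.0, 8.2], [5.0, 9.8]]
new_at_b = [[6.0, 12.1]]

combined = merge(summarize(site_a), summarize(site_b))
combined = merge(combined, summarize(new_at_b))  # update with new data
direction = leading_component(combined)
```

Each site transmits a summary whose size depends on the number of columns rather than the number of rows, so this sketch suits data with many observations and comparatively few variables. Dividing the merged scatter matrix by `n - 1` gives the sample covariance matrix, whose eigenvectors are the principal components.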